A tutorial at ECML-PKDD 2020
Peter Flach, University of Bristol, UK, Peter.Flach@bristol.ac.uk, www.cs.bris.ac.uk/~flach/
Miquel Perello-Nieto, University of Bristol, UK, miquel.perellonieto@bristol.ac.uk, https://www.perellonieto.com/
Hao Song, University of Bristol, UK, hao.song@bristol.ac.uk
Meelis Kull, University of Tartu, Estonia, meelis.kull@ut.ee
Telmo Silva Filho, Federal University of Paraiba, Brazil, telmo@de.ufpb.br
Abstract
This tutorial introduces fundamental concepts in classifier calibration and gives an overview of recent progress in the enhancement and evaluation of calibration methods. Participants will learn why some training algorithms produce calibrated probability estimates and others don't, and how to apply post-hoc calibration techniques to improve probability estimates in theory and in practice, the latter in a section dedicated to hands-on explanations. Participants will furthermore learn how to test whether a classifier's outputs are calibrated, and how to assess and evaluate probabilistic classifiers using a range of evaluation metrics and exploratory graphical tools. Additionally, participants will obtain a basic appreciation of the more abstract perspective provided by proper scoring rules, and learn about related topics and some open problems in the field.
Description
This tutorial aims to provide guidance on how to evaluate models from a calibration perspective and how to correct some of the distortions found in a classifier's output probabilities/scores. We will cover calibrated estimates of the posterior distribution, post-hoc calibration techniques, calibration evaluation and some related advanced topics. The main intended learning outcomes include the following. Participants will:
- understand the main advantages of calibrated classifiers, particularly in relation to changing misclassification costs and changing class priors;
- learn the major definitions of calibrated outputs in the field, as well as how they relate to one another;
- understand why some training algorithms produce calibrated probability estimates and others don't, and be able to apply calibration techniques in post-processing;
- have a grasp of basic methods to evaluate probabilistic classifiers and be able to use graphical tools such as reliability diagrams and cost curves to analyse their performance in more detail;
- be introduced to a range of established and recently developed techniques to quickly obtain better calibrated results from trained models;
- learn how to use available calibration tools, and the steps needed to train and evaluate a calibrated model, in a hands-on fashion;
- learn about a few advanced and related topics and open problems, such as alternative views of calibration and other forms of uncertainty.
The tutorial will include practical demonstrations of some of the material by means of Jupyter Notebooks which will be made available online to participants in advance.
This tutorial will benefit machine learning researchers of different abilities and experience. PhD students and machine learning novices will profit from a gentle introduction to classifier calibration and achieve a better understanding of why good classifier scores matter. Only basic machine learning knowledge is expected (at the level of the textbooks by Mitchell, Witten & Frank, or Flach, among others). More experienced machine learning researchers, who may already be familiar with the more basic material on calibration, will benefit from the comprehensive perspective that the tutorial provides, and perhaps be encouraged to tackle some of the open problems in their own research.
This tutorial is relevant to the ECML-PKDD community: previous work related to calibration has been published and presented at past editions of the conference [16, 18], including a best paper award at ECML-PKDD 2014 for the paper on reliability maps by Kull and Flach [16]. Calibration and uncertainty quantification are also receiving growing attention at other major ML/AI conferences, such as ICML, NeurIPS (see figure below) and AISTATS, demonstrating a growing interest in the interpretability of classification model outputs as a basis for better-informed decisions.
Outline
This is a three-and-a-half-hour tutorial divided into five sections, with the final section devoted to a recap and a discussion of open problems. The planned schedule and main contents are given in the following table.
Time    Topics covered                                   Presenter
45min   1) The concept of calibration                    Peter Flach
45min   2) Evaluation metrics and proper scoring rules   Telmo Silva Filho
30min   BREAK
60min   3) Calibrators                                   Hao Song
30min   4) Hands-on                                      Miquel Perello-Nieto
30min   5) Advanced topics and conclusion                Peter Flach
The five sections are described in more detail in the following paragraphs.
1) The concept of calibration: We start by introducing the concept of calibration. A predictive model is well-calibrated if its predictions correspond to observed distributions in the data. In particular, a probabilistic classifier is well-calibrated if, among the instances receiving a predicted probability vector p, the class distribution is approximately given by p. This section will cover different notions of calibration and how calibration can help with optimal decision making; illustrate some possible sources of miscalibration by means of simple examples; define the binary and multiclass scenarios together with corresponding visualisations; demonstrate how to obtain calibrated probabilities with simple techniques, such as binning methods; and introduce different notions of multiclass calibration, from the weakest to the strongest (confidence-calibrated, classwise-calibrated and multiclass-calibrated).
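As a concrete illustration of the binning idea mentioned above (our own sketch, not part of the tutorial material; all names are ours), the following NumPy snippet groups a classifier's confidence scores into equal-width bins and compares each bin's mean confidence with its empirical accuracy. For a confidence-calibrated classifier, the two should roughly agree in every bin:

```python
import numpy as np

def binned_calibration(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins and compare
    each bin's mean confidence with its empirical accuracy."""
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    stats = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            stats.append((confidences[mask].mean(),
                          correct[mask].mean(),
                          mask.sum()))
    return stats

# Simulate an over-confident classifier: its true accuracy grows more
# slowly than its stated confidence.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=10000)
correct = rng.uniform(size=10000) < (0.5 + 0.8 * (conf - 0.5))

for mean_conf, acc, n in binned_calibration(conf, correct):
    print(f"confidence {mean_conf:.2f}  accuracy {acc:.2f}  (n={n})")
```

The higher bins reveal the over-confidence: reported confidence exceeds the observed accuracy, exactly the gap a reliability diagram visualises.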
2) Evaluation metrics and proper scoring rules: Here, participants will learn how to evaluate the quality of classifier outputs from the calibration perspective. We introduce a range of losses, starting from classification losses (e.g. accuracy) and ending with proper losses (e.g. Brier score). We will show that proper losses can be decomposed into calibration, refinement and other losses. We then explain the different versions of the expected calibration error (ECE), showing how they correspond to the various levels of calibration and how they relate to the visualisation tools introduced in Section 1. We end this section with hypothesis tests for calibration, where the null hypothesis is that the scores given by a model are already calibrated.
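To make the two families of metrics concrete, here is a minimal sketch (our own illustrative code, not the tutorial's implementation) of the multiclass Brier score and the most common confidence-based variant of ECE:

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared error between predicted probability vectors and
    one-hot encoded labels; a proper scoring rule."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

def expected_calibration_error(probs, labels, n_bins=10):
    """Confidence ECE: weighted average gap between mean confidence and
    accuracy over equal-width bins of the top predicted probability."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

# A toy classifier that predicts [0.8, 0.2] for every instance and is
# right 80% of the time is confidence-calibrated:
probs = np.tile([0.8, 0.2], (10, 1))
labels = np.array([0] * 8 + [1] * 2)
print(brier_score(probs, labels))                 # ≈ 0.32
print(expected_calibration_error(probs, labels))  # ≈ 0 (calibrated)
```

Note that the Brier score of the toy classifier is non-zero even though its ECE vanishes: a proper loss also penalises lack of refinement, which is exactly the decomposition discussed in this section.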
3) Calibrators: This section introduces both well-known and recently developed state-of-the-art techniques to improve the level of calibration, as well as practical details of their application. The techniques are organised into two categories: (1) non-parametric approaches, which can particularly benefit from large training sets; and (2) parametric approaches, which are relatively fast to learn and apply, and show good performance. Established calibration methods include logistic calibration and the ROC convex hull method (also known as pair-adjacent-violators or isotonic regression), while recently introduced calibration methods include beta calibration, which is designed for probabilistic binary classifiers; Dirichlet calibration, the natural extension of beta calibration to the multiclass scenario; and temperature scaling, vector scaling and matrix scaling, which were designed particularly for deep neural networks. We conclude this section with general advice on the application of different calibration methods, including regularisation techniques.
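As an example of a parametric approach, temperature scaling fits a single parameter T that rescales the logits before the softmax. The sketch below is our own minimal version; practical implementations typically fit T by gradient descent on a validation set, whereas here we use SciPy's bounded scalar minimiser for brevity:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels):
    """Fit the single parameter T of temperature scaling by minimising
    the negative log-likelihood of softmax(logits / T) on held-out data."""
    def nll(t):
        p = softmax(logits / t)
        return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

# Simulate an over-confident network: labels are drawn from softmax(z),
# but the network reports logits 3 * z (too sharp by a factor of 3).
rng = np.random.default_rng(42)
z = rng.normal(size=(5000, 3))
p_true = softmax(z)
u = rng.uniform(size=(5000, 1))
labels = (p_true.cumsum(axis=1) < u).sum(axis=1)

T = fit_temperature(3.0 * z, labels)  # recovered T should be close to 3
```

Because temperature scaling has a single parameter, it cannot change the ranking of the classes; the richer vector, matrix and Dirichlet maps covered in this section relax exactly that restriction.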
4) Hands-on course: This section consists of a hands-on course covering existing Python packages and implementations of calibration techniques, supported by a series of Jupyter Notebooks that participants can follow or run themselves. The material will be made available beforehand and announced during the break, either for download or to be run online with Google Colab. The content focuses on a full pipeline for training and evaluating classifiers and calibrators for neural and non-neural models, and on the process of producing statistical comparisons of calibration methods across several datasets. It also covers visualisation tools that provide better insights into the strengths and weaknesses of uncalibrated classifiers and their calibrated counterparts (e.g. reliability diagrams in a multiclass scenario).
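For readers who want a head start before the hands-on session, the backbone of such a pipeline can be put together with scikit-learn's CalibratedClassifierCV. The snippet below is an illustrative sketch with synthetic data, not the actual notebook material:

```python
# CalibratedClassifierCV wraps a base classifier and learns a post-hoc
# calibration map (sigmoid or isotonic) via internal cross-validation.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Naive Bayes is a classic example of a poorly calibrated classifier.
raw = GaussianNB().fit(X_tr, y_tr)
iso = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=3)
iso.fit(X_tr, y_tr)

for name, model in [("uncalibrated", raw), ("isotonic", iso)]:
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name:12s} Brier score: {brier_score_loss(y_te, p):.4f}")
```

On this synthetic task the isotonic-calibrated model should obtain the lower (better) Brier score, since the redundant features violate the naive Bayes independence assumption and make its raw outputs over-confident.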
5) Advanced topics: To conclude the tutorial, we will discuss open problems in calibration, and recent methods that may lead to innovative solutions. This includes the cost-sensitive perspective as an alternative view of calibration, with different scoring rules giving rise to different cost-based assumptions. We will also briefly discuss calibration for regression and other related tasks in uncertainty quantification, such as detecting out-of-distribution (OOD) samples and decomposing errors into epistemic and aleatoric components.
Presenters
While the presenters are based at three different institutions in as many countries, they have a well-established and ongoing track record of working together. They also all have good to very close familiarity with the ECML-PKDD conference series.
Peter Flach (Peter.Flach@bristol.ac.uk) presents Sections 1 (Introduction) and 5 (Conclusion). He is Professor of Artificial Intelligence at the University of Bristol and has over 25 years' experience in machine learning and data mining, with particular expertise in mining highly structured data and in the evaluation and improvement of machine learning models using ROC analysis and associated tools. He was PC co-chair of KDD'09 and ECML-PKDD'12 and authored Machine Learning: The Art and Science of Algorithms that Make Sense of Data (Cambridge University Press, 2012, mlbook.cs.bris.ac.uk), which has to date sold about 15,000 copies and has been translated into Russian, Mandarin and Japanese. Since 2010 he has been Editor-in-Chief of Machine Learning. He is a Fellow of the Alan Turing Institute and President of the European Association for Data Science. He has taught tutorials on inductive logic programming, ROC analysis and machine learning at ACML, ECAI, ECML-PKDD, ICML, UAI, and various summer schools. His current Google Scholar profile (https://scholar.google.com/citations?user=o9ggd4sAAAAJ) lists over 250 publications with over 11,000 citations and an h-index of 51.
Telmo Silva Filho (telmo@de.ufpb.br) presents Section 2 (Evaluation metrics). He is an adjunct professor at the Department of Statistics of the Federal University of Paraiba (Brazil) and has over 10 years of experience in machine learning and data science, particularly in complex data representations, optimisation, model evaluation and classifier calibration.
Hao Song (hao.song@bristol.ac.uk) presents Section 3 (Calibration methods). He is currently a postdoctoral researcher at the University of Bristol. His research interests are mainly on quantifying different types of uncertainties within the machine learning pipeline, particularly for different kinds of probabilistic outputs and corresponding evaluation metrics.
Miquel Perello-Nieto (miquel.perellonieto@bristol.ac.uk) presents Section 4 (Hands-on). He is a Research Associate at the University of Bristol and has over 8 years' experience in machine learning, artificial intelligence and data mining. He has held research positions for the last 5 years while pursuing a PhD in Computer Science. He founded and has organised the PyData Bristol Meetup for the last 2 years; it currently has ~900 members, and he leads monthly talks and workshops with an attendance of ~100 people per event. His research interests are in the uncertainty evaluation of probabilistic classifiers and its applications to semi-supervised learning and learning in the presence of weak labels.
Meelis Kull (meelis.kull@ut.ee) is not currently planning to attend the conference due to possible calendar conflicts. He will, however, take an active part in the preparation and organisation of the material. He is an associate professor at the University of Tartu, Estonia. His research interests cover topics in machine learning and artificial intelligence. He has recently been working on the evaluation, uncertainty quantification and calibration of machine learning models, and on machine learning methods that tolerate varying deployment contexts.
Previous tutorials
Presenter Peter Flach has given many tutorials, courses and lectures on machine learning, including the following on evaluation methods, ROC analysis, probability estimation and context-aware knowledge discovery (presented with Meelis Kull and others). These are related, but not identical, to the present proposal, which has not been presented in this form before.
- ICML'04 tutorial: The Many Faces of ROC Analysis in Machine Learning: http://www.cs.bris.ac.uk/~flach/ICML04tutorial/ (69 Google Scholar citations)
- UAI'07 tutorial: ROC Analysis for Ranking and Probability Estimation: http://www.auai.org/uai2007/tutorials.html#roc
- ECAI'12 tutorial: Unity in Diversity: The Breadth and Depth of Machine Learning Explained for AI Researchers: http://www.lirmm.fr/ecai2012/index.php?option=com_content&view=article&id=96&Itemid=104
- INIT/AERFAI Summer School on Machine Learning 2013 and 2017: ROC Analysis and Performance Evaluation Metrics: http://www.init.uji.es/school2013/lecturers.html
- ECML-PKDD'16 tutorial: Context-Aware Knowledge Discovery: Opportunities, Techniques and Applications: https://docs.google.com/presentation/d/1Q1_Wh8dcMDCH5DGuSxs_bieyIl8oDubmiYuZf8l9qu4/pub?slide=id.p
Required technical equipment
Participants will be able to follow the full tutorial just by means of the presenter's projector screen. However, most of the material will be available online, and some parts (e.g. the hands-on course) will be provided as Jupyter Notebooks, in case some participants want to run the Notebooks themselves or run them online with Google Colab.
References
An initial list in chronological order is given below. This includes work on forecasting and proper scoring rules [1, 11]; foundational work on cost-sensitive learning and calibration [2–4, 6–7]; ROC analysis and cost curves [5, 9–10]; empirical analysis [8, 12]; and recent advances [13–25].
[1] Glenn Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.
[2] John Platt. Probabilities for SV machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–74. MIT Press, 2000.
[3] Charles Elkan. The foundations of cost-sensitive learning. In Proc. 17th Int. Joint Conf. on Artificial Intelligence (IJCAI'01), pages 973–978. Morgan Kaufmann, 2001.
[4] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proc. 18th Int. Conf. on Machine Learning (ICML'01), pages 609–616, 2001.
[5] Foster Provost and Tom Fawcett. Robust classification for imprecise environments. Machine Learning, 42(3):203–231, 2001.
[6] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proc. 8th Int. Conf. on Knowledge Discovery and Data Mining (KDD'02), pages 694–699. ACM, 2002.
[7] Foster Provost and Pedro Domingos. Tree induction for probability-based ranking. Machine Learning, 52(3):199–215, 2003.
[8] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proc. 22nd Int. Conf. on Machine Learning (ICML'05), pages 625–632, 2005.
[9] Chris Drummond and Robert Holte. Cost curves: An improved method for visualizing classifier performance. Machine Learning, 65(1):95–130, 2006.
[10] Tom Fawcett and Alexandru Niculescu-Mizil. PAV and the ROC convex hull. Machine Learning, 68(1):97–106, 2007.
[11] Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
[12] Chris Bourke, Kun Deng, Stephen Scott, Robert Schapire, and N.V. Vinodchandran. On reoptimizing multi-class classifiers. Machine Learning, 71(2–3):219–242, 2008.
[13] José Hernández-Orallo, Peter Flach, and Cesar Ferri. Brier curves: A new cost-based visualisation of classifier performance. In Proc. 28th Int. Conf. on Machine Learning (ICML'11), pages 585–592, 2011.
[14] José Hernández-Orallo, Peter Flach, and Cesar Ferri. A unified view of performance metrics: translating threshold choice into expected classification loss. Journal of Machine Learning Research, 13(1):2813–2869, 2012.
[15] Ming-Jie Zhao, Narayanan Edakunni, Adam Pocock, and Gavin Brown. Beyond Fano's inequality: bounds on the optimal F-score, BER, and cost-sensitive risk and their implications. Journal of Machine Learning Research, 14(1):1033–1090, 2013.
[16] Meelis Kull and Peter Flach. Reliability maps: A tool to enhance probability estimates and improve classification accuracy. In Machine Learning and Knowledge Discovery in Databases (ECML-PKDD'14), LNCS vol. 8725. Springer, 2014.
[17] Oluwasanmi O. Koyejo, Nagarajan Natarajan, Pradeep K. Ravikumar, and Inderjit S. Dhillon. Consistent binary classification with generalized performance metrics. In Advances in Neural Information Processing Systems (NIPS'14), pages 2744–2752, 2014.
[18] Meelis Kull and Peter Flach. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. In Machine Learning and Knowledge Discovery in Databases (ECML-PKDD'15), pages 68–85. Springer, 2015.
[19] Peter Flach and Meelis Kull. Precision-recall-gain curves: PR analysis done right. In Advances in Neural Information Processing Systems (NIPS'15), pages 838–846, 2015.
[20] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In AAAI Conference on Artificial Intelligence (AAAI'15), 2015.
[21] Mahdi Pakdaman Naeini and Gregory Cooper. Binary classifier calibration using an ensemble of near isotonic regression models. In Proc. 16th IEEE Int. Conf. on Data Mining (ICDM'16), pages 360–369. IEEE, 2016.
[22] Meelis Kull, Telmo M. Silva Filho, and Peter Flach. Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11(2):5052–5080, 2017.
[23] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proc. 34th Int. Conf. on Machine Learning (ICML'17), Sydney, Australia, 2017.
[24] Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas Schön. Evaluating model calibration in classification. In Proc. 22nd Int. Conf. on Artificial Intelligence and Statistics (AISTATS'19), pages 3459–3467. PMLR, 2019.
[25] Meelis Kull, Miquel Perello-Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems (NeurIPS'19), 2019.