Calibration in Machine Learning Uncertainty Quantification: beyond consistency to target adaptivity

Reliable uncertainty quantification (UQ) in machine learning (ML) regression tasks is becoming the focus of many studies in materials and chemical science. It is now well understood that average calibration is insufficient, and most studies implement additional methods to test conditional calibration with respect to uncertainty, i.e. consistency. Consistency is assessed mostly by so-called reliability diagrams. There is, however, another route beyond average calibration, namely conditional calibration with respect to input features, i.e. adaptivity. In practice, adaptivity is the main concern of the final users of an ML-UQ method, who seek reliable predictions and uncertainties for any point in feature space. This article aims to show that consistency and adaptivity are complementary validation targets, and that good consistency does not imply good adaptivity. Adapted validation methods are proposed and illustrated on a representative example.

a) Electronic mail: pascal.pernot@cnrs.fr

I. INTRODUCTION
][7][8][9][10][11][12][13][14][15][16][17][18][19] However, not all of these UQ methods provide uncertainties that can be relied upon, 20,21 notably if, as in metrology, one expects uncertainty to inform us on a range of plausible values for a predicted property. 22,23 In pre-ML computational chemistry, UQ metrics consisted essentially of the standard uncertainty, i.e. the standard deviation of the distribution of plausible values (a variance-based metric), or the expanded uncertainty, i.e. the half-range of a prediction interval, typically at the 95 % level (an interval-based metric). 23,24 The advent of ML methods provided UQ metrics beyond this standard setup, for instance distances in feature or latent space 13,25,26 or the ∆-metric, 27 which have no direct statistical or probabilistic meaning. 3][34][35] Nevertheless, all UQ metrics need to be validated to ensure that they are adapted to their intended use. In this study, I focus on the reliability of variance-based UQ metrics for the prediction of properties at the individual level. 36

The validation of UQ metrics is based on the concept of calibration. A handful of validation methods exist that explore more or less complementary aspects of calibration. A trio of methods seems to have recently taken center stage: the reliability diagram 28,30, the calibration curve 29 and the confidence curve 37,38. They implement three different approaches to calibration which are not necessarily independent, but it is essential to realize that they do not cover the full spectrum of calibration requirements. In particular, none of these methods addresses the essential reliability of predicted uncertainties with respect to the input features, sometimes called individual calibration 39,40.

A. Scope and limitations of the study
The aim of this article is to propose a complete validation framework for variance-based UQ metrics, based on the concept of conditional calibration and its complementary aspects of consistency (conditional calibration with respect to uncertainty) and adaptivity (conditional calibration with respect to input features).
It is well known that average calibration is not sufficient to establish the reliability of ML-UQ predictions. This study goes one step further and is designed to alert ML users that conditional calibration with respect to uncertainty (consistency), as commonly tested by reliability diagrams 28,30,31,38,41-43, does not guarantee individual calibration. In order to approach individual calibration, it is necessary to ensure conditional calibration with respect to input features (adaptivity). As reliability diagrams are not designed to deal with adaptivity, a more convenient statistical framework dealing homogeneously with consistency and adaptivity is proposed. The corresponding workflow is illustrated in Fig. 1.
Note that for the sake of brevity, I present here only methods for variance-based UQ metrics, but the approach can be directly transposed to interval-based metrics 44. Also, this study does not offer advice or recipes on how to achieve good conditional calibration.

B. Structure of the article
The next section (Sect. II) introduces the notations and theoretical elements. The main notations and acronyms are summarized in Table I. The validation methods are presented in Sect. III and applied to a computational chemistry oriented example. The main conclusions are presented in Sect. IV.

II. VALIDATION OF VARIANCE-BASED UQ METRICS
The validation of variance-based UQ metrics requires at minimum a set of predicted values $V = \{V_i\}_{i=1}^M$ with their uncertainties $u_V = \{u_{V_i}\}_{i=1}^M$, and reference data to compare with, $R = \{R_i\}_{i=1}^M$ (with their uncertainties $u_R = \{u_{R_i}\}_{i=1}^M$, when relevant). From these, one estimates prediction errors $E = R - V$ and prediction uncertainties $u_E = (u_R^2 + u_V^2)^{1/2}$. These data make it possible to test average calibration (Sect. II A) and consistency as conditional calibration with respect to uncertainty (Sect. II B). An additional set of input features or adequate proxies $X = \{X_i\}_{i=1}^M$ is required for a full validation setup including conditional calibration with respect to inputs, or adaptivity (Sect. II B).

A. Average calibration
The validation of prediction uncertainty $u_E$ can be based on the requirement that it correctly quantifies the dispersion of prediction errors $E$. 21,45 Following the metrological definition of uncertainty, this is valid for so-called adequate models, i.e. models with negligible systematic or model errors. For models with non-negligible inadequacy levels, as can be expected, for instance, from ML methods below the interpolation threshold 46, prediction uncertainty is often designed to also cover model errors. 47 In such cases, validation should account for both bias and dispersion components of the errors. I will not discuss here the problems of reporting such uncertainties for actionable predictions or risk assessment, but will mostly focus on a self-consistent setup for the validation of UQ calibration where one seeks a reliable estimation of the amplitude of errors.
In such conditions, the basis for validation is to require that the mean squared error (MSE) is close to the mean variance over the validation dataset 47

$$\langle E^2 \rangle \simeq \langle u_E^2 \rangle \qquad (1)$$

However, this formula ignores the essential one-to-one pairing of errors and uncertainties, and a more stringent approach is based on z-scores ($Z = E/u_E$), using the condition

$$\langle Z^2 \rangle \simeq 1 \qquad (2)$$

which is related to the Birge ratio 48 for the validation of the residuals of least-squares fits 21.

Remarks.
• If $E$ and $u_E$ are obtained as the means and standard deviations of small ensembles of predictions (e.g. with less than 30 elements), these formulas have to be adapted, and hypotheses then need to be made on the error distributions for these small ensembles 21. For a normal generative distribution of errors, the distribution of the mean of $n$ values (ensemble size) is a Student's-t distribution with $\nu = n - 1$ degrees of freedom, and one should have 21

$$\langle Z^2 \rangle \simeq \frac{n-1}{n-3} \qquad (3)$$

• Unbiasedness is not an essential part of calibration, but it is a highly desirable property for predictions and z-scores and will be systematically considered as a test of prediction quality, i.e.

$$\langle Z \rangle \simeq 0 \qquad (4)$$
The satisfaction of Eq.
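The difference between the MSE-based condition and the z-scores-based condition can be illustrated numerically. The following is a minimal sketch on synthetic data, not the author's code: the function name `average_calibration`, the uniform distribution of uncertainties and the sample size are all assumptions made for illustration. Shuffling the uncertainties leaves the MSE and the mean variance untouched but destroys the pairing of errors and uncertainties, which only the mean squared z-score detects.

```python
import numpy as np

def average_calibration(E, uE):
    """Summary statistics for average calibration (hypothetical helper)."""
    E = np.asarray(E, dtype=float)
    uE = np.asarray(uE, dtype=float)
    Z = E / uE                        # z-scores pair each error with its own uncertainty
    return {
        "MSE": np.mean(E**2),         # mean squared error
        "MV": np.mean(uE**2),         # mean variance, to be compared with MSE
        "ZMS": np.mean(Z**2),         # mean squared z-score, should be close to 1
        "ZM": np.mean(Z),             # mean z-score, should be close to 0
    }

rng = np.random.default_rng(0)
M = 20_000
uE = rng.uniform(0.5, 2.0, size=M)    # synthetic uncertainties (assumption)
E = rng.normal(0.0, uE)               # errors drawn with the stated uncertainties
stats = average_calibration(E, uE)

# Same errors, same set of uncertainties, but the pairing is broken.
stats_shuffled = average_calibration(E, rng.permutation(uE))

print(stats)
print(stats_shuffled)
```

In this setup the MSE and mean variance agree in both cases, while the mean squared z-score stays near 1 only when the pairing is preserved.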

B. Individual, conditional and local calibration
The best calibration one could ideally achieve is individual calibration, a condition where one is confident that uncertainty is correctly calibrated for any individual prediction. The formulation of individual calibration for probabilistic forecasters by Chung et al. 39 led them to formalize it as conditional calibration in input features space. In practice (i.e. for finite size datasets), individual calibration has been shown to be unreachable 40, and an alternative is to consider a discretized form as local or group calibration 49. This is reflected in the practical estimation of conditional statistics by data binning or grouping 39. In a similar spirit, conditional coverage with respect to input features was proposed by Vovk 32 to assess the adaptivity 33 of conformal predictors.
For variance-based UQ metrics, Levi et al. 30,42 proposed an approach based on conditional calibration in uncertainty space, namely

$$\langle E^2 \mid u_E = \sigma \rangle \simeq \sigma^2, \quad \forall \sigma > 0 \qquad (5)$$

which is the basis of the popular reliability diagrams 28 or RMSE vs RMV plots, also called calibration diagrams 41, error-based calibration plots 38, RvE plots 50, or RMSE vs. RMV curves 43.
Levi et al. claim that, assuming that each uncertainty value occurs only once in the dataset, their method "captures the desired meaning of calibration, i.e., for each individual example, one can correctly predict the expected mistake". In practice, the unicity assumption faces two major difficulties: (1) some datasets are stratified, with several occurrences of the same uncertainty value 51, and (2) the practical implementation of conditional calibration requires grouping the data to estimate the mean squared error (MSE), which breaks the one-to-one correspondence between the tested uncertainties and errors, as mentioned above for average calibration.
In consequence, conditional calibration based on Eq. 5 is not sufficient to validate calibration at the individual level. To go further, one should consider other conditioning variables besides $u_E$, notably input features or variables of interest for the end-user of a ML model, as proposed for probabilistic forecasters 39 and conformal predictors 32,33.
Building on the works of Levi et al. 42, Pernot 21 and Angelopoulos et al. 33 about conditional calibration, I propose here to distinguish two calibration targets (besides average calibration), namely consistency as the conditional calibration with respect to prediction uncertainty, and adaptivity as the conditional calibration with respect to input features:
• Consistency is a special case of conditional calibration, in the sense that it involves only $E$ and $u_E$. Using the z-scores statistics introduced for average calibration, one can define consistency by the following equation

$$\langle Z^2 \mid u_E = \sigma \rangle \simeq 1, \quad \forall \sigma > 0 \qquad (6)$$

Consistency is related to the metrological consistency of measurements 52.
• Adaptivity is also conveniently formulated with z-scores as

$$\langle Z^2 \mid X = x \rangle \simeq 1, \quad \forall x \in \mathcal{X} \qquad (7)$$

where $\mathcal{X}$ is the set of values accessible to $X$. Adaptivity involves more information than consistency ($X$, $E$, and $u_E$).
Unless there is a monotonic transformation between $u_E$ and $X$, consistency and adaptivity are distinct calibration targets: good consistency does not imply good adaptivity, and vice versa, so that both should be assessed. Note that tightness, as introduced earlier by Pernot 21, covers both consistency and adaptivity.
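That consistency can hold while adaptivity fails is easy to reproduce on a toy case. The sketch below is an illustration under stated assumptions, not an example from this study: the feature `X`, the linear growth of the true error scale and the constant reported uncertainty are all hypothetical choices. Because the reported uncertainty takes a single value, conditioning on $u_E$ reduces to the average test, which passes, whereas conditioning on `X` reveals strong local miscalibration.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 50_000
X = rng.uniform(0.0, 1.0, size=M)          # hypothetical input feature
sigma_true = 0.5 + X                       # true error scale grows with X (assumption)
E = rng.normal(0.0, sigma_true)            # prediction errors

# A UQ method reporting the same uncertainty for all points, correct on average.
uE = np.full(M, np.sqrt(np.mean(sigma_true**2)))
Z = E / uE

# Conditioning on uE: a single uncertainty value, hence a single group.
zms_given_uE = np.mean(Z**2)

# Conditioning on X: mean squared z-scores in 10 equal-width bins of X.
edges = np.linspace(0.0, 1.0, 11)
bin_index = np.digitize(X, edges[1:-1])    # integers from 0 to 9
zms_given_X = np.array([np.mean(Z[bin_index == k] ** 2) for k in range(10)])

print(round(zms_given_uE, 3))
print(np.round(zms_given_X, 2))
```

In this construction the binned values rise steadily from well below 1 (overestimated uncertainties at small `X`) to well above 1 (underestimated uncertainties at large `X`).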
Average calibration is a necessary condition to reach consistency or adaptivity. In fact, consistency/adaptivity expressed as conditional calibration should imply average calibration, but because the data are split into subsets, the power of individual consistency/adaptivity tests is smaller than for a test on the full validation set. It is therefore better to test average calibration separately, notably for small validation datasets.
Most methods used to this day for the validation of variance-based UQ metrics in chemical/materials sciences ML studies involve only $E$ and $u_E$ (reliability diagrams, calibration curves, confidence curves...) 38. Adaptivity can thus be considered as a blind spot in UQ validation, despite its necessity to achieve reliable UQ at the molecule-specific level advocated by Reiher 36.

III. VALIDATION METHODS
This section presents z-scores-based methods to assess and validate consistency and adaptivity. An alternative formulation, based on relative calibration errors, is also proposed in Sect. III B 4.

A. Homoscedasticity plots of z-scores
A simple way to estimate consistency is to plot the z-scores $Z$ as a function of $u_E$ 21. The dispersion of $Z$ should be homogeneous along $u_E$ (homoscedasticity) and, ideally, symmetric around $Z = 0$ (unbiasedness). In areas where the z-scores are biased, if any, one should observe a larger dispersion. This might not be easy to appreciate visually, and running statistics can be superimposed on the data cloud, such as the mean, to be compared with $Z = 0$, and the mean squares, to be compared with the $Z = 1$ line.
This plot is easily extended to any variable $X$ other than $u_E$ and can be directly applied to the visual appreciation of adaptivity. Note that in the present context, the $Z$ vs $u_E$ plot is preferable to the $E$ vs $u_E$ plot used in other studies 13,21,25, as it offers a consistent representation for both consistency and adaptivity estimation.
For cases where consistency/adaptivity cannot be clearly rejected on the basis of the shape or scale of this data cloud, it is necessary to perform more quantitative tests as presented below. One should not conclude that consistency/adaptivity is good based solely on this kind of plot.
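The running statistics mentioned above can be obtained with a simple moving average over the data sorted by the conditioning variable. This is a minimal sketch, assuming a rectangular sliding window; the function name `running_z_stats` and the synthetic data are illustrative choices, and the window size of M/100 mirrors the one quoted in the caption of Fig. 2.

```python
import numpy as np

def running_z_stats(C, Z, window):
    """Running <Z> and <Z^2> along a conditioning variable C (hypothetical helper)."""
    C = np.asarray(C, dtype=float)
    Z = np.asarray(Z, dtype=float)
    order = np.argsort(C, kind="stable")
    C_sorted, Z_sorted = C[order], Z[order]
    kernel = np.ones(window) / window
    # mode="valid" keeps only the windows that lie fully inside the data
    run_C = np.convolve(C_sorted, kernel, mode="valid")       # mean of C in each window
    run_mean = np.convolve(Z_sorted, kernel, mode="valid")    # to compare with 0
    run_ms = np.convolve(Z_sorted**2, kernel, mode="valid")   # to compare with 1
    return run_C, run_mean, run_ms

rng = np.random.default_rng(2)
M = 10_000
uE = rng.uniform(0.5, 2.0, size=M)     # synthetic, calibrated example (assumption)
Z = rng.normal(0.0, uE) / uE
window = M // 100
run_C, run_mean, run_ms = running_z_stats(uE, Z, window)
```

The three returned arrays can be overlaid on a scatter plot of `Z` against the conditioning variable with any plotting library; passing a feature instead of `uE` gives the adaptivity version of the plot.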
⊲ Example, continued. The molecular mass ($X_1$; in Dalton (Da)) and fraction of heteroatoms ($X_2$; unitless) are generated from the molecular formulas of the QM9 dataset, and used as proxies for input features. They are practically uncorrelated with each other and weakly correlated with $|E|$ and $u_E$ (Table II). The dataset can thus be tested for consistency and adaptivity.
The homoscedasticity of z-scores for the QM9 dataset is estimated against $u_E$, $X_1$ and $X_2$ (Fig. 2). One sees in Fig. 2(a) that the data points are fairly symmetrically dispersed (mean; red line) and that the running mean squares (orange line) follows rather closely the $Z = 1$ line, up to $u_E \simeq 0.02$ eV, after which it lies at higher values. However, this concerns a small population (980 points) and the problem could be due to the data sparsity in this uncertainty range.
The "$Z$ vs $X_1$" plot in Fig. 2(b) makes it possible to check whether calibration is homogeneous in molecular mass space. The running mean does not deviate notably from 0 (except around $X_1 \simeq 100$ Da, with a correlated increase in $\langle Z^2 \rangle$). The shape of the running mean squares line, erring towards small $\langle Z^2 \rangle$ values, indicates that uncertainties are probably overestimated for masses smaller than the main mass cluster (around 125-130 Da), evolving to a slight underestimation above this peak. This trend hints at a lack of adaptivity. A similar plot is shown for $X_2$, the fraction of heteroatoms [Fig. 2(c)], where the running mean presents a weak but systematic trend from positive to negative values. Besides, the z-scores are under-dispersed for molecules with low heteroatom fractions (below 0.1), after which the running mean squares line presents notable oscillations around the $Z = 1$ reference line and seems to stabilize above $X_2 = 0.4$, where the data are sparse.
In this dataset, stratification of the conditioning variables is notable. For instance, the set of uncertainties $u_E$ contains only 138 distinct numerical values, a fact which can be attributed to recalibration by a step-wise isotonic regression function. But stratification might also occur independently of any algorithm: $X_1$ contains 398 unique values, and $X_2$ is strongly stratified, with only 76 values. Stratification should be taken into account when binning these variables (see Sect. III B 3).
From these three plots, one gets the impression that calibration is rather good at the core of the dataset (where the density of data is highest), but more problematic in the margins.
A more quantitative analysis of these features is desirable, but one might already conclude that adaptivity is not reached.

B. Local calibration
Conditional calibration in uncertainty space as formulated in Eq. 5 is often tested in the literature by reliability diagrams based on groups defined as uncertainty bins 30. This representation does not adapt conveniently to other grouping schemes. In contrast, the approach based on z-scores (Eqns. 6, 7), besides its interest evoked for average calibration, offers a uniform treatment for all conditioning variables and is used preferentially in this study. For readers more familiar with the use of calibration errors, an alternative formulation based on Local Relative Calibration Error is proposed in Sect. III B 4.

1. Local Z-Mean and Z-Mean-Squares analysis
Testing for consistency is based on a binning of the data according to increasing uncertainties. A Local Z Variance (LZV) analysis was introduced by Pernot 45 as a method to test local calibration: for each bin, one estimates Var($Z$) and compares it to 1. In the present framework, the LZV analysis is adapted to account for the possibility of accepting significant deviations of local $\langle Z \rangle$ values from 0, and one will be using the Local Z Mean Squares (LZMS) statistic, based on Eq. 6. A Local Z Mean (LZM) statistic can also be used to check the local unbiasedness of z-scores.
Assessment of a LZMS analysis is based on two criteria, the deviation of the LZMS values from 1 and the homogeneity of their distribution along the conditioning variable.
The maximal admissible deviations depend on the bin size and error distribution. For instance, one should expect larger deviations from errors and uncertainties obtained as statistical summaries of small ensembles than from errors and uncertainties describing a normal distribution. Unless the error model is well known and controlled, which might not be the case for post-hoc calibration methods, it is impossible to define a threshold on LZMS values for validation purposes. It is therefore necessary to estimate confidence intervals (CIs) on the LZMS values to test their consistency with the target value.
For $\langle Z \rangle$, the standard formula based on the quantiles of the Student's-t distribution provides intervals with satisfactory coverage, even for small samples and non-normal distributions with finite variance. The case of $\langle Z^2 \rangle$ is more difficult, as standard formulas fare poorly when one deviates from the standard normality of the z-scores. The same problem was observed for Var($Z$), 45 and a convergence and power study concluded that the most reliable approach was to use bootstrapping 53,54 with samples of at least 100 points. In such conditions, the effective coverage of 95 % CIs reaches at least 90 % for Var($Z$) and $\langle Z^2 \rangle$.
To achieve a 95 % coverage with a Student's distribution of z-scores, 1000 points per bin are required, which might limit the resolution of the local analysis. When in doubt, the LZMS analysis can be performed with several bin sizes to assess its reliability. When reliable CIs are obtained, the proportion of valid intervals, i.e. those covering the target value, can be used as a validation metric (Sect. III B 2).
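A bootstrap-based LZMS analysis can be sketched in a few lines. The version below is a simplified illustration and not the implementation used in this study: it assumes equal-size bins and a plain percentile bootstrap, whereas the bootstrap variant of the cited references is not specified here; the function name `lzms_bootstrap` and all parameter defaults are hypothetical.

```python
import numpy as np

def lzms_bootstrap(C, Z, n_bins, n_boot=1000, level=0.95, seed=0):
    """Binned <Z^2> with percentile-bootstrap CIs (simplifying assumption).

    Returns an array of shape (n_bins, 3): estimate, lower bound, upper bound.
    """
    rng = np.random.default_rng(seed)
    order = np.argsort(C, kind="stable")
    z2_bins = np.array_split(np.asarray(Z, dtype=float)[order] ** 2, n_bins)
    alpha = 1.0 - level
    rows = []
    for z2 in z2_bins:
        # resample the bin content with replacement, n_boot times
        idx = rng.integers(0, z2.size, size=(n_boot, z2.size))
        boot_means = z2[idx].mean(axis=1)
        lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
        rows.append((z2.mean(), lo, hi))
    return np.array(rows)

rng = np.random.default_rng(3)
M = 10_000
uE = rng.uniform(0.5, 2.0, size=M)      # synthetic, calibrated example (assumption)
Z = rng.normal(0.0, uE) / uE
res = lzms_bootstrap(uE, Z, n_bins=50)  # 200 points per bin

# Fraction of bins whose CI covers the target value 1
valid = (res[:, 1] <= 1.0) & (1.0 <= res[:, 2])
f_valid = valid.mean()
```

Rerunning the analysis with a different `n_bins` is one way to check the sensitivity of the diagnostic to the bin size, as suggested above.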
The homogeneity of the distribution of the LZMS values along the conditioning variable can often be appreciated visually (any cluster showing systematic deviation from the target represents a local calibration problem). However, the graph might sometimes be crowded, and the auto-correlation function (ACF) of the LZMS statistics might help to detect undesirable serial correlations.
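The ACF of a series of binned statistics is straightforward to compute. The sketch below uses the usual sample estimator; the function name `acf` and the two synthetic series are assumptions made for illustration. For a series of N uncorrelated values, sample autocorrelations at non-zero lags are expected to stay roughly within ±1.96/√N, so a slow decay well outside this band signals clusters of similarly deviant bins.

```python
import numpy as np

def acf(series, max_lag):
    """Sample autocorrelation function for lags 0..max_lag (hypothetical helper)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.sum(x**2)
    return np.array(
        [np.sum(x[: x.size - k] * x[k:]) / denom for k in range(max_lag + 1)]
    )

rng = np.random.default_rng(4)
n_bins = 200

# Homogeneous case: binned <Z^2> values fluctuating independently around 1
lzms_flat = 1.0 + rng.normal(0.0, 0.1, size=n_bins)
# Inhomogeneous case: a systematic drift of <Z^2> along the conditioning variable
lzms_drift = np.linspace(0.5, 1.5, n_bins) + rng.normal(0.0, 0.05, size=n_bins)

acf_flat = acf(lzms_flat, max_lag=10)
acf_drift = acf(lzms_drift, max_lag=10)
band = 1.96 / np.sqrt(n_bins)           # approximate white-noise band
```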

2. Validation metrics
Calibration metrics 1,55 are widely used in the ML-UQ literature: for instance, metrics have been designed for calibration curves 8,29, reliability diagrams 42, and confidence curves 38.
These metrics are generally used to compare and rank UQ methods, but they do not provide a validation setup accounting for the statistical fluctuations due to finite-sized datasets or numbers of bins. It has been shown recently 56 that the expected normalized calibration error (ENCE) 38,42,43 cannot be used directly as a validation metric. The calculation of reference values for those metrics is an option introduced recently 57,58, but being based on a probabilistic model, it requires the choice of a probability distribution for the errors, which might complicate the diagnostic.
Here, let us take advantage of the availability of confidence intervals on the local statistics of the LZM and LZMS methods as the basis for a validation metric. For a perfectly calibrated dataset, the fraction of binned statistics with a CI containing the target value should be close to the coverage probability of the CIs. Namely, about 95 % of the binned $\langle Z^2 \rangle$ values should have 95 % CIs containing the target value. Let us denote this fraction of validated intervals by $f_{v,ZMS}$. In practice, one should not expect to recover exactly 95 %, and a CI for $f_{v,ZMS}$ has to be estimated from the binomial distribution to account for the limited number of bins 45.
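The test of a fraction of validated intervals against its target can be sketched as follows. The text only states that the CI derives from the binomial distribution; the Wilson score interval used here is one common choice for a binomial proportion and is an assumption of this sketch, as is the function name `validated_fraction`. The two calls reuse fractions quoted in the example of this section (0.97 and 0.62), assuming 100 bins in both cases.

```python
from math import sqrt
from statistics import NormalDist

def validated_fraction(n_valid, n_bins, level=0.95):
    """Fraction of valid bins with a Wilson score interval (assumed CI type)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)      # about 1.96 for level = 0.95
    p = n_valid / n_bins
    denom = 1 + z**2 / n_bins
    centre = (p + z**2 / (2 * n_bins)) / denom
    half = z * sqrt(p * (1 - p) / n_bins + z**2 / (4 * n_bins**2)) / denom
    return p, centre - half, centre + half

target = 0.95
fv_good, lo_good, hi_good = validated_fraction(97, 100)
fv_bad, lo_bad, hi_bad = validated_fraction(62, 100)

# The statistic is compatible with its target when the CI covers 0.95
ok_good = lo_good <= target <= hi_good
ok_bad = lo_bad <= target <= hi_bad
```

With fewer bins the interval widens quickly, which is why the number of bins matters for this metric.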

3. Binning/grouping strategies
A sensitive point for the LZMS analysis is the choice of a binning scheme. The bin size should be small enough to get as close as possible to individual calibration testing and to provide information on the localization of any miscalibrated area, but also large enough to ensure a reasonable power for statistical estimation and testing of binned statistics. Moreover, stratification of the dataset, if present, also has to be considered.
a. Effect of equal-size binning for stratified conditioning variables. The equal-size binning scheme is a standard approach implemented for instance in reliability diagrams 30. One has to be aware that it does not account for the possible stratification of the conditioning variable. It was shown recently that bin-based statistics in such conditions are affected by the order of the data in the analyzed dataset. 51 This effect is unavoidable and its impact on the statistics should be checked, for instance by repeated estimation of the binned statistics for randomly reordered datasets.
Note that getting a good estimate of $f_{v,ZMS}$ requires contradictory conditions, i.e. reliable confidence intervals for the binned statistics, therefore a number of points per bin as large as possible, but also a number of bins as large as possible. A good balance is obtained by choosing the number of bins as the square root of the dataset size ($N = M^{1/2}$). The use of this statistic should therefore preferably be reserved to large datasets, with more than $10^4$ points.
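The reordering check described above can be sketched as follows: the dataset is shuffled, sorted again by the conditioning variable with a stable sort (so that tied values keep their shuffled order), and the binned statistics are recomputed. The function names and the synthetic conditioning variables are assumptions of this sketch; the number of bins follows the square-root rule stated above.

```python
import numpy as np

def binned_zms(C, Z, n_bins):
    """<Z^2> in equal-size bins of the data sorted by C (ties keep dataset order)."""
    order = np.argsort(C, kind="stable")
    z2 = np.asarray(Z, dtype=float)[order] ** 2
    return np.array([b.mean() for b in np.array_split(z2, n_bins)])

def reordering_spread(C, Z, n_bins, n_rep=50, seed=0):
    """Binned <Z^2> for n_rep random reorderings; returns shape (n_rep, n_bins)."""
    rng = np.random.default_rng(seed)
    C, Z = np.asarray(C), np.asarray(Z)
    out = []
    for _ in range(n_rep):
        perm = rng.permutation(C.size)
        out.append(binned_zms(C[perm], Z[perm], n_bins))
    return np.array(out)

rng = np.random.default_rng(5)
M = 10_000
n_bins = int(np.sqrt(M))                        # N = M^(1/2) = 100 bins
Z = rng.normal(size=M)
C_strat = rng.integers(0, 20, size=M) / 20.0    # stratified: 20 distinct values
C_cont = rng.uniform(size=M)                    # continuous: no ties expected

spread_strat = reordering_spread(C_strat, Z, n_bins).std(axis=0)
spread_cont = reordering_spread(C_cont, Z, n_bins).std(axis=0)
```

For the continuous variable the bins are identical across reorderings, so the spread vanishes; for the stratified one each bin is an arbitrary subset of a stratum and the spread is non-zero.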
b. Stratified binning. For notably stratified conditioning variables, a binning scheme preserving the strata might be more appropriate than equal-size binning, as it avoids the splitting of strata into arbitrary bins. However, many strata might be too small to enable reliable statistics. Instead of rejecting these low-count strata, one can merge them with their neighbors. I use here an iterative algorithm where any small stratum (typically less than 100 points) is merged with the smallest of its neighbors. The value and counts of the strata are updated according to the relative counts of the merged strata. This simple merging is iterated until no small stratum is left. The result is not affected by data ordering.
A drawback is that one does not control the number of bins, which might get too low for a reliable estimation of the $f_{v,ZMS}$ validation statistic.
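The merging procedure can be sketched as below. Two details are not specified in the text and are assumptions of this sketch: small strata are processed starting from the smallest one, and the value of a merged stratum is the count-weighted mean of the merged values. The function name `merge_small_strata` and the Poisson-distributed test variable are hypothetical.

```python
import numpy as np

def merge_small_strata(C, min_count=100):
    """Merge strata of C smaller than min_count with their smallest neighbour.

    Returns the merged values, their counts, and for each merged bin the list
    of original stratum values it contains.
    """
    values, counts = np.unique(C, return_counts=True)   # sorted distinct values
    values, counts = list(values.astype(float)), list(counts)
    members = [[v] for v in values]
    while len(counts) > 1 and min(counts) < min_count:
        i = int(np.argmin(counts))                      # assumption: smallest stratum first
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(counts)]
        j = min(neighbours, key=lambda k: counts[k])    # smallest of its neighbours
        lo, hi = min(i, j), max(i, j)
        total = counts[lo] + counts[hi]
        # count-weighted value of the merged stratum
        values[lo] = (values[lo] * counts[lo] + values[hi] * counts[hi]) / total
        counts[lo] = total
        members[lo] = members[lo] + members[hi]
        del values[hi], counts[hi], members[hi]
    return values, counts, members

rng = np.random.default_rng(6)
C = rng.poisson(8, size=5000).astype(float)   # stratified variable with sparse tails
values, counts, members = merge_small_strata(C, min_count=100)

# Map each data point to the index of its merged bin
lookup = {v: k for k, group in enumerate(members) for v in group}
bin_index = np.array([lookup[c] for c in C])
```

Since the strata are obtained from the distinct values and their counts only, the outcome does not depend on the order of the data.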
c. Choice of conditioning variables or groups for adaptivity. Although using one or several input features as conditioning variable is the most direct way to test adaptivity, it might not always be practical, for instance when input features are strings, graphs or images. In such cases, one might use dimension reduction algorithms such as t-SNE 25,59 or UMAP 60 in order to define relevant groups. One might also use proxy variables, latent variables, or even the predicted property value $V$. Using $V$ answers the question: are uncertainties reliable over the full range of predictions? A problem with $V$ is that it is potentially strongly correlated with $E$, which might lead to spurious features in the LZMS analysis. For the complementarity of consistency and adaptivity tests, it is better to use $X$ variables that are not strongly correlated with $E$ and $u_E$. If there is no sensible way to define a conditioning variable, one might consider adversarial group validation.
d. Adversarial groups. One can avoid choosing conditioning variables by designing random groups to be tested for calibration, in the spirit of adversarial group calibration (AGC).
In AGC, the largest calibration error is estimated over a set of random samples of a given size, for sizes varying from a small fraction of the dataset to the full dataset. 39,40,61 This approach is mostly used to compare datasets, but does not provide a validation setup. As explained above, even for fully calibrated datasets, the amplitude of calibration errors depends also on the group size and error distribution, which makes comparisons difficult for datasets with unknown or different distributions.
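The dependence of the worst-case error on group size is easy to reproduce. The sketch below is only in the spirit of AGC: it uses the absolute deviation of the mean squared z-score from 1 as the calibration error, which is an assumption of this illustration and not the metric of the cited AGC implementations; the function name `agc_curve` and the number of draws are also hypothetical.

```python
import numpy as np

def agc_curve(Z, fractions, n_draws=100, seed=0):
    """Worst |<Z^2> - 1| over random groups, for several group-size fractions."""
    rng = np.random.default_rng(seed)
    z2 = np.asarray(Z, dtype=float) ** 2
    worst = []
    for f in fractions:
        size = max(1, int(round(f * z2.size)))
        errs = [
            abs(np.mean(rng.choice(z2, size=size, replace=False)) - 1.0)
            for _ in range(n_draws)
        ]
        worst.append(max(errs))
    return np.array(worst)

rng = np.random.default_rng(7)
Z = rng.normal(size=10_000)              # perfectly calibrated z-scores (synthetic)
fractions = [0.01, 0.1, 1.0]
worst = agc_curve(Z, fractions)
```

Even for these perfectly calibrated z-scores, the worst-case error is much larger for small groups than for the full dataset, which illustrates why such a curve cannot be read as a validation diagnostic without a reference.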
It is possible to design an AG validation method based on the $f_{v,ZMS}$ metric. Preliminary tests of this approach revealed three main limitations that make it impractical: (1) if a dataset presents localized calibration issues, there is a low probability to randomly sample groups revealing this problem, and one will get an overly optimistic diagnostic; (2) random groups are not interpretable, and there is therefore very little to learn about the origins of miscalibration; and (3) the computer time for the repeated estimation of converged bootstrap-based CIs might be prohibitive considering the small information return.
Another option for validation is to use a conventional AGC curve and build a probabilistic reference AGC curve, as suggested previously for confidence curves 57. In contrast with the latter case, this probabilistic AGC reference curve is very sensitive to the choice of a probability distribution for the generated errors, which might lead to ambiguous diagnostics.
Because of these difficulties, AG validation has not been retained for the present study.
More generally, it has nevertheless to be kept as an option when the design of adequate conditioning variables is problematic. Further research to design a robust AGC reference curve is needed.
⊲ Example, continued. The unbiasedness and consistency of the QM9 validation set are tested by performing a LZM/LZMS analysis in $u_E$ space with 100 equi-sized bins. The corresponding $f_v$ validation statistics are presented in Fig. 4.
One sees for $\langle Z \rangle$ [Fig.a)]. The fraction of valid intervals (in blue) is high ($f_{v,ZM} = 0.97$) and in statistical agreement with its target value of 0.95. For $\langle Z^2 \rangle$ [Fig.are the statistics for the binning scheme based on the preservation of strata, with binomial uncertainty. They summarize the LZM and LZMS analyses reported in Fig. 5.
For the fraction of heteroatoms [Fig. 3(c,f,i)], the LZM analysis displays the same trend from positive to negative $\langle Z \rangle$ values as observed in Fig. 2(c), with a sub-optimal fraction of valid bins ($f_{v,ZM} = 0.80$). The LZMS analysis reveals clusters of deviant bins at several spots along the $X_2$ axis, which is reflected in a slowly decreasing ACF. The fraction of valid bins is small ($f_{v,ZMS} = 0.62$). Here again, despite the strong stratification of $X_2$, the $f_{v,ZMS}$ statistic is not strongly affected by the reordering perturbation [Fig. 4(c)].
The LZM/LZMS analysis based on stratified binning with a minimum of 100 points per bin is presented in Fig. 5, and the $f_v$ values are also reported in Fig. 4. This representation is less crowded than the equal-size binning and provides essentially the same conclusions.
One notes more severe values of the $f_v$ statistic for the adaptivity analysis, with larger error bars due to a smaller number of bins.
All diagnostics based on $E$ and $u_E$ therefore point to a good calibration and an acceptable consistency. The main feature revealed by this analysis is the lack of adaptivity seen by the LZMS analysis for both molecular mass and fraction of heteroatoms. A major trend is a significant underestimation of the quality of predictions for the lighter molecules in the QM9 dataset (below 120 Da) and also for those with a small fraction of heteroatoms (below 0.1).

4. Alternative approach: the Local Relative Calibration Error
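Deriving from the logic behind reliability diagrams, a popular measure to assess the error in calibration is the Expected Normalized Calibration Error (ENCE) 30, which averages the absolute relative calibration errors over the bins

$$\mathrm{ENCE} = \frac{1}{N} \sum_{i=1}^{N} |\mathrm{RCE}_i| \qquad (8)$$

where $N$ is the number of bins, $\mathrm{RCE}_i$ is the Relative Calibration Error in bin $i$

$$\mathrm{RCE}_i = \frac{\mathrm{RMV}_i - \mathrm{RMSE}_i}{\mathrm{RMV}_i} \qquad (9)$$

$\mathrm{RMSE}_i$ is the root mean squared error for bin $i$, and $\mathrm{RMV}_i$ is the root mean variance, $\langle u_E^2 \rangle^{1/2}$, in bin $i$.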
In the usual applications of ENCE, the bins are based on uncertainty, so that the ENCE is a measure of consistency.However, Eq. 8 is valid for any binning scheme, and therefore the ENCE can also be used as an adaptivity measure.
In this context, the binned RCE offers an alternative to the z-scores formulation and can be used to establish conditional equations similar to Eqns. 6-7

$$(\mathrm{RCE} \mid u_E = \sigma) \simeq 0, \quad \forall \sigma > 0 \qquad (10)$$

and

$$(\mathrm{RCE} \mid X = x) \simeq 0, \quad \forall x \in \mathcal{X} \qquad (11)$$

This defines the Local RCE (LRCE) analysis that can be implemented through data binning according to any conditioning variable, as for the LZMS analysis.
This formulation could be more appealing to users familiar with the ENCE, despite the underlying problem mentioned for average calibration that the RMSE and RMV values are insensitive to the pairing of errors and uncertainties. The ZMS approach is therefore more robust. However, the most important goal at the present stage of ML-UQ development is for practitioners to assess adaptivity, be it by LRCE or LZMS. Besides, a LRCE analysis could be more consistent with existing ENCE-based toolboxes, such as the Uncertainty Toolbox 61.
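A binned RCE analysis for an arbitrary conditioning variable can be sketched as follows. This is an illustration under stated assumptions: equal-size bins, the sign convention RCE = (RMV − RMSE)/RMV (positive values then indicate overestimated uncertainties), and hypothetical function names `local_rce` and `ence`; it does not reproduce the interface of any existing toolbox.

```python
import numpy as np

def local_rce(C, E, uE, n_bins):
    """Relative calibration error in equal-size bins of the data sorted by C."""
    order = np.argsort(C, kind="stable")
    E2 = np.array_split(np.asarray(E, dtype=float)[order] ** 2, n_bins)
    u2 = np.array_split(np.asarray(uE, dtype=float)[order] ** 2, n_bins)
    rmse = np.sqrt([b.mean() for b in E2])    # root mean squared error per bin
    rmv = np.sqrt([b.mean() for b in u2])     # root mean variance per bin
    return (rmv - rmse) / rmv                 # assumed sign convention

def ence(rce):
    """Mean absolute relative calibration error over the bins."""
    return np.mean(np.abs(rce))

rng = np.random.default_rng(8)
M = 20_000
uE = rng.uniform(0.5, 2.0, size=M)            # synthetic uncertainties (assumption)
E = rng.normal(0.0, uE)                       # calibrated errors

rce_ok = local_rce(uE, E, uE, n_bins=50)          # calibrated case
rce_over = local_rce(uE, E, 2.0 * uE, n_bins=50)  # uncertainties doubled
print(round(ence(rce_ok), 3), round(ence(rce_over), 3))
```

Passing an input feature instead of `uE` as the first argument turns the same function into an adaptivity diagnostic.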

IV. CONCLUSIONS
The concept of conditional calibration enables to define two aspects of local calibration: consistency, which assesses the reliability of UQ metrics across the range of uncertainty values, and adaptivity, which assesses the reliability of UQ metrics across the range of input features.As the validation of individual calibration is practically impossible, one has to rely on validation methods based on local or group calibration, making consistency and adaptivity complementary validation targets.
Consistency and adaptivity can be tested by binned statistics such as the mean of squared z-scores < Z 2 >, leading to the LZMS analysis.Bins with large deviations from the target value (typically 1) and groups of adjacent bins with similar deviations reveal local calibration errors.The LZMS analysis enables to test conditional calibration for any conditioning variable, giving access to both consistency and adaptivity validation.An alternative formulation based on local relative calibration errors (LRCE) could also be considered.A validation metric f v based on the proportion of bins with the confidence interval of a statistic containing its target value was proposed.The focus of this study is on variance-based UQ metrics, but this validation framework can easily be extended to interval-based UQ metrics 21,45 .
These methods were applied to a representative example issued from a recent study by Busk et al. 31 about atomization energies from the QM9 dataset, revealing a good average calibration, a slightly sub-optimal consistency, and a problematic adaptivity, either in the molecular mass space or the heteroatoms fraction space.This dataset presents several sources of stratification, and it was shown that the uncertainty due to the interplay of equal-size binning with data ordering expected for stratified conditioning variables is not dominant for the statistics considered here.An alternative strata-based binning LZMS approach led to similar diagnostics, with the inconvenience of a larger uncertainty on the validation statistics due to the smaller numbers of bins.
Up to now, ML-UQ validation studies in chemical and materials sciences have mainly focused on consistency. This addresses to some extent the concerns of ML-UQ designers, who want all uncertainties to be reliable, whether small or large. It was shown however that a positive consistency diagnostic does not guarantee a positive adaptivity diagnostic, and therefore that a good consistency does not imply a good individual calibration. There is therefore a strong need for adaptivity to be systematically considered as well in ML-UQ studies, notably for final users, who expect the uncertainties of individual predictions to be reliable throughout the input features space.

Figure 1. Flowchart of the z-scores-based validation framework.

Figure 2. QM9 dataset: z-scores vs. uncertainty (a), molecular mass (b) and the fraction of heteroatoms (c). Running statistics (mean $\langle Z \rangle$ in red and mean squares $\langle Z^2 \rangle$ in orange) are estimated for a sliding window of size M/100.

Fig. 4(a)]. To account for the finite number of bins, each value has also been perturbed by binomial noise [orange diamonds in Fig. 4(a)]. In the present case, the dispersion due to data reordering is small compared to the binomial uncertainty for both statistics. The data ordering uncertainty is not sufficient to make the confidence interval for $f_{v,ZMS}$ overlap with the target coverage (0.95), which validates the conclusions of the nominal LZMS analysis.
Adaptivity is tested using the same protocol. For the molecular mass [Fig. 3(b,e,h)], the fraction of biased bins reaches 12 % ($f_{v,ZM} = 0.88$) and the fraction of deviant bins for $\langle Z^2 \rangle$ is about 40 % ($f_{v,ZMS} = 0.6$). There is a strong predominance of deviant bins for masses below 120 Da, where the small values of the statistic point to overestimated uncertainties. As a consequence, the ACF of the LZMS series presents a slow decay, to be compared with the one obtained for $u_E$. If one accepts that there is no strong bias of $Z$ in this area, values of $\langle Z^2 \rangle$ around 0.5 can be interpreted as an excess factor of $1/\sqrt{0.5} \simeq 1.4$ for the uncertainties. Here again, despite the notable stratification of the molecular masses, the ordering of the data does not have a strong impact on the $f_v$ statistics [Fig. 4(b)].
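The excess factor quoted above follows from a simple scaling argument. As a sketch, assume that $Z = E/u_E$ with $E$ the prediction error, that $Z$ is unbiased in the region of interest, and that the reported uncertainties are inflated by a constant factor $k$ with respect to hypothetical well-calibrated values $u_E^{*}$ (the symbols $k$ and $u_E^{*}$ are introduced here for illustration only):

```latex
\[
  Z = \frac{E}{k\,u_E^{*}}
  \quad\Longrightarrow\quad
  \langle Z^2 \rangle
  = \frac{1}{k^{2}} \left\langle \frac{E^{2}}{u_E^{*2}} \right\rangle
  = \frac{1}{k^{2}}
  \quad\Longrightarrow\quad
  k = \frac{1}{\sqrt{\langle Z^2 \rangle}}
    = \frac{1}{\sqrt{0.5}} \simeq 1.41 .
\]
```

Under these assumptions, a local value $\langle Z^2 \rangle \simeq 0.5$ corresponds to uncertainties that are about 40 % too large.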

Figure 3. QM9 validation dataset. Consistency and adaptivity validation plots based on 100 equal-size bins: LZM and LZMS analyses and ACF of LZMS vs. $u_E$ (a, d, g), molecular mass $X_1$ (b, e, h) and fraction of heteroatoms $X_2$ (c, f, i). For the LZM and LZMS analyses (a-f), the red symbols depict confidence intervals that do not contain the target statistic (0.0 for $\langle Z \rangle$; 1.0 for $\langle Z^2 \rangle$), and the mean statistic (for the whole dataset) is reported in the right margin, with the same color code as for the local statistics. The corresponding $f_v$ statistics are reported in Fig. 4.

Figure 4. Fraction of validated bins for $\langle Z \rangle$ (left) and $\langle Z^2 \rangle$ (right) according to three conditioning variables (a-c). The error bars depict 95 % confidence intervals. The fractions should ideally be compatible with the 0.95 target (horizontal dashed line). The "Nominal" values (black circles) result from equal-size binning of the dataset into 100 bins, and the error bars are estimated from a binomial distribution. They summarize the LZM and LZMS analyses reported in Fig. 3. The "Random order" values (red squares) display the mean and 95 % confidence interval for a random ordering of the dataset (based on 1000 permutations). The "Random + Binomial" values (orange diamonds) combine the binomial uncertainty with the previous values. The "Stratified" values (green triangles)

Figure 5. QM9 validation dataset. LZM and LZMS analyses vs. $u_E$ (a, d), molecular mass (b, e) and fraction of heteroatoms (c, f). The data have been aggregated to get a minimum of 100 points per stratum. The red symbols depict confidence intervals that do not contain the target statistic (0.0 for $\langle Z \rangle$; 1.0 for $\langle Z^2 \rangle$). The mean statistic (over the whole dataset) is reported in the right margin, with the same color code as for the local statistics. The corresponding $f_v$ statistics are reported in Fig. 4.

Table I. Main acronyms and notations used in this study.