stats¶
Statistical analysis and STRATOS-compliant metrics.
Overview¶
This module provides:
- Calibration metrics (slope, intercept, O:E ratio)
- Clinical utility (Net Benefit, DCA)
- Uncertainty quantification
Calibration Metrics¶
calibration_metrics
¶
Calibration metrics for classification models.
Provides ECE (Expected Calibration Error), Brier score, and calibration curves.
get_calibration_curve
¶
Compute calibration curve using sklearn.
| PARAMETER | DESCRIPTION |
|---|---|
| `model` | Classifier model (unused, kept for API compatibility). |
| `y_true` | True binary labels. |
| `preds` | Dictionary with 'y_pred_proba' key. |
| `n_bins` | Number of bins for the calibration curve. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with 'prob_true' and 'prob_pred' arrays. |
Source code in src/stats/calibration_metrics.py
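Since this is a thin wrapper over sklearn's `calibration_curve`, the documented input/output contract can be sketched as follows (the data and the `preds` dict are illustrative assumptions, not values from this module):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Hypothetical inputs mirroring the documented contract
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
preds = {"y_pred_proba": np.array([0.1, 0.3, 0.8, 0.9, 0.2, 0.7, 0.6, 0.4])}

# prob_true: observed positive fraction per bin; prob_pred: mean predicted probability per bin
prob_true, prob_pred = calibration_curve(y_true, preds["y_pred_proba"], n_bins=4)
result = {"prob_true": prob_true, "prob_pred": prob_pred}
```

A perfectly calibrated model has `prob_true ≈ prob_pred` in every bin.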
get_calibration_metrics
¶
TODO: add calibration metrics here. ECE? (AU)RC? Brier score?
- Alasalmi et al. 2020, "Better Classifier Calibration for Small Datasets", https://doi.org/10.1145/3385656 (citations: https://scholar.google.co.uk/scholar?cites=12369575194770427495&as_sdt=2005&sciodt=0,5&hl=en)
- Nixon et al. 2019, "Measuring Calibration in Deep Learning", https://openreview.net/forum?id=r1la7krKPS (citations: https://scholar.google.fi/scholar?cites=671990448700625194&as_sdt=2005&sciodt=0,5&hl=en)
- Are XGBoost probabilities well-calibrated? "No, they are not well-calibrated." (https://stats.stackexchange.com/a/617182/294507): "Do note that 'badly' calibrated probabilities are not synonymous with a useless model but I would urge one doing an extra calibration step (i.e. Platt scaling, isotonic regression or beta calibration) if using the raw probabilities is of importance."
- On the other hand, https://stats.stackexchange.com/a/619981/294507: "XGBoost is well calibrated providing you optimise for log_loss (as objective and in hyperparameter search)."
Source code in src/stats/calibration_metrics.py
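One candidate metric mentioned above is ECE. A minimal binned-ECE sketch for binary problems, using the class-1 probability as the confidence score (an illustrative re-implementation, not this module's code):

```python
import numpy as np

def expected_calibration_error(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10) -> float:
    """Binned ECE: coverage-weighted mean |accuracy - confidence| over probability bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # final bin is closed on the right so that y_prob == 1.0 is counted
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.any():
            conf = y_prob[mask].mean()   # mean confidence in the bin
            acc = y_true[mask].mean()    # observed positive fraction in the bin
            ece += mask.mean() * abs(acc - conf)
    return ece
```

Note this is the "confidence = p(class 1)" variant; the max-probability variant common in multi-class work differs slightly.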
Classifier Calibration¶
classifier_calibration
¶
Post-training classifier calibration methods.
Provides wrappers for sklearn calibration methods (isotonic, Platt scaling).
isotonic_calibration
¶
Apply isotonic calibration to a pre-fitted classifier.
| PARAMETER | DESCRIPTION |
|---|---|
| `i` | Bootstrap iteration index (for logging on first iter). |
| `model` | Pre-fitted sklearn-compatible classifier. |
| `cls_model_cfg` | Classifier configuration. |
| `dict_arrays_iter` | Dictionary with 'x_val' and 'y_val' for calibration. |

| RETURNS | DESCRIPTION |
|---|---|
| `object` | Calibrated classifier wrapper. |
Source code in src/stats/classifier_calibration.py
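The isotonic step on held-out validation data can be sketched directly with sklearn's `IsotonicRegression` (a minimal stand-in for this wrapper; the synthetic data and `LogisticRegression` base model are assumptions):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(200, 3)), rng.integers(0, 2, size=200)
x_val, y_val = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)

base = LogisticRegression().fit(x_train, y_train)  # stands in for the pre-fitted classifier
raw = base.predict_proba(x_val)[:, 1]              # uncalibrated class-1 probabilities

# Monotone, non-parametric map from raw scores to calibrated probabilities,
# fitted on the held-out validation split (as 'x_val'/'y_val' above)
iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
iso.fit(raw, y_val)
calibrated = iso.predict(raw)
```

sklearn's `CalibratedClassifierCV` wraps the same idea with cross-validation handling; the direct form above makes the fit-on-validation-only detail explicit.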
bootstrap_calibrate_classifier
¶
bootstrap_calibrate_classifier(
i: int,
model,
cls_model_cfg: DictConfig,
dict_arrays_iter: dict,
weights_dict: dict,
)
References:
- https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html
- https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration.html#sphx-glr-auto-examples-calibration-plot-calibration-py
- https://www.kaggle.com/code/banddaniel/rain-pred-catboost-conformal-prediction-f1-0-84?scriptVersionId=147866075&cellId=33

A separate question is whether the submodels (bootstrap iterations) need to be calibrated individually, and whether doing so leads to good ensembled performance. See Wu and Gales (2020), "Should Ensemble Members Be Calibrated?", https://openreview.net/forum?id=wTWLfuDkvKp (citations: https://scholar.google.co.uk/scholar?cites=4462772606110879200&as_sdt=2005&sciodt=0,5&hl=en)
Source code in src/stats/classifier_calibration.py
Uncertainty Quantification¶
uncertainty_quantification
¶
Uncertainty quantification metrics for classification models.
Provides AURC (Area Under Risk-Coverage curve), entropy, mutual information, and related uncertainty estimation metrics for bootstrap-based classifiers.
Cross-references:
- planning/statistics-implementation.md (Section 3.3)
- appendix-literature-review/section-13-calibration.tex

References:
- Ding et al. (2020) "Revisiting Uncertainty Estimation"
- Nado et al. (2022) "Uncertainty Baselines" (arXiv:2106.04015)
sec_classification
¶
Compute the AURC (Area Under the Risk-Coverage curve).

| PARAMETER | DESCRIPTION |
|---|---|
| `y_true` | True labels, vector of size n_test. |
| `y_pred` | Labels predicted by the classifier, vector of size n_test. |
| `conf` | Confidence associated with y_pred, vector of size n_test. |

| RETURNS | DESCRIPTION |
|---|---|
| `conf` | Confidences sorted in decreasing order. |
| `risk_cov` | Risk vs. coverage (coverage increasing from 0 to 1). |
| `aurc` | AURC. |
| `eaurc` | Excess AURC. |
Source code in src/stats/uncertainty_quantification.py
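The risk-coverage construction can be sketched as follows: sort subjects by decreasing confidence, track the cumulative error rate (risk) as coverage grows, and average it. This is an illustrative simplification, not the module's exact implementation:

```python
import numpy as np

def aurc(y_true: np.ndarray, y_pred: np.ndarray, conf: np.ndarray) -> float:
    """Area under the risk-coverage curve (lower is better)."""
    order = np.argsort(-conf)                                  # most confident first
    errors = (y_true[order] != y_pred[order]).astype(float)
    # risk at coverage k/n = error rate among the k most confident predictions
    risk = np.cumsum(errors) / np.arange(1, len(errors) + 1)
    return float(risk.mean())
```

A model whose confidence ranks errors last gets a low AURC; a model whose confidence is uninformative gets AURC close to its overall error rate.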
risk_coverage
¶
| PARAMETER | DESCRIPTION |
|---|---|
| `p_mean` | np.ndarray, shape (n_subjects,). |
| `p_std` | np.ndarray, shape (n_subjects,). |
| `preds` | np.ndarray, shape (n_subjects, n_iters in bootstrap). |
| `y_true` | np.ndarray, shape (n_subjects,). |
Source code in src/stats/uncertainty_quantification.py
get_sample_mean_and_std
¶
Compute mean and standard deviation across bootstrap iterations.
| PARAMETER | DESCRIPTION |
|---|---|
| `preds` | Predictions array of shape (n_subjects, n_bootstrap_iters). |

| RETURNS | DESCRIPTION |
|---|---|
| `ndarray` | Mean predictions per subject. |
| `ndarray` | Standard deviation per subject. |
| `ndarray` | Binary predictions based on 0.5 threshold. |
Source code in src/stats/uncertainty_quantification.py
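The documented behaviour reduces to per-row numpy reductions over the bootstrap axis; a minimal sketch (`sample_mean_and_std` is an illustrative name, not necessarily the module's signature):

```python
import numpy as np

def sample_mean_and_std(preds: np.ndarray):
    """preds: (n_subjects, n_bootstrap_iters) class-1 probabilities."""
    p_mean = preds.mean(axis=1)           # mean prediction per subject
    p_std = preds.std(axis=1)             # bootstrap spread per subject
    y_hat = (p_mean > 0.5).astype(int)    # threshold the ensemble mean at 0.5
    return p_mean, p_std, y_hat
```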
risk_coverage_wrapper
¶
Compute risk-coverage metrics from raw bootstrap predictions.
| PARAMETER | DESCRIPTION |
|---|---|
| `preds` | Bootstrap predictions, shape (n_subjects, n_bootstrap_iters). |
| `y_true` | True labels, shape (n_subjects,). |

| RETURNS | DESCRIPTION |
|---|---|
| `ndarray` | Coverage levels (0 to 1). |
| `ndarray` | Risk at each coverage level. |
| `float` | AURC value. |
| `float` | Excess AURC (vs optimal). |
Source code in src/stats/uncertainty_quantification.py
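One common reading of "Excess AURC (vs optimal)" is the model's AURC minus the AURC of an oracle ranking in which every misclassified subject comes last. A sketch of that baseline, assuming this definition (`oracle_aurc` is a hypothetical name, not this module's function):

```python
import numpy as np

def oracle_aurc(errors: np.ndarray) -> float:
    """AURC of an oracle ranking: all correct predictions precede all errors."""
    n, k = len(errors), int(errors.sum())
    ranks = np.arange(1, n + 1)
    # with an oracle ordering, errors only start accumulating after the n-k correct cases
    risk = np.maximum(0, ranks - (n - k)) / ranks
    return float(risk.mean())

# Excess AURC = model AURC - oracle_aurc(errors)
```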
uncertainty_wrapper_from_subject_codes
¶
uncertainty_wrapper_from_subject_codes(
p_mean: ndarray,
p_std: ndarray,
y_true: ndarray,
split: str,
)
Compute uncertainty metrics from pre-computed mean and std.
Used when bootstrap arrays are not equal-sized (train/val splits).
| PARAMETER | DESCRIPTION |
|---|---|
| `p_mean` | Mean predictions per subject. |
| `p_std` | Standard deviation per subject. |
| `y_true` | True labels. |
| `split` | Split name (for logging). |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with 'scalars' (AURC, AURC_E) and 'arrays' (coverage, risk). |
Source code in src/stats/uncertainty_quantification.py
get_uncertainties
¶
Epistemic uncertainty: standard deviation of the Monte Carlo sample of estimated values. Aleatoric uncertainty: square root of the mean of the Monte Carlo sample of variance estimates.
- https://shrmtmt.medium.com/beyond-average-predictions-embracing-variability-with-heteroscedastic-loss-in-deep-learning-f098244cad6f
- https://stackoverflow.com/a/63397197/6412152

Note that there are multiple ways to estimate epistemic and aleatoric uncertainty, and it is a separate question whether the bootstrap output qualifies as a source of either.
Source code in src/stats/uncertainty_quantification.py
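Under the definitions above, the computation reduces to two numpy reductions; a minimal numeric sketch (the arrays are illustrative, not real model output):

```python
import numpy as np

# Hypothetical MC sample: per-iteration point estimates and variance estimates
mc_means = np.array([[0.60, 0.62, 0.58, 0.61]])   # (n_subjects, n_iters)
mc_vars = np.array([[0.04, 0.05, 0.04, 0.03]])    # (n_subjects, n_iters)

epistemic = mc_means.std(axis=1)           # spread of the point estimates across iterations
aleatoric = np.sqrt(mc_vars.mean(axis=1))  # sqrt of the mean variance estimate
```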
uncertainty_metrics
¶
Compute all uncertainty metrics for bootstrap predictions.
| PARAMETER | DESCRIPTION |
|---|---|
| `preds` | Bootstrap predictions, shape (n_subjects, n_bootstrap_iters). |
| `y_true` | True labels, shape (n_subjects,). |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with 'scalars' (AURC, MI, entropy) and 'arrays'. |
Source code in src/stats/uncertainty_quantification.py
uncertainty_wrapper
¶
uncertainty_wrapper(
preds: ndarray,
y_true: ndarray,
key: str,
split: str,
return_placeholder: bool = False,
)
See predict_and_decompose_uncertainty_tf() in uncertainty_baselines (https://github.com/google/uncertainty-baselines); this implementation is copied straight from there (Nado et al. 2022, https://arxiv.org/abs/2106.04015). It yields epistemic uncertainty (mutual information) and aleatoric uncertainty (expected entropy).

See also:
- https://torch-uncertainty.github.io/api.html#diversity
- https://github.com/kyle-dorman/bayesian-neural-network-blogpost
- https://github.com/yizhanyang/Uncertainty-Estimation-BNN/blob/master/main.py
- https://github.com/yaringal/ConcreteDropout/blob/master/concrete-dropout-pytorch.ipynb
- https://github.com/rutgervandeleur/uncertainty/tree/master
- https://github.com/Kyushik/Predictive-Uncertainty-Estimation-using-Deep-Ensemble/blob/master/Ensemble_Regression_ToyData_Torch.ipynb

```python
means = torch.stack([tup[0] for tup in MC_samples]).view(K_test, X_val.shape[0]).cpu().data.numpy()
logvar = torch.stack([tup[1] for tup in MC_samples]).view(K_test, X_val.shape[0]).cpu().data.numpy()
epistemic_uncertainty = np.var(means, 0).mean(0)
logvar = np.mean(logvar, 0)
aleatoric_uncertainty = np.exp(logvar).mean(0)
```

See also "Uncertainty in Gradient Boosting via Ensembles", https://arxiv.org/abs/2006.10562:
- https://github.com/yandex-research/GBDT-uncertainty
- https://github.com/yandex-research/GBDT-uncertainty/blob/main/aggregate_results_classification.py
- CatBoost uncertainty tutorials: https://towardsdatascience.com/tutorial-uncertainty-estimation-with-catboost-255805ff217e and https://github.com/catboost/catboost/blob/master/catboost/tutorials/uncertainty/uncertainty_regression.ipynb

Conformal prediction with the classifier could also be worth exploring: https://github.com/PacktPublishing/Practical-Guide-to-Applied-Conformal-Prediction/blob/main/Chapter_05_TCP.ipynb
| PARAMETER | DESCRIPTION |
|---|---|
| `preds` | np.ndarray, shape (n_subjects, n_iters in bootstrap); class 1 probability (e.g. glaucoma probability). |
| `key` | str, key for the uncertainty. |
| `return_placeholder` | bool, return placeholder if True. |
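The MI / expected-entropy decomposition referenced above (Nado et al. 2022) can be sketched for binary bootstrap probabilities as follows (a simplified re-implementation for illustration, not the copied code):

```python
import numpy as np

def decompose_uncertainty(preds: np.ndarray, eps: float = 1e-12):
    """preds: (n_subjects, n_iters) class-1 probabilities from bootstrap members.
    Total entropy = entropy of the mean prediction; aleatoric = expected member
    entropy; epistemic = mutual information = total - aleatoric."""
    def binary_entropy(p):
        p = np.clip(p, eps, 1 - eps)
        return -(p * np.log(p) + (1 - p) * np.log(1 - p))

    p_mean = preds.mean(axis=1)
    total = binary_entropy(p_mean)                  # predictive entropy of the ensemble
    aleatoric = binary_entropy(preds).mean(axis=1)  # expected entropy across members
    epistemic = total - aleatoric                   # mutual information (disagreement)
    return total, aleatoric, epistemic
```

When all bootstrap members agree, the mutual information is zero; disagreement between members shows up as epistemic uncertainty.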