stats¶
Statistical analysis and STRATOS-compliant metrics.
Overview¶
This module provides:
- Calibration metrics (slope, intercept, O:E ratio)
- Clinical utility (Net Benefit, DCA)
- Uncertainty quantification
Calibration Metrics¶
calibration_metrics
¶
Calibration metrics for classification models.
Provides ECE (Expected Calibration Error), Brier score, and calibration curves.
get_calibration_curve
¶
Compute calibration curve using sklearn.
| PARAMETER | DESCRIPTION |
|---|---|
| `model` | Classifier model (unused, kept for API compatibility). |
| `y_true` | True binary labels. |
| `preds` | Dictionary with 'y_pred_proba' key. |
| `n_bins` | Number of bins for the calibration curve. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with 'prob_true' and 'prob_pred' arrays. |
Source code in src/stats/calibration_metrics.py
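Since this is a thin wrapper over sklearn's `calibration_curve`, the documented input/output contract can be sketched as follows (the data and the `preds` dict are illustrative assumptions, not values from this module):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Hypothetical inputs mirroring the documented contract
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
preds = {"y_pred_proba": np.array([0.1, 0.3, 0.8, 0.9, 0.2, 0.7, 0.6, 0.4])}

# prob_true: observed positive fraction per bin; prob_pred: mean predicted probability per bin
prob_true, prob_pred = calibration_curve(y_true, preds["y_pred_proba"], n_bins=4)
result = {"prob_true": prob_true, "prob_pred": prob_pred}
```

A perfectly calibrated model has `prob_true ≈ prob_pred` in every bin.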
get_calibration_metrics
¶
TODO: add calibration metrics here. ECE? (AU)RC? Brier score?
- Alasalmi et al. 2020, "Better Classifier Calibration for Small Datasets", https://doi.org/10.1145/3385656 (citations: https://scholar.google.co.uk/scholar?cites=12369575194770427495&as_sdt=2005&sciodt=0,5&hl=en)
- Nixon et al. 2019, "Measuring Calibration in Deep Learning", https://openreview.net/forum?id=r1la7krKPS (citations: https://scholar.google.fi/scholar?cites=671990448700625194&as_sdt=2005&sciodt=0,5&hl=en)
- Are XGBoost probabilities well-calibrated? "No, they are not well-calibrated." (https://stats.stackexchange.com/a/617182/294507): "Do note that 'badly' calibrated probabilities are not synonymous with a useless model but I would urge one doing an extra calibration step (i.e. Platt scaling, isotonic regression or beta calibration) if using the raw probabilities is of importance."
- On the other hand, https://stats.stackexchange.com/a/619981/294507: "XGBoost is well calibrated providing you optimise for log_loss (as objective and in hyperparameter search)."
Source code in src/stats/calibration_metrics.py
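One candidate metric mentioned above is ECE. A minimal binned-ECE sketch for binary problems, using the class-1 probability as the confidence score (an illustrative re-implementation, not this module's code):

```python
import numpy as np

def expected_calibration_error(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10) -> float:
    """Binned ECE: coverage-weighted mean |accuracy - confidence| over probability bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # final bin is closed on the right so that y_prob == 1.0 is counted
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.any():
            conf = y_prob[mask].mean()   # mean confidence in the bin
            acc = y_true[mask].mean()    # observed positive fraction in the bin
            ece += mask.mean() * abs(acc - conf)
    return ece
```

Note this is the "confidence = p(class 1)" variant; the max-probability variant common in multi-class work differs slightly.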
Classifier Calibration¶
classifier_calibration
¶
Post-training classifier calibration methods.
Provides wrappers for sklearn calibration methods (isotonic, Platt scaling).
isotonic_calibration
¶
Apply isotonic calibration to a pre-fitted classifier.
| PARAMETER | DESCRIPTION |
|---|---|
| `i` | Bootstrap iteration index (for logging on first iter). |
| `model` | Pre-fitted sklearn-compatible classifier. |
| `cls_model_cfg` | Classifier configuration. |
| `dict_arrays_iter` | Dictionary with 'x_val' and 'y_val' for calibration. |

| RETURNS | DESCRIPTION |
|---|---|
| `object` | Calibrated classifier wrapper. |
Source code in src/stats/classifier_calibration.py
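The isotonic step on held-out validation data can be sketched directly with sklearn's `IsotonicRegression` (a minimal stand-in for this wrapper; the synthetic data and `LogisticRegression` base model are assumptions):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(200, 3)), rng.integers(0, 2, size=200)
x_val, y_val = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)

base = LogisticRegression().fit(x_train, y_train)  # stands in for the pre-fitted classifier
raw = base.predict_proba(x_val)[:, 1]              # uncalibrated class-1 probabilities

# Monotone, non-parametric map from raw scores to calibrated probabilities,
# fitted on the held-out validation split (as 'x_val'/'y_val' above)
iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
iso.fit(raw, y_val)
calibrated = iso.predict(raw)
```

sklearn's `CalibratedClassifierCV` wraps the same idea with cross-validation handling; the direct form above makes the fit-on-validation-only detail explicit.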
bootstrap_calibrate_classifier
¶
bootstrap_calibrate_classifier(
i: int,
model,
cls_model_cfg: DictConfig,
dict_arrays_iter: dict,
weights_dict: dict,
)
References:
- https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html
- https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration.html#sphx-glr-auto-examples-calibration-plot-calibration-py
- https://www.kaggle.com/code/banddaniel/rain-pred-catboost-conformal-prediction-f1-0-84?scriptVersionId=147866075&cellId=33

A separate question is whether the submodels (bootstrap iterations) need to be calibrated individually, and whether doing so leads to good ensembled performance. See Wu and Gales (2020), "Should Ensemble Members Be Calibrated?", https://openreview.net/forum?id=wTWLfuDkvKp (citations: https://scholar.google.co.uk/scholar?cites=4462772606110879200&as_sdt=2005&sciodt=0,5&hl=en)
Source code in src/stats/classifier_calibration.py
Uncertainty Quantification¶
uncertainty_quantification
¶
Uncertainty quantification metrics for classification models.
Provides AURC (Area Under Risk-Coverage curve), entropy, mutual information, and related uncertainty estimation metrics for bootstrap-based classifiers.
Cross-references:
- planning/statistics-implementation.md (Section 3.3)
- appendix-literature-review/section-13-calibration.tex

References:
- Ding et al. (2020) "Revisiting Uncertainty Estimation"
- Nado et al. (2022) "Uncertainty Baselines" (arXiv:2106.04015)
sec_classification
¶
Compute the AURC (Area Under the Risk-Coverage curve).

| PARAMETER | DESCRIPTION |
|---|---|
| `y_true` | True labels, vector of size n_test. |
| `y_pred` | Labels predicted by the classifier, vector of size n_test. |
| `conf` | Confidence associated with y_pred, vector of size n_test. |

| RETURNS | DESCRIPTION |
|---|---|
| `conf` | Confidences sorted in decreasing order. |
| `risk_cov` | Risk vs. coverage (coverage increasing from 0 to 1). |
| `aurc` | AURC. |
| `eaurc` | Excess AURC. |
Source code in src/stats/uncertainty_quantification.py
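The risk-coverage construction can be sketched as follows: sort subjects by decreasing confidence, track the cumulative error rate (risk) as coverage grows, and average it. This is an illustrative simplification, not the module's exact implementation:

```python
import numpy as np

def aurc(y_true: np.ndarray, y_pred: np.ndarray, conf: np.ndarray) -> float:
    """Area under the risk-coverage curve (lower is better)."""
    order = np.argsort(-conf)                                  # most confident first
    errors = (y_true[order] != y_pred[order]).astype(float)
    # risk at coverage k/n = error rate among the k most confident predictions
    risk = np.cumsum(errors) / np.arange(1, len(errors) + 1)
    return float(risk.mean())
```

A model whose confidence ranks errors last gets a low AURC; a model whose confidence is uninformative gets AURC close to its overall error rate.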
risk_coverage
¶
| PARAMETER | DESCRIPTION |
|---|---|
| `p_mean` | np.ndarray, shape (n_subjects,). |
| `p_std` | np.ndarray, shape (n_subjects,). |
| `preds` | np.ndarray, shape (n_subjects, n_iters in bootstrap). |
| `y_true` | np.ndarray, shape (n_subjects,). |
Source code in src/stats/uncertainty_quantification.py
get_sample_mean_and_std
¶
Compute mean and standard deviation across bootstrap iterations.
| PARAMETER | DESCRIPTION |
|---|---|
| `preds` | Predictions array of shape (n_subjects, n_bootstrap_iters). |

| RETURNS | DESCRIPTION |
|---|---|
| `ndarray` | Mean predictions per subject. |
| `ndarray` | Standard deviation per subject. |
| `ndarray` | Binary predictions based on 0.5 threshold. |
Source code in src/stats/uncertainty_quantification.py
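The documented behaviour reduces to per-row numpy reductions over the bootstrap axis; a minimal sketch (`sample_mean_and_std` is an illustrative name, not necessarily the module's signature):

```python
import numpy as np

def sample_mean_and_std(preds: np.ndarray):
    """preds: (n_subjects, n_bootstrap_iters) class-1 probabilities."""
    p_mean = preds.mean(axis=1)           # mean prediction per subject
    p_std = preds.std(axis=1)             # bootstrap spread per subject
    y_hat = (p_mean > 0.5).astype(int)    # threshold the ensemble mean at 0.5
    return p_mean, p_std, y_hat
```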
risk_coverage_wrapper
¶
Compute risk-coverage metrics from raw bootstrap predictions.
| PARAMETER | DESCRIPTION |
|---|---|
| `preds` | Bootstrap predictions, shape (n_subjects, n_bootstrap_iters). |
| `y_true` | True labels, shape (n_subjects,). |

| RETURNS | DESCRIPTION |
|---|---|
| `ndarray` | Coverage levels (0 to 1). |
| `ndarray` | Risk at each coverage level. |
| `float` | AURC value. |
| `float` | Excess AURC (vs optimal). |
Source code in src/stats/uncertainty_quantification.py
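One common reading of "Excess AURC (vs optimal)" is the model's AURC minus the AURC of an oracle ranking in which every misclassified subject comes last. A sketch of that baseline, assuming this definition (`oracle_aurc` is a hypothetical name, not this module's function):

```python
import numpy as np

def oracle_aurc(errors: np.ndarray) -> float:
    """AURC of an oracle ranking: all correct predictions precede all errors."""
    n, k = len(errors), int(errors.sum())
    ranks = np.arange(1, n + 1)
    # with an oracle ordering, errors only start accumulating after the n-k correct cases
    risk = np.maximum(0, ranks - (n - k)) / ranks
    return float(risk.mean())

# Excess AURC = model AURC - oracle_aurc(errors)
```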
uncertainty_wrapper_from_subject_codes
¶
uncertainty_wrapper_from_subject_codes(
p_mean: ndarray,
p_std: ndarray,
y_true: ndarray,
split: str,
)
Compute uncertainty metrics from pre-computed mean and std.
Used when bootstrap arrays are not equal-sized (train/val splits).
| PARAMETER | DESCRIPTION |
|---|---|
| `p_mean` | Mean predictions per subject. |
| `p_std` | Standard deviation per subject. |
| `y_true` | True labels. |
| `split` | Split name (for logging). |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with 'scalars' (AURC, AURC_E) and 'arrays' (coverage, risk). |
Source code in src/stats/uncertainty_quantification.py
get_uncertainties
¶
Epistemic uncertainty: standard deviation of the Monte Carlo sample of estimated values. Aleatoric uncertainty: square root of the mean of the Monte Carlo sample of variance estimates.
- https://shrmtmt.medium.com/beyond-average-predictions-embracing-variability-with-heteroscedastic-loss-in-deep-learning-f098244cad6f
- https://stackoverflow.com/a/63397197/6412152

Note that there are multiple ways to estimate epistemic and aleatoric uncertainty, and it is a separate question whether the bootstrap output qualifies as a source of either.
Source code in src/stats/uncertainty_quantification.py
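Under the definitions above, the computation reduces to two numpy reductions; a minimal numeric sketch (the arrays are illustrative, not real model output):

```python
import numpy as np

# Hypothetical MC sample: per-iteration point estimates and variance estimates
mc_means = np.array([[0.60, 0.62, 0.58, 0.61]])   # (n_subjects, n_iters)
mc_vars = np.array([[0.04, 0.05, 0.04, 0.03]])    # (n_subjects, n_iters)

epistemic = mc_means.std(axis=1)           # spread of the point estimates across iterations
aleatoric = np.sqrt(mc_vars.mean(axis=1))  # sqrt of the mean variance estimate
```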
uncertainty_metrics
¶
Compute all uncertainty metrics for bootstrap predictions.
| PARAMETER | DESCRIPTION |
|---|---|
| `preds` | Bootstrap predictions, shape (n_subjects, n_bootstrap_iters). |
| `y_true` | True labels, shape (n_subjects,). |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with 'scalars' (AURC, MI, entropy) and 'arrays'. |
Source code in src/stats/uncertainty_quantification.py
uncertainty_wrapper
¶
uncertainty_wrapper(
preds: ndarray,
y_true: ndarray,
key: str,
split: str,
return_placeholder: bool = False,
)
See predict_and_decompose_uncertainty_tf() in uncertainty_baselines (https://github.com/google/uncertainty-baselines); this implementation is copied straight from there (Nado et al. 2022, https://arxiv.org/abs/2106.04015). It yields epistemic uncertainty (mutual information) and aleatoric uncertainty (expected entropy).

See also:
- https://torch-uncertainty.github.io/api.html#diversity
- https://github.com/kyle-dorman/bayesian-neural-network-blogpost
- https://github.com/yizhanyang/Uncertainty-Estimation-BNN/blob/master/main.py
- https://github.com/yaringal/ConcreteDropout/blob/master/concrete-dropout-pytorch.ipynb
- https://github.com/rutgervandeleur/uncertainty/tree/master
- https://github.com/Kyushik/Predictive-Uncertainty-Estimation-using-Deep-Ensemble/blob/master/Ensemble_Regression_ToyData_Torch.ipynb

```python
means = torch.stack([tup[0] for tup in MC_samples]).view(K_test, X_val.shape[0]).cpu().data.numpy()
logvar = torch.stack([tup[1] for tup in MC_samples]).view(K_test, X_val.shape[0]).cpu().data.numpy()
epistemic_uncertainty = np.var(means, 0).mean(0)
logvar = np.mean(logvar, 0)
aleatoric_uncertainty = np.exp(logvar).mean(0)
```

See also "Uncertainty in Gradient Boosting via Ensembles", https://arxiv.org/abs/2006.10562:
- https://github.com/yandex-research/GBDT-uncertainty
- https://github.com/yandex-research/GBDT-uncertainty/blob/main/aggregate_results_classification.py
- CatBoost uncertainty tutorials: https://towardsdatascience.com/tutorial-uncertainty-estimation-with-catboost-255805ff217e and https://github.com/catboost/catboost/blob/master/catboost/tutorials/uncertainty/uncertainty_regression.ipynb

Conformal prediction with the classifier could also be worth exploring: https://github.com/PacktPublishing/Practical-Guide-to-Applied-Conformal-Prediction/blob/main/Chapter_05_TCP.ipynb
| PARAMETER | DESCRIPTION |
|---|---|
| `preds` | np.ndarray, shape (n_subjects, n_iters in bootstrap); class 1 probability (e.g. glaucoma probability). |
| `key` | str, key for the uncertainty. |
| `return_placeholder` | bool, return placeholder if True. |
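The MI / expected-entropy decomposition referenced above (Nado et al. 2022) can be sketched for binary bootstrap probabilities as follows (a simplified re-implementation for illustration, not the copied code):

```python
import numpy as np

def decompose_uncertainty(preds: np.ndarray, eps: float = 1e-12):
    """preds: (n_subjects, n_iters) class-1 probabilities from bootstrap members.
    Total entropy = entropy of the mean prediction; aleatoric = expected member
    entropy; epistemic = mutual information = total - aleatoric."""
    def binary_entropy(p):
        p = np.clip(p, eps, 1 - eps)
        return -(p * np.log(p) + (1 - p) * np.log(1 - p))

    p_mean = preds.mean(axis=1)
    total = binary_entropy(p_mean)                  # predictive entropy of the ensemble
    aleatoric = binary_entropy(preds).mean(axis=1)  # expected entropy across members
    epistemic = total - aleatoric                   # mutual information (disagreement)
    return total, aleatoric, epistemic
```

When all bootstrap members agree, the mutual information is zero; disagreement between members shows up as epistemic uncertainty.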