STRATOS Metrics

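What is STRATOS?

STRATOS (STRengthening Analytical Thinking for Observational Studies) is an international initiative that develops guidance on the design and analysis of observational studies. Its Topic Group 6, which covers diagnostic tests and prediction models, has published guidance on evaluating predictive AI models intended to support medical decisions.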

Reference: Van Calster B, Collins GS, Vickers AJ, et al. "Performance evaluation of predictive AI models to support medical decisions: Overview and guidance." STRATOS Initiative Topic Group 6.

The STRATOS Core Set

These measures MUST be reported for clinical prediction models:

1. Discrimination

Question: Does the model assign higher predicted risks to patients who have the outcome than to patients who do not?

Metric	Description	Interpretation
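AUROC	Area under the ROC curve; equals the probability that a randomly chosen patient with the outcome has a higher predicted risk than a randomly chosen patient without it	0.5 = no better than chance, 1.0 = perfect ranking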

Note

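AUROC is semi-proper: it depends only on how the predictions rank patients, so badly miscalibrated risks can have the same AUROC as accurate ones. It says nothing about whether the predicted probabilities themselves are correct.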
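To make the semi-proper point concrete, here is a minimal pure-Python sketch that computes AUROC as a concordance probability over (event, non-event) pairs. The function name `auroc` and the toy data are illustrative assumptions, not this project's API. Dividing every prediction by 10 keeps the ranking but makes every risk far too low, and AUROC does not move.

```python
from itertools import product


def auroc(y, p):
    """AUROC as a concordance probability.

    Fraction of (event, non-event) pairs in which the event (y == 1) has the
    higher predicted risk; ties count as one half. O(n_events * n_non_events),
    which is fine for illustration but not for large datasets.
    """
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    if not events or not non_events:
        raise ValueError("AUROC needs at least one event and one non-event")
    score = 0.0
    for pe, pn in product(events, non_events):
        if pe > pn:
            score += 1.0
        elif pe == pn:
            score += 0.5
    return score / (len(events) * len(non_events))


y = [0, 0, 1, 1, 0, 1]
p = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
print(round(auroc(y, p), 4))  # 0.8889: 8 of 9 pairs ranked correctly

# Dividing by 10 preserves the ranking but ruins calibration.
p_shrunk = [pi / 10 for pi in p]
print(round(auroc(y, p_shrunk), 4))  # 0.8889 again
```

In practice a library routine (for example scikit-learn's `roc_auc_score`) would be used; the pairwise loop is only there to make the ranking interpretation visible.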

2. Calibration

Question: Do predicted probabilities match observed frequencies?

Metric	Description	Target
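Calibration plot	Observed event rate against predicted risk (e.g. a smoothed calibration curve)	Curve close to the diagonal
Calibration slope	Spread of the predictions (weak calibration); < 1 means predictions are too extreme, > 1 too modest	Close to 1.0
Calibration intercept	Calibration-in-the-large (mean calibration); < 0 means risks are overestimated on average	Close to 0.0
O:E ratio	Observed events / expected events (sum of predicted risks); > 1 means risks are underestimated	Close to 1.0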

3. Overall Performance

Question: How close are the predicted risks to the observed outcomes overall, reflecting both discrimination and calibration?

Metric	Description	Notes
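Brier score	Mean squared difference between predicted risk and observed outcome (0/1)	Lower is better; 0 = perfect
Scaled Brier (IPA)	Index of Prediction Accuracy: 1 - Brier / Brier of a model that predicts the prevalence for everyone	At most 1 (perfect); 0 = no better than predicting the prevalence; negative = worse; higher is better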
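A minimal sketch of both measures, assuming binary outcomes and that both outcome classes are present (otherwise the null Brier score is 0 and IPA is undefined). Function names and data are illustrative only. The second example shows why IPA is not a 0-1 scale: a model worse than predicting the prevalence gets a negative value.

```python
def brier(y, p):
    # Mean squared difference between predicted risk and the 0/1 outcome.
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def scaled_brier(y, p):
    # IPA = 1 - Brier(model) / Brier(null), where the null model predicts the
    # observed prevalence for every patient. Undefined if all outcomes match.
    prevalence = sum(y) / len(y)
    brier_null = brier(y, [prevalence] * len(y))
    return 1.0 - brier(y, p) / brier_null


y = [0, 0, 1, 1]
good = [0.1, 0.3, 0.6, 0.9]
bad = [0.9, 0.7, 0.4, 0.1]
print(brier(y, good), scaled_brier(y, good))  # about 0.0675 and 0.73
print(brier(y, bad), scaled_brier(y, bad))    # about 0.6175 and -1.47
```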

4. Clinical Utility

Question: Does using the model to guide treatment decisions do more good than harm, compared with treating everyone or no one?

Metric	Description	Notes
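Net Benefit	True positives minus false positives weighted by the threshold odds: NB = TP/n - FP/n × pt/(1 - pt)	Reported at clinically relevant risk threshold(s) pt
DCA curves	Decision Curve Analysis: net benefit plotted across a range of thresholds	Compared with treat-all and treat-none strategies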
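The net benefit formula and a decision curve can be sketched as follows. This is a minimal illustration with hypothetical function names and toy data; it assumes patients are treated when their predicted risk is at least the threshold (whether `>=` or `>` is used is a convention choice). The example thresholds mirror the 5%, 10% and 20% keys used in the metrics dictionary further down this page.

```python
def net_benefit(y, p, threshold):
    # Treat everyone whose predicted risk is at least the threshold, then
    # NB = TP/n - FP/n * threshold / (1 - threshold).
    n = len(y)
    treated = [yi for yi, pi in zip(y, p) if pi >= threshold]
    tp = sum(treated)
    fp = len(treated) - tp
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(y, threshold):
    # Reference strategy: treat everyone (treat-none always has NB = 0).
    prevalence = sum(y) / len(y)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


def decision_curve(y, p, thresholds):
    # Rows of (threshold, model NB, treat-all NB, treat-none NB) for a DCA plot.
    return [(t, net_benefit(y, p, t), net_benefit_treat_all(y, t), 0.0)
            for t in thresholds]


y = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1]
p = [0.05, 0.15, 0.30, 0.60, 0.08, 0.25, 0.12, 0.02, 0.40, 0.55]
for row in decision_curve(y, p, [0.05, 0.10, 0.20]):
    print(row)
```

At the 20% threshold, for instance, the model treats 4 events and 1 non-event, giving NB = 0.4 - 0.1 × 0.25 = 0.375, against 0.25 for treat-all. A model is only clinically useful at a threshold where its curve lies above both reference strategies.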

What NOT to Use

STRATOS explicitly recommends against:

❌ Metric	Problem
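F1 score	Improper; ignores true negatives (TN)
AUPRC	Area under the precision-recall curve; ignores TN and has no clear clinical interpretation
pAUROC	Partial AUROC; no decision-analytic basis
Accuracy	Improper when the clinical decision threshold is not 0.5; weighs false positives and false negatives equally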
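Two of these objections can be checked with toy numbers. The sketch below uses hypothetical helper names and made-up counts. It shows that F1 is unchanged when only TN changes, and that at a 20% clinical threshold, expected accuracy rewards a forecast that understates a patient's true risk. Rewarding a forecast other than the true probability is exactly what "improper" means.

```python
def f1_score(tp, fp, fn, tn):
    # tn is accepted only to make the point explicit: F1 never uses it.
    return 2 * tp / (2 * tp + fp + fn)


def specificity(tp, fp, fn, tn):
    return tn / (tn + fp)


# Same TP, FP and FN; very different numbers of correctly identified non-events.
small_tn = dict(tp=40, fp=20, fn=10, tn=30)
large_tn = dict(tp=40, fp=20, fn=10, tn=9930)
print(f1_score(**small_tn), f1_score(**large_tn))        # identical F1
print(specificity(**small_tn), specificity(**large_tn))  # 0.6 vs about 0.998


def expected_accuracy(true_risk, reported_risk, threshold):
    # Classify as positive when reported_risk >= threshold. The expected
    # accuracy is then true_risk if positive and 1 - true_risk if negative.
    return true_risk if reported_risk >= threshold else 1.0 - true_risk


# A patient whose true risk is 30%, scored at a 20% clinical threshold:
honest = expected_accuracy(0.3, 0.3, threshold=0.2)       # 0.3
understated = expected_accuracy(0.3, 0.1, threshold=0.2)  # 0.7
print(honest < understated)  # True: misreporting the risk is rewarded
```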

Why This Matters

"Selecting appropriate performance measures is essential for predictive AI models that are developed to be used in medical practice, because poorly performing models may harm patients and lead to increased costs." — Van Calster et al. 2024

Implementation in Foundation PLR

All experiments automatically compute STRATOS metrics:

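```python
# Computed automatically by bootstrap_evaluation; values are shown as ...
metrics = {
    "auroc": ...,                  # discrimination
    "brier": ...,                  # overall performance
    "calibration_slope": ...,      # calibration
    "calibration_intercept": ...,
    "o_e_ratio": ...,
    "net_benefit_5pct": ...,       # clinical utility at 5%, 10% and 20% thresholds
    "net_benefit_10pct": ...,
    "net_benefit_20pct": ...,
}
```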

See the API Reference for implementation details.
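Since the metrics come from a function named `bootstrap_evaluation`, a generic percentile bootstrap may help readers see what resampling-based uncertainty looks like. This is a hypothetical sketch, not this project's `bootstrap_evaluation`, whose resampling scheme, number of replicates and outputs may differ; `bootstrap_ci`, the Brier metric used as the example, and the toy data are all assumptions.

```python
import random


def brier(y, p):
    # Example metric; any function of (y, p) can be plugged in below.
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def bootstrap_ci(y, p, metric, n_boot=1000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample patients with replacement, recompute the
    # metric on each resample, and read off the alpha/2 and 1 - alpha/2
    # empirical quantiles. Returns (point estimate, lower, upper).
    rng = random.Random(seed)
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y[i] for i in idx], [p[i] for i in idx]))
    stats.sort()
    lower = stats[int(n_boot * alpha / 2)]
    upper = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return metric(y, p), lower, upper


y = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1]
p = [0.05, 0.15, 0.30, 0.60, 0.08, 0.25, 0.12, 0.02, 0.40, 0.55]
print(bootstrap_ci(y, p, brier))
```

Metrics such as AUROC or the calibration slope can fail on resamples that contain only one outcome class, so a real implementation has to decide how to handle those replicates.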