STRATOS Metrics¶
What is STRATOS?¶
STRATOS (STRengthening Analytical Thinking for Observational Studies) is an initiative providing guidance for predictive AI model evaluation in medicine.
Reference: Van Calster B, Collins GS, Vickers AJ, et al. "Performance evaluation of predictive AI models to support medical decisions: Overview and guidance." STRATOS Initiative Topic Group 6.
The STRATOS Core Set¶
These measures MUST be reported for clinical prediction models:
1. Discrimination¶
Question: Does the model rank patients correctly?
| Metric | Description | Interpretation |
|---|---|---|
| AUROC | Area Under ROC Curve | 0.5 = random, 1.0 = perfect |
Note
AUROC is semi-proper—it measures ranking, not probability accuracy.
2. Calibration¶
Question: Do predicted probabilities match observed frequencies?
| Metric | Description | Target |
|---|---|---|
| Calibration plot | Visual check | Points on diagonal |
| Calibration slope | Weak calibration | Close to 1.0 |
| Calibration intercept | Mean calibration | Close to 0.0 |
| O:E ratio | Observed/Expected | Close to 1.0 |
3. Overall Performance¶
Question: Combined discrimination + calibration?
| Metric | Description | Notes |
|---|---|---|
| Brier score | Mean squared error | Lower is better |
| Scaled Brier (IPA) | Interpretable Brier | 0-1 scale, higher better |
4. Clinical Utility¶
Question: Is the model useful for decisions?
| Metric | Description | Notes |
|---|---|---|
| Net Benefit | Clinical value | At threshold(s) |
| DCA curves | Decision Curve Analysis | Across thresholds |
What NOT to Use¶
STRATOS explicitly recommends against:
| ❌ Metric | Problem |
|---|---|
| F1 score | Improper + ignores TN |
| AUPRC | Ignores TN, unclear interpretation |
| pAUROC | No decision-analytic basis |
| Accuracy | Improper for clinical thresholds ≠ 0.5 |
Why This Matters¶
"Selecting appropriate performance measures is essential for predictive AI models that are developed to be used in medical practice, because poorly performing models may harm patients and lead to increased costs." — Van Calster et al. 2024
Implementation in Foundation PLR¶
All experiments automatically compute STRATOS metrics:
# Automatic computation via bootstrap_evaluation
metrics = {
"auroc": ...,
"brier": ...,
"calibration_slope": ...,
"calibration_intercept": ...,
"o_e_ratio": ...,
"net_benefit_5pct": ...,
"net_benefit_10pct": ...,
"net_benefit_20pct": ...,
}
See the API Reference for implementation details.