Featurization¶

Stage 3 of the pipeline: extracting handcrafted features from pupillary light reflex (PLR) signals.

Feature Types¶

Amplitude Bins¶

Histogram-based features capturing the distribution of pupil sizes:

Baseline diameter
Constriction amplitude (absolute and relative)
Max constriction diameter
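
As a rough illustration of the amplitude features listed above, the sketch below computes them from a single diameter trace. It is not the pipeline's implementation: the function name, the dictionary keys, the `onset_idx` parameter, and the choice of the pre-stimulus mean as baseline are all assumptions made for this example.

```python
import numpy as np

def amplitude_features(pupil, onset_idx):
    """Hypothetical helper: amplitude features from a 1-D pupil-diameter trace.

    Assumes `pupil` is uniformly sampled and the light stimulus starts at
    sample index `onset_idx`.
    """
    pupil = np.asarray(pupil, dtype=float)
    # Baseline: mean diameter before light onset (assumed definition)
    baseline = float(np.mean(pupil[:onset_idx]))
    # Max constriction diameter: smallest diameter at or after light onset
    min_diameter = float(np.min(pupil[onset_idx:]))
    amplitude_abs = baseline - min_diameter
    # Relative amplitude: constriction as a fraction of baseline
    amplitude_rel = amplitude_abs / baseline
    return {
        "baseline_diameter": baseline,
        "constriction_amplitude_abs": amplitude_abs,
        "constriction_amplitude_rel": amplitude_rel,
        "max_constriction_diameter": min_diameter,
    }

# Synthetic trace (mm): four baseline samples, then constriction and recovery
trace = np.array([6.0, 6.0, 6.0, 6.0, 5.0, 4.0, 3.0, 3.5, 4.5, 5.5])
features = amplitude_features(trace, onset_idx=4)
```

For this synthetic trace the baseline is 6.0 mm and the minimum is 3.0 mm, so the relative amplitude is 0.5.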

Latency Features¶

Timing-based features:

Feature	Description
`latency_to_constriction`	Time to reach max constriction
`latency_75pct`	Time to reach 75% constriction
`time_to_redilation`	Recovery time
`constriction_duration`	Duration of constriction phase
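
The two latency features with unambiguous descriptions can be sketched as follows. This is a minimal example under stated assumptions, not the pipeline's code: the function name and the `onset_time` parameter are hypothetical, latencies are measured from light onset, and the baseline is taken as the pre-onset mean. `time_to_redilation` and `constriction_duration` are left out because their exact definitions are not given here.

```python
import numpy as np

def latency_features(t, pupil, onset_time=0.0):
    """Hypothetical helper: latency features measured from light onset."""
    t = np.asarray(t, dtype=float)
    pupil = np.asarray(pupil, dtype=float)
    pre = t < onset_time
    baseline = pupil[pre].mean()          # assumed baseline: pre-onset mean
    t_post, p_post = t[~pre], pupil[~pre]
    i_min = int(np.argmin(p_post))        # sample of maximum constriction
    min_diameter = p_post[i_min]
    # Diameter at which 75% of the full constriction amplitude is reached
    threshold = baseline - 0.75 * (baseline - min_diameter)
    # First post-onset sample at or below the threshold
    i_75 = int(np.argmax(p_post <= threshold))
    return {
        "latency_to_constriction": float(t_post[i_min] - onset_time),
        "latency_75pct": float(t_post[i_75] - onset_time),
    }

# Synthetic trace: time in seconds, diameter in mm, light onset at t = 0
t = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 1.5])
pupil = np.array([6.0, 6.0, 5.5, 3.0, 2.0, 3.5])
latencies = latency_features(t, pupil, onset_time=0.0)
```

Here the minimum is reached at 1.0 s, while the 75% threshold (3.0 mm) is first crossed at 0.5 s.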

Velocity Features¶

Feature	Description
`max_constriction_velocity`	Peak constriction speed
`mean_constriction_velocity`	Average constriction speed
`max_redilation_velocity`	Peak recovery speed
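
One plausible way to obtain these velocities is to differentiate the diameter trace numerically. The sketch below assumes a uniformly sampled trace, a known onset index, and a sampling rate `fs`; the function name and the split into constriction and redilation phases at the minimum diameter are assumptions for illustration only.

```python
import numpy as np

def velocity_features(pupil, onset_idx, fs):
    """Hypothetical helper: velocity features from a uniformly sampled trace.

    `fs` is the sampling rate in Hz; velocities are in diameter units per second.
    """
    pupil = np.asarray(pupil, dtype=float)
    # Numerical derivative (central differences in the interior)
    velocity = np.gradient(pupil, 1.0 / fs)
    # Index of maximum constriction at or after light onset
    i_min = onset_idx + int(np.argmin(pupil[onset_idx:]))
    # Constriction speed: sign flipped so that shrinking diameter is positive
    constriction = -velocity[onset_idx : i_min + 1]
    # Redilation speed: derivative from the minimum onward
    redilation = velocity[i_min:]
    return {
        "max_constriction_velocity": float(constriction.max()),
        "mean_constriction_velocity": float(constriction.mean()),
        "max_redilation_velocity": float(redilation.max()),
    }

# Synthetic trace (mm) sampled at 10 Hz, light onset at sample 1
trace = [6.0, 6.0, 5.0, 3.0, 2.0, 2.5, 3.5, 4.0]
vel = velocity_features(trace, onset_idx=1, fs=10.0)
```

Real recordings are usually noisy, so some smoothing before differentiation is likely needed; none is applied here.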

PIPR Features¶

Post-Illumination Pupil Response:

Feature	Description
`pipr_6s`	PIPR at 6 seconds
`pipr_10s`	PIPR at 10 seconds
`recovery_time`	Time to baseline recovery
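
A minimal sketch of the fixed-delay PIPR features, assuming PIPR is expressed as the residual constriction relative to baseline at a given delay after light offset. That normalization, the function name, and its parameters are assumptions for this example; the pipeline may define PIPR differently. `recovery_time` is omitted because it needs a recovery threshold that is not specified here.

```python
import numpy as np

def pipr(t, pupil, offset_time, baseline, delay_s):
    """Hypothetical helper: residual constriction `delay_s` seconds after light offset.

    Returns (baseline - diameter) / baseline, with the diameter linearly
    interpolated at `offset_time + delay_s`. `t` must be increasing.
    """
    diameter = np.interp(offset_time + delay_s, t, pupil)
    return float((baseline - diameter) / baseline)

# Synthetic trace: one sample per second, light offset at t = 1 s, baseline 6.0 mm
t = np.arange(12, dtype=float)
pupil = np.array([6.0, 3.0, 3.2, 3.5, 3.8, 4.0, 4.2, 4.5, 4.8, 5.0, 5.2, 5.4])
pipr_6s = pipr(t, pupil, offset_time=1.0, baseline=6.0, delay_s=6.0)
pipr_10s = pipr(t, pupil, offset_time=1.0, baseline=6.0, delay_s=10.0)
```

In this example the pupil is still 25% below baseline at 6 s and 10% below baseline at 10 s after offset.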

Why Handcrafted Features?¶

Key Finding

Handcrafted features outperform foundation model embeddings by 9 percentage points (0.830 vs 0.740 AUROC).

Foundation model embeddings were tested but underperformed because:

Generic embeddings don't capture domain-specific PLR physiology
Handcrafted features encode expert knowledge about glaucoma biomarkers
Small dataset (N=208) doesn't benefit from high-dimensional embeddings

Configuration¶

# Featurization is fixed (not configurable)
# Uses handcrafted features only

API Reference¶

flow_featurization ¶

flow_featurization(cfg: DictConfig) -> None

Main featurization flow orchestrating handcrafted and embedding features.

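Initializes the MLflow experiment, retrieves data sources from the imputation stage, and runs handcrafted featurization and, optionally, embedding extraction. In the source shown here, embedding extraction is disabled by a hard-coded `compute_embeddings = False`.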

PARAMETER	TYPE	DESCRIPTION
`cfg`	`DictConfig`	Configuration dictionary containing PREFECT, MLFLOW, and other settings.

Source code in src/featurization/flow_featurization.py

def flow_featurization(cfg: DictConfig) -> None:
    """Main featurization flow orchestrating handcrafted and embedding features.

    Initializes MLflow experiment, retrieves data sources from imputation,
    and runs both handcrafted featurization and optionally embedding extraction.

    Parameters
    ----------
    cfg : DictConfig
        Configuration dictionary containing PREFECT, MLFLOW, and other settings.
    """
    experiment_name = experiment_name_wrapper(
        experiment_name=cfg["PREFECT"]["FLOW_NAMES"]["FEATURIZATION"], cfg=cfg
    )
    logger.info("FLOW | Name: {}".format(experiment_name))
    logger.info("=====================")
    prev_experiment_name = experiment_name_wrapper(
        experiment_name=cfg["PREFECT"]["FLOW_NAMES"]["IMPUTATION"], cfg=cfg
    )

    # Initialize the MLflow experiment
    init_mlflow_experiment(mlflow_cfg=cfg["MLFLOW"], experiment_name=experiment_name)

    # Get the data sources (from imputation, and from original ground truth DuckDB database)
    sources = define_sources_for_flow(
        cfg=cfg, prev_experiment_name=prev_experiment_name, task="imputation"
    )

    # Get the handcrafted features
    flow_handcrafted_featurization(
        cfg=cfg,
        sources=sources,
        experiment_name=experiment_name,
        prev_experiment_name=prev_experiment_name,
    )

    # Get the "deep features", i.e. embeddings, e.g. from foundation models
    compute_embeddings = False  # not so useful, so quick'n'dirty skip
    if compute_embeddings:
        flow_embedding(
            cfg=cfg,
            sources=sources,
            experiment_name=experiment_name,
            prev_experiment_name=prev_experiment_name,
        )
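
The source above reads only a few keys from `cfg` directly. The sketch below summarizes them as a nested mapping and checks that they are present. It uses a plain `dict` to stay dependency-free, whereas the real flow receives a `DictConfig`; all values are placeholders, and `missing_paths` is a hypothetical helper, not part of the project.

```python
# Keys that flow_featurization reads directly; every value is a placeholder.
cfg_sketch = {
    "PREFECT": {
        "FLOW_NAMES": {
            "FEATURIZATION": "featurization-placeholder",
            "IMPUTATION": "imputation-placeholder",
        }
    },
    "MLFLOW": {},  # contents are not shown in this section
}

REQUIRED_PATHS = [
    ("PREFECT", "FLOW_NAMES", "FEATURIZATION"),
    ("PREFECT", "FLOW_NAMES", "IMPUTATION"),
    ("MLFLOW",),
]

def missing_paths(cfg, paths=REQUIRED_PATHS):
    """Hypothetical helper: return the key paths that cannot be resolved in cfg."""
    missing = []
    for path in paths:
        node = cfg
        for key in path:
            try:
                node = node[key]
            except (KeyError, TypeError):
                missing.append(path)
                break
    return missing

problems = missing_paths(cfg_sketch)                 # complete config
problems_incomplete = missing_paths({"MLFLOW": {}})  # PREFECT section absent
```

Functions called inside the flow (for example `define_sources_for_flow`) receive the whole `cfg` and may require further keys that are not visible in this listing.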