Featurization

Stage 3 of the pipeline: extracting handcrafted features from pupillary light reflex (PLR) signals.

Feature Types

Amplitude Bins

Histogram-based features capturing the distribution of pupil sizes:

  • Baseline diameter
  • Constriction amplitude (absolute and relative)
  • Max constriction diameter
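As a rough illustration of the amplitude features listed above, the sketch below computes them from a single PLR trace. The function name, the dictionary keys, and the choice of "mean of all pre-stimulus samples" as the baseline are assumptions made for this example, not the pipeline's actual definitions.

```python
from statistics import mean


def amplitude_features(time_s, diameter_mm, light_onset_s):
    """Sketch of amplitude features for one PLR trace (hypothetical helper)."""
    # Assumption: baseline is the mean diameter over all samples before light onset.
    pre = [d for t, d in zip(time_s, diameter_mm) if t < light_onset_s]
    post = [d for t, d in zip(time_s, diameter_mm) if t >= light_onset_s]
    baseline = mean(pre)
    # Max constriction corresponds to the smallest diameter after light onset.
    min_diameter = min(post)
    abs_amplitude = baseline - min_diameter
    return {
        "baseline_diameter": baseline,
        "max_constriction_diameter": min_diameter,
        "constriction_amplitude_abs": abs_amplitude,
        "constriction_amplitude_rel": abs_amplitude / baseline,
    }


# Toy trace: 6 mm baseline, constricting to 3 mm after light onset at t = 1 s.
time_s = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
diameter_mm = [6.0, 6.0, 5.0, 4.0, 3.0, 4.0, 5.0]
amp = amplitude_features(time_s, diameter_mm, light_onset_s=1.0)
```

The relative amplitude is expressed here as a fraction of baseline; a percentage would work equally well as long as it is used consistently.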

Latency Features

Timing-based features:

Feature                    Description
latency_to_constriction    Time to reach max constriction
latency_75pct              Time to reach 75% constriction
time_to_redilation         Recovery time
constriction_duration      Duration of constriction phase
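A minimal sketch of how the first two latency features could be derived, assuming latencies are measured from light onset and that "75% constriction" means the first sample at which 75% of the full constriction amplitude has been reached. The helper name and these definitions are assumptions; time_to_redilation and constriction_duration are omitted because their exact definitions are not stated here.

```python
def latency_features(time_s, diameter_mm, light_onset_s, baseline_mm):
    """Sketch of two latency features for one PLR trace (hypothetical helper)."""
    post = [(t, d) for t, d in zip(time_s, diameter_mm) if t >= light_onset_s]
    # Time and diameter at maximum constriction (smallest diameter after onset).
    t_min, d_min = min(post, key=lambda td: td[1])
    # Diameter at which 75% of the full constriction amplitude has been reached.
    threshold = baseline_mm - 0.75 * (baseline_mm - d_min)
    t_75 = next(t for t, d in post if d <= threshold)
    return {
        "latency_to_constriction": t_min - light_onset_s,
        "latency_75pct": t_75 - light_onset_s,
    }


# Toy trace: light onset at t = 1 s, minimum diameter of 3 mm at t = 2 s.
time_s = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
diameter_mm = [6.0, 6.0, 5.0, 3.5, 3.0, 4.0, 5.0]
lat = latency_features(time_s, diameter_mm, light_onset_s=1.0, baseline_mm=6.0)
```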

Velocity Features

Feature                       Description
max_constriction_velocity     Peak constriction speed
mean_constriction_velocity    Average constriction speed
max_redilation_velocity       Peak recovery speed
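The velocity features can be sketched with simple finite differences between consecutive samples. This example assumes that a negative diameter change counts as constriction and a positive one as redilation, and that speeds are reported as positive values in mm/s. The helper name and these conventions are assumptions, and real traces would normally be smoothed first.

```python
def velocity_features(time_s, diameter_mm):
    """Sketch of velocity features via finite differences (hypothetical helper)."""
    samples = list(zip(time_s, diameter_mm))
    # Diameter change per second between consecutive samples.
    velocities = [
        (d1 - d0) / (t1 - t0)
        for (t0, d0), (t1, d1) in zip(samples, samples[1:])
    ]
    # Report constriction speeds as positive magnitudes.
    constricting = [-v for v in velocities if v < 0]
    redilating = [v for v in velocities if v > 0]
    mean_constriction = (
        sum(constricting) / len(constricting) if constricting else 0.0
    )
    return {
        "max_constriction_velocity": max(constricting, default=0.0),
        "mean_constriction_velocity": mean_constriction,
        "max_redilation_velocity": max(redilating, default=0.0),
    }


# Toy trace sampled every 0.5 s.
time_s = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
diameter_mm = [6.0, 6.0, 5.0, 3.5, 3.0, 4.0, 5.0]
vel = velocity_features(time_s, diameter_mm)
```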

PIPR Features

Post-Illumination Pupil Response:

Feature          Description
pipr_6s          PIPR at 6 seconds
pipr_10s         PIPR at 10 seconds
recovery_time    Time to baseline recovery
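One possible way to compute the PIPR features is sketched below. It assumes PIPR is expressed as constriction relative to baseline, measured 6 s and 10 s after light offset using the nearest available sample, and that "recovery" means returning to a chosen fraction of baseline. The helper name, the relative-to-baseline normalization, and the 0.95 recovery fraction are all assumptions for illustration.

```python
def pipr_features(time_s, diameter_mm, light_offset_s, baseline_mm,
                  recovery_fraction=0.95):
    """Sketch of PIPR features for one PLR trace (hypothetical helper)."""

    def diameter_at(t_target):
        # Nearest-sample lookup; interpolation would be an alternative.
        idx = min(range(len(time_s)), key=lambda i: abs(time_s[i] - t_target))
        return diameter_mm[idx]

    def relative_constriction(d):
        return (baseline_mm - d) / baseline_mm

    # Times after light offset at which the pupil is back near baseline.
    recovered = [
        t for t, d in zip(time_s, diameter_mm)
        if t >= light_offset_s and d >= recovery_fraction * baseline_mm
    ]
    return {
        "pipr_6s": relative_constriction(diameter_at(light_offset_s + 6.0)),
        "pipr_10s": relative_constriction(diameter_at(light_offset_s + 10.0)),
        "recovery_time": recovered[0] - light_offset_s if recovered else None,
    }


# Toy trace sampled every 2 s; light offset at t = 2 s, baseline 5 mm.
time_s = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
diameter_mm = [5.0, 3.0, 3.5, 4.0, 4.0, 4.25, 4.5, 4.9]
pipr = pipr_features(time_s, diameter_mm, light_offset_s=2.0, baseline_mm=5.0)
```

Returning None when the pupil never recovers within the recording keeps that case distinguishable from a genuine zero-second recovery.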

Why Handcrafted Features?

Key Finding

Handcrafted features outperform foundation model embeddings by 9 percentage points (0.830 vs 0.740 AUROC).

Foundation model embeddings were tested but underperformed because:

  1. Generic embeddings don't capture domain-specific PLR physiology
  2. Handcrafted features encode expert knowledge about glaucoma biomarkers
  3. The dataset is small (N=208), so it doesn't benefit from high-dimensional embeddings

Configuration

# Featurization is fixed (not configurable)
# Uses handcrafted features only

API Reference

flow_featurization

flow_featurization(cfg: DictConfig) -> None

Main featurization flow orchestrating handcrafted and embedding features.

Initializes the MLflow experiment, retrieves data sources from the imputation stage, and runs handcrafted featurization. Embedding extraction is optional and is currently disabled in the source by a hard-coded compute_embeddings = False.

PARAMETER    DESCRIPTION
cfg          Configuration dictionary containing PREFECT, MLFLOW, and other settings.
             TYPE: DictConfig

Source code in src/featurization/flow_featurization.py
def flow_featurization(cfg: DictConfig) -> None:
    """Main featurization flow orchestrating handcrafted and embedding features.

    Initializes the MLflow experiment, retrieves data sources from imputation,
    and runs handcrafted featurization and, optionally, embedding extraction.

    Parameters
    ----------
    cfg : DictConfig
        Configuration dictionary containing PREFECT, MLFLOW, and other settings.
    """
    experiment_name = experiment_name_wrapper(
        experiment_name=cfg["PREFECT"]["FLOW_NAMES"]["FEATURIZATION"], cfg=cfg
    )
    logger.info(f"FLOW | Name: {experiment_name}")
    logger.info("=====================")
    prev_experiment_name = experiment_name_wrapper(
        experiment_name=cfg["PREFECT"]["FLOW_NAMES"]["IMPUTATION"], cfg=cfg
    )

    # Initialize the MLflow experiment
    init_mlflow_experiment(mlflow_cfg=cfg["MLFLOW"], experiment_name=experiment_name)

    # Get the data sources (from imputation, and from original ground truth DuckDB database)
    sources = define_sources_for_flow(
        cfg=cfg, prev_experiment_name=prev_experiment_name, task="imputation"
    )

    # Get the handcrafted features
    flow_handcrafted_featurization(
        cfg=cfg,
        sources=sources,
        experiment_name=experiment_name,
        prev_experiment_name=prev_experiment_name,
    )

    # Get the "deep features", i.e. embeddings, e.g. from foundation models
    compute_embeddings = False  # embeddings underperformed, so this step is skipped
    if compute_embeddings:
        flow_embedding(
            cfg=cfg,
            sources=sources,
            experiment_name=experiment_name,
            prev_experiment_name=prev_experiment_name,
        )
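The listing above reads three entries from cfg: PREFECT.FLOW_NAMES.FEATURIZATION, PREFECT.FLOW_NAMES.IMPUTATION, and MLFLOW. The sketch below shows a pre-flight check for those keys, using a plain dict as a stand-in for the DictConfig. The helper name missing_cfg_keys and all example values are hypothetical and not part of the project.

```python
from collections.abc import Mapping

# Key paths that flow_featurization reads directly in the listing above.
REQUIRED_KEYS = [
    ("PREFECT", "FLOW_NAMES", "FEATURIZATION"),
    ("PREFECT", "FLOW_NAMES", "IMPUTATION"),
    ("MLFLOW",),
]


def missing_cfg_keys(cfg):
    """Return dotted paths of required keys absent from cfg (hypothetical helper)."""
    missing = []
    for path in REQUIRED_KEYS:
        node = cfg
        for key in path:
            if not isinstance(node, Mapping) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


# Illustrative values only; the real flow names and MLflow settings will differ.
example_cfg = {
    "PREFECT": {
        "FLOW_NAMES": {
            "FEATURIZATION": "featurization-demo",
            "IMPUTATION": "imputation-demo",
        }
    },
    "MLFLOW": {},
}
problems = missing_cfg_keys(example_cfg)
incomplete = missing_cfg_keys({"MLFLOW": {}})
```

Checking the configuration up front gives a readable list of missing keys rather than a KeyError partway through the flow.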