featurization¶
Feature extraction from PLR signals.
Overview¶
This module extracts handcrafted physiological features:
- Amplitude bins (histogram features)
- Latency features (timing)
- Velocity features
- PIPR (Post-Illumination Pupil Response)
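To make the feature families concrete, here is a hedged, self-contained sketch of a PIPR-style amplitude feature. This is illustrative only (the function name, sampling rate, and window choices are assumptions, not this module's actual code): PIPR is typically quantified as sustained pupil constriction after light offset, relative to the pre-stimulus baseline.

```python
# Hypothetical sketch, NOT the module's actual implementation:
# a PIPR-style feature from a pupil-size trace, assuming a 30 fps
# signal, a 1 s pre-stimulus baseline, and a known light-offset time.

def pipr_feature(pupil, fps=30, light_offset_s=2.0, window_s=1.0):
    """Mean sustained constriction in a window after light offset,
    relative to the pre-stimulus baseline (first second of the trace)."""
    baseline = sum(pupil[:fps]) / fps
    start = int(light_offset_s * fps)
    end = int((light_offset_s + window_s) * fps)
    post = pupil[start:end]
    return baseline - sum(post) / len(post)

# Toy trace: 5.0 mm baseline, sustained constriction to 3.0 mm after offset
trace = [5.0] * 60 + [3.0] * 60
print(pipr_feature(trace))  # 2.0
```

Larger values indicate stronger sustained constriction, which is why PIPR is commonly computed for the blue (melanopsin-driving) stimulus.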
Main Entry Point¶

flow_featurization¶

flow_featurization¶

Main featurization flow orchestrating handcrafted and embedding features.

Initializes the MLflow experiment, retrieves data sources from imputation, and runs handcrafted featurization and, optionally, embedding extraction.

| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` (`DictConfig`) | Configuration dictionary containing PREFECT, MLFLOW, and other settings. |

Source code in src/featurization/flow_featurization.py
PLR Featurization¶

featurize_PLR¶

featurize_subject¶

featurize_subject(
    subject_dict: dict,
    subject_code: str,
    cfg: DictConfig,
    feature_cfg: DictConfig,
    i: int,
    feature_col: str = "X",
)

Compute all features for a single subject.

Extracts features for each light color and combines them with metadata.

| PARAMETER | DESCRIPTION |
|---|---|
| `subject_dict` (`dict`) | Dictionary containing subject data arrays. |
| `subject_code` (`str`) | Unique subject identifier. |
| `cfg` (`DictConfig`) | Main configuration dictionary. |
| `feature_cfg` (`DictConfig`) | Feature-specific configuration. |
| `i` (`int`) | Subject index in the dataset. |
| `feature_col` (`str`) | Column name for feature values, by default 'X'. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with features per color and metadata. |

Source code in src/featurization/featurize_PLR.py
compute_features_from_dict¶

compute_features_from_dict(
    split_dict: dict,
    split: str,
    preprocess_dict: dict,
    feature_cfg: DictConfig,
    cfg: DictConfig,
)

Compute features for all subjects in a data split.

Destandardizes data if needed, then iterates through subjects to compute handcrafted PLR features.

| PARAMETER | DESCRIPTION |
|---|---|
| `split_dict` (`dict`) | Dictionary containing split data with 'data' and 'X' arrays. |
| `split` (`str`) | Split name (e.g., 'train', 'test'). |
| `preprocess_dict` (`dict`) | Preprocessing statistics for destandardization. |
| `feature_cfg` (`DictConfig`) | Feature configuration. |
| `cfg` (`DictConfig`) | Main configuration dictionary. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary keyed by subject_code containing computed features. |

Source code in src/featurization/featurize_PLR.py
get_handcrafted_PLR_features¶

Extract handcrafted PLR features from source data.

Processes all splits, computes features per subject, and flattens the nested structure into dataframes.

| PARAMETER | DESCRIPTION |
|---|---|
| `source_data` (`dict`) | Source data dictionary with 'df', 'preprocess', and 'mlflow' keys. |
| `cfg` (`DictConfig`) | Main configuration dictionary. |
| `feature_cfg` (`DictConfig`) | Feature-specific configuration. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with 'data' (dataframes per split) and 'mlflow_run'. |

Source code in src/featurization/featurize_PLR.py
featurization_script¶

featurization_script(
    experiment_name: str,
    prev_experiment_name: str,
    cfg: DictConfig,
    source_name: str,
    source_data: dict,
    featurization_method: str,
    feature_cfg: DictConfig,
    run_name: str,
)

Execute the featurization pipeline for a single source.

Runs featurization with MLflow tracking, logging parameters, metrics, and artifacts. Supports handcrafted features and embeddings.

| PARAMETER | DESCRIPTION |
|---|---|
| `experiment_name` (`str`) | MLflow experiment name for featurization. |
| `prev_experiment_name` (`str`) | Previous experiment name (imputation). |
| `cfg` (`DictConfig`) | Main configuration dictionary. |
| `source_name` (`str`) | Name of the data source being featurized. |
| `source_data` (`dict`) | Source data dictionary. |
| `featurization_method` (`str`) | Method name ('handcrafted_features' or 'embeddings'). |
| `feature_cfg` (`DictConfig`) | Feature configuration. |
| `run_name` (`str`) | MLflow run name. |

Source code in src/featurization/featurize_PLR.py
featurizer_PLR_subject¶

nan_auc¶

Compute AUC while handling NaN values.

| PARAMETER | DESCRIPTION |
|---|---|
| `x` (`ndarray`) | X values for AUC computation. |
| `y` (`ndarray`) | Y values for AUC computation. |
| `method` (`str`) | Method for handling NaNs, by default ''. |

| RETURNS | DESCRIPTION |
|---|---|
| `float` | Computed AUC score. |

| RAISES | DESCRIPTION |
|---|---|
| `NotImplementedError` | This function is not yet implemented. |

Source code in src/featurization/featurizer_PLR_subject.py
compute_AUC¶

Compute the area under the curve for a time series.

| PARAMETER | DESCRIPTION |
|---|---|
| `y` (`ndarray`) | Y values (e.g., pupil size measurements). |
| `fps` (`int`) | Frames per second for time axis calculation, by default 30. |
| `return_abs_AUC` (`bool`) | If True, return the absolute value of the AUC, by default False. |

| RETURNS | DESCRIPTION |
|---|---|
| `float` | AUC value, or NaN if y contains NaN values. |

Source code in src/featurization/featurizer_PLR_subject.py
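The documented contract (NaN in → NaN out, otherwise trapezoidal integration over a time axis derived from `fps`) can be sketched in plain Python. This is a hedged stand-in, not the project's actual `compute_AUC`:

```python
import math

# Hedged sketch mirroring the documented behavior of compute_AUC:
# return NaN when the series contains NaNs, otherwise integrate with
# the trapezoidal rule using a time step of 1/fps seconds.

def compute_auc_sketch(y, fps=30, return_abs=False):
    if any(math.isnan(v) for v in y):
        return float("nan")
    dt = 1.0 / fps
    # trapezoidal rule: average of adjacent samples times the time step
    auc = sum((a + b) / 2.0 * dt for a, b in zip(y, y[1:]))
    return abs(auc) if return_abs else auc

print(compute_auc_sketch([1.0, 1.0, 1.0], fps=1))  # 2.0
```

With a constant signal of 1.0 over two 1-second intervals, the integral is exactly 2.0, which is an easy sanity check for the step size.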
compute_feature¶

compute_feature(
    feature_samples: DataFrame,
    feature: str,
    feature_params: dict[str, Any],
    feature_col: str = "imputation_mean",
) -> dict[str, Any]

Compute a single feature from sampled time points.

Dispatches to amplitude or timing feature computation based on feature_params['measure'].

| PARAMETER | DESCRIPTION |
|---|---|
| `feature_samples` (`DataFrame`) | Dataframe with time points within the feature window. |
| `feature` (`str`) | Feature name being computed. |
| `feature_params` (`dict[str, Any]`) | Feature parameters including 'measure' and 'stat'. |
| `feature_col` (`str`) | Column name for feature values, by default 'imputation_mean'. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Feature dictionary with 'value', 'std', 'ci_pos', and 'ci_neg'. |

| RAISES | DESCRIPTION |
|---|---|
| `NotImplementedError` | If the feature measure type is not 'amplitude' or 'timing'. |

Source code in src/featurization/featurizer_PLR_subject.py
get_individual_feature¶

get_individual_feature(
    df_subject: DataFrame,
    light_timing: dict[str, Any],
    feature_cfg: DictConfig,
    color: str,
    feature: str,
    feature_params: dict[str, Any],
    feature_col: str = "mean",
) -> Optional[dict[str, Any]]

Extract a single feature for a subject at a specific light color.

Converts relative timing to absolute, extracts samples within the time window, and computes the feature value.

| PARAMETER | DESCRIPTION |
|---|---|
| `df_subject` (`DataFrame`) | Subject dataframe with time series data. |
| `light_timing` (`dict[str, Any]`) | Light timing information with onset/offset times. |
| `feature_cfg` (`DictConfig`) | Feature configuration. |
| `color` (`str`) | Light color ('Red' or 'Blue'). |
| `feature` (`str`) | Feature name to compute. |
| `feature_params` (`dict[str, Any]`) | Feature parameters with timing and statistic info. |
| `feature_col` (`str`) | Column name for feature values, by default 'mean'. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` or `None` | Feature dictionary with value, std, and CI, or None on error. |

| RAISES | DESCRIPTION |
|---|---|
| `Exception` | Re-raised if an error occurs during feature extraction. |

Source code in src/featurization/featurizer_PLR_subject.py
get_features_per_color¶

get_features_per_color(
    df_subject: DataFrame,
    light_timing: dict[str, Any],
    bin_cfg: DictConfig,
    color: str,
    feature_col: str,
) -> dict[str, Optional[dict[str, Any]]]

Compute all configured features for a specific light color.

| PARAMETER | DESCRIPTION |
|---|---|
| `df_subject` (`DataFrame`) | Subject dataframe with time series data. |
| `light_timing` (`dict[str, Any]`) | Light timing information with onset/offset times. |
| `bin_cfg` (`DictConfig`) | Configuration defining which features to compute. |
| `color` (`str`) | Light color ('Red' or 'Blue'). |
| `feature_col` (`str`) | Column name for feature values. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary keyed by feature name containing feature dictionaries. |

Source code in src/featurization/featurizer_PLR_subject.py
check_that_features_are_not_the_same_for_colors¶

check_that_features_are_not_the_same_for_colors(
    features: dict[str, dict[str, dict[str, Any]]],
) -> None

Validate that red and blue light features are different.

Ensures that features computed for different light colors are not identical, which would indicate a data processing error.

| PARAMETER | DESCRIPTION |
|---|---|
| `features` (`dict[str, dict[str, dict[str, Any]]]`) | Dictionary keyed by color containing feature dictionaries. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If feature values are identical for both colors. |

Source code in src/featurization/featurizer_PLR_subject.py
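The intent of this check can be sketched in a few lines (assumed behavior inferred from the docstring, with the `value` key following the feature-dictionary structure documented above):

```python
# Hedged sketch of the red-vs-blue sanity check: if every shared feature
# has the same value for both colors, something upstream likely reused
# the same trace, so raise. Not the project's actual implementation.

def check_colors_differ(features):
    red, blue = features["Red"], features["Blue"]
    shared = set(red) & set(blue)
    if shared and all(red[f]["value"] == blue[f]["value"] for f in shared):
        raise ValueError(
            "Identical features for Red and Blue - likely a data processing error"
        )

ok = {"Red": {"auc": {"value": 1.2}}, "Blue": {"auc": {"value": 0.8}}}
check_colors_differ(ok)  # passes silently
```

A check like this is cheap insurance: a copy-paste bug that feeds the same array to both colors would otherwise propagate silently into every downstream model.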
Feature Utilities¶

feature_utils¶

if_refeaturize¶

Determine whether featurization should be re-run.

Checks if re-featurization is forced by config or if no existing features are found on disk.

| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` (`DictConfig`) | Configuration dictionary containing PLR_FEATURIZATION settings. |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if featurization should be performed, False if existing features should be loaded from disk. |

Source code in src/featurization/feature_utils.py
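The decision logic described above is simple enough to sketch. The config key names below (`PLR_FEATURIZATION`, `force_refeaturize`) are assumptions for illustration; the real schema lives in the project's Hydra config:

```python
import os

# Rough sketch of the documented decision: re-featurize when forced by
# config, or when no feature file exists on disk yet. Config key names
# are hypothetical placeholders, not the real schema.

def should_refeaturize(cfg: dict, features_path: str) -> bool:
    if cfg.get("PLR_FEATURIZATION", {}).get("force_refeaturize", False):
        return True
    return not os.path.exists(features_path)

print(should_refeaturize({}, "/no/such/features.pkl"))  # True
```

Caching on disk this way keeps iterative experiments fast while the `force_refeaturize` escape hatch guarantees stale features never survive a config change you care about.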
combine_df_with_outputs¶

combine_df_with_outputs(
    df: DataFrame,
    data_dict: dict[str, ndarray],
    imputation: dict[str, Any],
    split: str,
    split_key: str,
    model_name: str,
) -> DataFrame

Combine dataframe with imputation outputs and standardized inputs.

| PARAMETER | DESCRIPTION |
|---|---|
| `df` (`DataFrame`) | Input dataframe to combine with. |
| `data_dict` (`dict[str, ndarray]`) | Dictionary containing standardized input data arrays. |
| `imputation` (`dict[str, Any]`) | Dictionary containing imputation results. |
| `split` (`str`) | Data split identifier (e.g., 'train', 'test'). |
| `split_key` (`str`) | Split key identifier (e.g., 'train_gt', 'train_raw'). |
| `model_name` (`str`) | Name of the imputation model. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Dataframe with imputation and standardized input columns added. |

Source code in src/featurization/feature_utils.py
combine_standardized_inputs_with_df¶

Add standardized input arrays as columns to a Polars dataframe.

| PARAMETER | DESCRIPTION |
|---|---|
| `df` (`DataFrame`) | Input dataframe to add columns to. |
| `data_dict` (`dict[str, ndarray]`) | Dictionary with keys as column names and values as numpy arrays. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Dataframe with new columns prefixed with 'standardized_'. |

| RAISES | DESCRIPTION |
|---|---|
| `AssertionError` | If the number of samples in an array doesn't match the dataframe rows. |

Source code in src/featurization/feature_utils.py
combine_inputation_with_df¶

Add imputation results as columns to a Polars dataframe.

| PARAMETER | DESCRIPTION |
|---|---|
| `df` (`DataFrame`) | Input dataframe to add columns to. |
| `imputation` (`dict`) | Dictionary containing an 'imputation' sub-dict with arrays and an 'indicating_mask' for missingness. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Dataframe with imputation columns and missingness mask added. |

| RAISES | DESCRIPTION |
|---|---|
| `AssertionError` | If the number of samples in the arrays doesn't match the dataframe rows. |

Source code in src/featurization/feature_utils.py
subjects_with_class_labels¶

Get unique subject codes that have class labels (glaucoma/control).

| PARAMETER | DESCRIPTION |
|---|---|
| `df` (`DataFrame`) | Dataframe containing subject data with 'subject_code' and 'class_label'. |
| `split` (`str`) | Data split identifier (e.g., 'train', 'test'). |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Sorted dataframe with unique subject codes that have non-null class labels. |

Source code in src/featurization/feature_utils.py
pick_correct_split¶

pick_correct_split(
    data_dict: dict[str, Any],
    split: str,
    split_key: str,
    eval_results: dict[str, Any],
    model_name: str,
    standardize_stats: dict[str, dict[str, float]],
) -> dict[str, Any]

Select and destandardize evaluation results for the correct data split.

| PARAMETER | DESCRIPTION |
|---|---|
| `data_dict` (`dict[str, Any]`) | Dictionary containing preprocessed data. |
| `split` (`str`) | Data split name (e.g., 'train', 'test'). |
| `split_key` (`str`) | Split key containing a 'gt' or 'raw' suffix. |
| `eval_results` (`dict[str, Any]`) | Dictionary containing evaluation results keyed by split_key. |
| `model_name` (`str`) | Name of the model being processed. |
| `standardize_stats` (`dict[str, dict[str, float]]`) | Dictionary with 'gt' and 'raw' sub-dicts containing mean/std. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Destandardized results for the specified split key. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If split_key doesn't contain 'gt' or 'raw'. |

Source code in src/featurization/feature_utils.py
pick_input_data¶

pick_input_data(
    input_data: dict[str, ndarray],
    split: str,
    split_key: str,
    model_name: str,
) -> dict[str, ndarray]

Select input data arrays for the correct split and data type.

| PARAMETER | DESCRIPTION |
|---|---|
| `input_data` (`dict[str, ndarray]`) | Dictionary containing preprocessed data with keys like 'X_train_gt'. |
| `split` (`str`) | Data split name, e.g., 'train'. |
| `split_key` (`str`) | Split key, e.g., 'train_gt' or 'train_raw'. |
| `model_name` (`str`) | Name of the model, e.g., 'CSDI'. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with 'X' (selected data) and 'X_gt' (ground truth). |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If split_key doesn't contain 'gt' or 'raw'. |

Source code in src/featurization/feature_utils.py
get_light_stimuli_timings¶

Extract light stimulus onset/offset timings for the red and blue colors.

| PARAMETER | DESCRIPTION |
|---|---|
| `df_subject` (`DataFrame`) | Subject dataframe containing 'Red', 'Blue', and 'time' columns. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with 'Red' and 'Blue' keys, each containing 'light_onset' (float, time of light onset), 'light_offset' (float, time of light offset), and 'light_duration' (float, duration of the light stimulus). |

Source code in src/featurization/feature_utils.py
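The returned structure is easy to reproduce from a binary stimulus channel sampled alongside a time axis. A minimal sketch, assuming the 'Red'/'Blue' columns hold 0/1 indicators as described above (plain lists here instead of a Polars dataframe):

```python
# Illustrative sketch (not the real function): derive onset, offset, and
# duration of one light stimulus from a binary indicator column.

def light_timing(time, light):
    on_times = [t for t, v in zip(time, light) if v == 1]
    onset, offset = on_times[0], on_times[-1]
    return {
        "light_onset": onset,
        "light_offset": offset,
        "light_duration": offset - onset,
    }

t = [0.0, 0.5, 1.0, 1.5, 2.0]
red = [0, 1, 1, 1, 0]
print(light_timing(t, red))
# {'light_onset': 0.5, 'light_offset': 1.5, 'light_duration': 1.0}
```

The real implementation does the equivalent via `get_top1_of_col` and `replace_zeros_with_null` below, which is why those helpers exist.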
get_top1_of_col¶

Get the first or last row (by time) where a column has non-zero values.

Used to find the light onset (first timepoint where light=1) or light offset (last timepoint where light=1).

| PARAMETER | DESCRIPTION |
|---|---|
| `df` (`DataFrame`) | Input dataframe with a 'time' column. |
| `col` (`str`) | Column name to filter by (e.g., 'Red', 'Blue'). |
| `descending` (`bool`) | If True, get the LAST timepoint (light offset). If False, get the FIRST timepoint (light onset). |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Single-row dataframe with the onset or offset timepoint. |

| RAISES | DESCRIPTION |
|---|---|
| `AssertionError` | If no samples remain after filtering null values. |

Source code in src/featurization/feature_utils.py
replace_zeros_with_null¶

Replace zero values with NaN in a specified column.

Used to identify light onset/offset, where zero indicates light-off periods.

| PARAMETER | DESCRIPTION |
|---|---|
| `df` (`DataFrame`) | Input dataframe. |
| `col` (`str`) | Column name to process. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Dataframe with zeros replaced by NaN in the specified column. |

Source code in src/featurization/feature_utils.py
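The transformation itself is a one-liner; here it is over plain Python floats (the real function operates on a Polars column, so this is only a behavioral sketch):

```python
import math

# Sketch of the zero-to-NaN replacement: light-off samples (0) become
# NaN so that null-filtering later isolates the light-on interval.

def zeros_to_nan(values):
    return [float("nan") if v == 0 else float(v) for v in values]

out = zeros_to_nan([0, 1, 1, 0])
print(out[1])  # 1.0
```

Turning zeros into NaN lets the onset/offset search reduce to "drop nulls, take the first/last remaining row", which is exactly how `get_top1_of_col` is described above.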
convert_relative_timing_to_absolute_timing¶

convert_relative_timing_to_absolute_timing(
    light_timing: dict[str, float],
    feature_params: dict[str, Any],
    color: str,
    feature: str,
    feature_cfg: DictConfig,
) -> dict[str, Any]

Convert relative feature timing to absolute timing based on the light stimulus.

| PARAMETER | DESCRIPTION |
|---|---|
| `light_timing` (`dict[str, float]`) | Dictionary with 'light_onset' and 'light_offset' times. |
| `feature_params` (`dict[str, Any]`) | Feature parameters with 'time_from', 'time_start', and 'time_end'. |
| `color` (`str`) | Light color ('Red' or 'Blue'). |
| `feature` (`str`) | Feature name being computed. |
| `feature_cfg` (`DictConfig`) | Feature configuration. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Updated feature_params with absolute 'time_start' and 'time_end'. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If 'time_from' is not 'onset' or 'offset'. |

Source code in src/featurization/feature_utils.py
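The conversion reduces to adding the stimulus timestamp to a window defined relative to it. A minimal sketch using the key names from the parameter docs (the real function also consults `color`, `feature`, and `feature_cfg`, omitted here):

```python
# Hedged sketch of relative-to-absolute window conversion: a feature
# window anchored at light onset or offset becomes absolute by adding
# the corresponding stimulus timestamp.

def to_absolute(light_timing, feature_params):
    time_from = feature_params["time_from"]
    if time_from not in ("onset", "offset"):
        raise ValueError(f"Unknown time_from: {time_from}")
    anchor = light_timing["light_onset" if time_from == "onset" else "light_offset"]
    out = dict(feature_params)
    out["time_start"] = anchor + feature_params["time_start"]
    out["time_end"] = anchor + feature_params["time_end"]
    return out

params = to_absolute(
    {"light_onset": 5.0, "light_offset": 10.0},
    {"time_from": "offset", "time_start": 0.0, "time_end": 2.0},
)
print(params["time_start"], params["time_end"])  # 10.0 12.0
```

So a PIPR-style window of "0 to 2 s after light offset" with an offset at 10 s becomes the absolute interval [10.0, 12.0], which is what `get_feature_samples` filters on next.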
get_feature_samples¶

get_feature_samples(
    df_subject: DataFrame,
    feature_params: dict[str, Any],
    col: str = "time",
    feature: Optional[str] = None,
) -> DataFrame

Filter the dataframe to samples within a time window for feature extraction.

| PARAMETER | DESCRIPTION |
|---|---|
| `df_subject` (`DataFrame`) | Subject dataframe with time series data. |
| `feature_params` (`dict[str, Any]`) | Dictionary with 'time_start' and 'time_end' defining the window. |
| `col` (`str`) | Column name for time values, by default 'time'. |
| `feature` (`Optional[str]`) | Feature name for logging, by default None. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Filtered dataframe with samples within the time window. |

Notes

Uses pandas conversion to avoid Polars Rust errors with Object arrays. See https://github.com/pola-rs/polars/issues/18399

Source code in src/featurization/feature_utils.py
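The filtering contract is a closed-interval time-window selection. A minimal stand-in over a list of row dicts (the real function filters a Polars dataframe, with the pandas workaround noted above):

```python
# Sketch of the window filter: keep rows whose time value falls inside
# [time_start, time_end], inclusive on both ends.

def samples_in_window(rows, time_start, time_end, col="time"):
    return [r for r in rows if time_start <= r[col] <= time_end]

rows = [{"time": 0.0, "X": 5.0}, {"time": 1.0, "X": 4.0}, {"time": 2.0, "X": 3.5}]
print(len(samples_in_window(rows, 0.5, 2.0)))  # 2
```

Whether the endpoints are inclusive is an assumption here; check the source if your feature windows are boundary-sensitive.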
flatten_dict_to_dataframe¶

flatten_dict_to_dataframe(
    features_nested: dict[str, dict[str, Any]],
    mlflow_series: Optional[Series],
    cfg: DictConfig,
) -> dict[str, Any]

Convert a nested features dictionary to a flat dataframe structure.

| PARAMETER | DESCRIPTION |
|---|---|
| `features_nested` (`dict[str, dict[str, Any]]`) | Nested dictionary keyed by split, then by subject code, containing features. |
| `mlflow_series` (`Optional[Series]`) | MLflow run information as a Polars series. |
| `cfg` (`DictConfig`) | Configuration dictionary. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with 'data' (containing flattened dataframes per split) and 'mlflow_run' information. |

Source code in src/featurization/feature_utils.py
flatten_subject_dicts_to_df¶

flatten_subject_dicts_to_df(
    subjects_as_dicts: dict[str, dict[str, Any]],
    cfg: DictConfig,
) -> DataFrame

Convert subject-wise feature dictionaries to a single dataframe.

| PARAMETER | DESCRIPTION |
|---|---|
| `subjects_as_dicts` (`dict[str, dict[str, Any]]`) | Dictionary keyed by subject_code containing feature dictionaries. |
| `cfg` (`DictConfig`) | Configuration dictionary. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Dataframe with one row per subject and feature columns. |

Source code in src/featurization/feature_utils.py
create_df_row¶

Create a single dataframe row from a subject's feature dictionary.

| PARAMETER | DESCRIPTION |
|---|---|
| `subject_dict` (`dict`) | Dictionary containing features keyed by color, then feature name. |
| `subject_code` (`str`) | Unique identifier for the subject. |
| `cfg` (`DictConfig`) | Configuration dictionary with stat_keys to extract. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Single-row dataframe with flattened feature columns. |

Source code in src/featurization/feature_utils.py
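The flattening pattern (color → feature → stats dict becomes `<color>_<feature>_<stat>` columns) can be sketched as below. The column-naming scheme and `stat_keys` default are assumptions for illustration, not the verified output schema:

```python
# Hypothetical flattening of one subject's nested feature dict into a
# single flat row; stat_keys mirrors the cfg-driven selection described
# above. Column names here are illustrative, not the real schema.

def flatten_subject(subject_dict, subject_code, stat_keys=("value", "std")):
    row = {"subject_code": subject_code}
    for color, feats in subject_dict.items():
        for feat, stats in feats.items():
            if stats is None:  # feature extraction may have failed for this color
                continue
            for key in stat_keys:
                if key in stats:
                    row[f"{color}_{feat}_{key}"] = stats[key]
    return row

row = flatten_subject({"Blue": {"pipr": {"value": 0.4, "std": 0.05}}}, "S001")
print(sorted(row))  # ['Blue_pipr_std', 'Blue_pipr_value', 'subject_code']
```

One flat row per subject is what makes the later concatenation into per-split dataframes trivial.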
get_features_fpath¶

Construct the file path for saving/loading features.

| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` (`DictConfig`) | Configuration dictionary with DATA and ARTIFACTS settings. |
| `service_name` (`str`) | Service name for the artifacts directory, by default 'best_models'. |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | Full file path for the features file. |

Source code in src/featurization/feature_utils.py
export_features_to_disk¶

Save the features dictionary to disk as a pickle file.

| PARAMETER | DESCRIPTION |
|---|---|
| `dict_out` (`dict`) | Features dictionary to save. |
| `cfg` (`DictConfig`) | Configuration dictionary for determining the file path. |

Source code in src/featurization/feature_utils.py
load_features_from_disk¶

Load the features dictionary from disk.

| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` (`DictConfig`) | Configuration dictionary for determining the file path. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Loaded features dictionary. |

Source code in src/featurization/feature_utils.py
get_feature_names¶

get_feature_names(
    features: dict[str, Any],
    cols_exclude: tuple[str, ...] = ("subject_code",),
    name_substring: str = "_value",
) -> list[str]

Extract feature names from a nested features dictionary.

| PARAMETER | DESCRIPTION |
|---|---|
| `features` (`dict[str, Any]`) | Nested features dictionary with source -> data -> split structure. |
| `cols_exclude` (`tuple[str, ...]`) | Column names to exclude, by default ('subject_code',). |
| `name_substring` (`str`) | Substring to filter column names, by default '_value'. |

| RETURNS | DESCRIPTION |
|---|---|
| `list` | List of feature names with the substring removed. |

Source code in src/featurization/feature_utils.py
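The name extraction over a flat column list can be sketched directly (the real function first walks the nested source -> data -> split structure to reach the columns; that traversal is omitted here):

```python
# Sketch of the column-to-feature-name mapping: keep columns containing
# the substring, drop excluded columns, then strip the substring.

def feature_names(columns, cols_exclude=("subject_code",), substring="_value"):
    return [
        c.replace(substring, "")
        for c in columns
        if c not in cols_exclude and substring in c
    ]

cols = ["subject_code", "Blue_pipr_value", "Blue_pipr_std", "Red_auc_value"]
print(feature_names(cols))  # ['Blue_pipr', 'Red_auc']
```

Filtering on `_value` is what separates the point estimates from their companion `_std`/CI columns in the flattened dataframes.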
get_split_keys¶

get_split_keys(
    features: dict[str, Any],
    model_exclude: str = "BASELINE_GT",
    return_suffix: bool = True,
) -> Optional[list[str]]

Get split key suffixes from the features dictionary.

| PARAMETER | DESCRIPTION |
|---|---|
| `features` (`dict[str, Any]`) | Features dictionary keyed by split. |
| `model_exclude` (`str`) | Model name to exclude from the search, by default 'BASELINE_GT'. |
| `return_suffix` (`bool`) | If True, return only the suffix; if False, return full keys. |

| RETURNS | DESCRIPTION |
|---|---|
| `list` | List of split key suffixes (e.g., ['_gt', '_raw']). |

Source code in src/featurization/feature_utils.py
data_for_featurization_wrapper¶

Prepare data dictionaries for featurization from artifacts.

Combines imputed data from models with baseline input data.

| PARAMETER | DESCRIPTION |
|---|---|
| `artifacts` (`dict`) | Dictionary containing model artifacts with imputation results. |
| `cfg` (`DictConfig`) | Configuration dictionary. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Combined dictionary of imputed and baseline data ready for featurization. |

Source code in src/featurization/feature_utils.py
get_baseline_input_data_for_featurization¶

get_baseline_input_data_for_featurization(
    artifacts: dict[str, Any],
    model_name: str,
    split_names: list[str],
) -> dict[str, Any]

Extract baseline input data (GT and raw) formatted for featurization.

Creates pseudo-imputation dictionaries from the original input data to enable consistent processing with actual imputation outputs.

| PARAMETER | DESCRIPTION |
|---|---|
| `artifacts` (`dict[str, Any]`) | Dictionary containing model artifacts. |
| `model_name` (`str`) | Name of a model to extract metadata from. |
| `split_names` (`list[str]`) | List of split names (e.g., ['train', 'test']). |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with 'BASELINE_GT' and 'BASELINE_RAW' data structures. |

Source code in src/featurization/feature_utils.py
get_imputed_data_for_featurization¶

get_imputed_data_for_featurization(
    artifacts: dict[str, Any], cfg: DictConfig
) -> tuple[dict[str, Any], str, Any]

Extract imputed data from all models for featurization.

| PARAMETER | DESCRIPTION |
|---|---|
| `artifacts` (`dict[str, Any]`) | Dictionary keyed by model name containing imputation results. |
| `cfg` (`DictConfig`) | Configuration dictionary. |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | (imputed_data, model_name, split_names), where imputed_data is the nested dictionary of imputation results. |

Source code in src/featurization/feature_utils.py
imputed_data_by_split_key¶

imputed_data_by_split_key(
    imputation: dict[str, Any],
    metadata: dict[str, Any],
    split: str,
    split_key: str,
) -> dict[str, Any]

Package imputation data with metadata for a specific split key.

| PARAMETER | DESCRIPTION |
|---|---|
| `imputation` (`dict[str, Any]`) | Imputation results dictionary. |
| `metadata` (`dict[str, Any]`) | Metadata dictionary for the split. |
| `split` (`str`) | Split name (e.g., 'train', 'test'). |
| `split_key` (`str`) | Split key (e.g., 'gt', 'raw'). |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with 'data' and 'metadata' keys. |

Source code in src/featurization/feature_utils.py
get_pseudoimputation_dicts_from_input_data¶

get_pseudoimputation_dicts_from_input_data(
    model_artifacts: dict[str, Any], split: str
) -> dict[str, dict[str, Any]]

Create imputation-like dictionaries from raw input data.

Converts ground truth and raw data arrays into the same format as imputation model outputs for consistent downstream processing.

| PARAMETER | DESCRIPTION |
|---|---|
| `model_artifacts` (`dict[str, Any]`) | Model artifacts containing data_input with ground truth and raw data. |
| `split` (`str`) | Split name (e.g., 'train', 'test'). |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with 'gt' and 'raw' keys containing pseudo-imputation dicts. |

Source code in src/featurization/feature_utils.py
get_dict_PLR_per_code¶

Extract PLR data for a single subject from a data dictionary.

| PARAMETER | DESCRIPTION |
|---|---|
| `data_dict` (`dict`) | Dictionary containing numpy arrays with shape (n_subjects, n_timepoints, 1). |
| `i` (`int`) | Subject index to extract. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with PLR data arrays for the specified subject. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If the data type is not a number or numpy array. |

Source code in src/featurization/feature_utils.py
subjectwise_df_for_featurization¶

subjectwise_df_for_featurization(
    data_dict_subj: dict[str, Any],
    metadata_subject: DataFrame,
    subject_code: str,
    cfg: DictConfig,
    i: Optional[int] = None,
) -> DataFrame

Create a subject-specific dataframe for featurization.

Combines PLR time series data with subject metadata into a single dataframe.

| PARAMETER | DESCRIPTION |
|---|---|
| `data_dict_subj` (`dict[str, Any]`) | Dictionary containing PLR data arrays for the subject. |
| `metadata_subject` (`DataFrame`) | Subject metadata as a Polars dataframe. |
| `subject_code` (`str`) | Unique subject identifier. |
| `cfg` (`DictConfig`) | Configuration dictionary with PLR_length. |
| `i` (`Optional[int]`) | Subject index for extraction, by default None. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Combined dataframe with PLR data and metadata. |

| RAISES | DESCRIPTION |
|---|---|
| `AssertionError` | If the dataframe length doesn't match the expected PLR length. |

Source code in src/featurization/feature_utils.py
drop_useless_metadata_cols¶

Remove unnecessary metadata columns from the subject dataframe.

| PARAMETER | DESCRIPTION |
|---|---|
| `metadata_subject` (`DataFrame`) | Subject metadata dataframe. |
| `i` (`int`) | Subject index (used for logging only on the first subject). |
| `cfg` (`DictConfig`) | Configuration with DROP_COLS and DROP_COLS_EXTRA lists. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Dataframe with the specified columns removed. |

Source code in src/featurization/feature_utils.py
subjectdict_to_df¶

Convert a subject's PLR dictionary to a Polars dataframe.

| PARAMETER | DESCRIPTION |
|---|---|
| `dict_PLR` (`dict`) | Dictionary with PLR data arrays keyed by data type. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Dataframe with one column per data type. |

Source code in src/featurization/feature_utils.py
get_df_subject_per_code¶

Filter the dataframe to get data for a specific subject.

| PARAMETER | DESCRIPTION |
|---|---|
| `data_df` (`DataFrame`) | Full dataframe containing all subjects. |
| `subject_code` (`str`) | Subject code to filter for. |
| `cfg` (`DictConfig`) | Configuration with PLR_length for validation. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Filtered dataframe for the specified subject. |

| RAISES | DESCRIPTION |
|---|---|
| `AssertionError` | If the filtered dataframe length doesn't match the expected PLR length. |

Source code in src/featurization/feature_utils.py
get_metadata_row¶

Extract scalar metadata from the first row of the subject dataframe.

| PARAMETER | DESCRIPTION |
|---|---|
| `df_subject` (`DataFrame`) | Subject dataframe with repeated metadata across all timepoints. |
| `cfg` (`DictConfig`) | Configuration dictionary (unused but kept for API consistency). |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | First row containing scalar metadata values. |

Source code in src/featurization/feature_utils.py
get_feature_cfg_hash¶

Generate a hash string for the feature configuration.

| PARAMETER | DESCRIPTION |
|---|---|
| `subcfg` (`DictConfig`) | Feature configuration dictionary. |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | Hash string for the configuration (currently returns a placeholder). |

Notes

Not fully implemented - returns 'dummyHash'.

Source code in src/featurization/feature_utils.py
export_features_pickle_file¶

Export the features dictionary to a pickle file.

| PARAMETER | DESCRIPTION |
|---|---|
| `features` (`dict`) | Features dictionary with a 'data' dict holding 'train' and 'test' pl.DataFrames, and 'mlflow_run' with MLflow run information. |
| `data_source` (`str`) | Data source name used for the filename. |
| `cfg` (`DictConfig`) | Configuration dictionary. |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | Path to the exported pickle file. |

| RAISES | DESCRIPTION |
|---|---|
| `Exception` | If saving fails. |

Source code in src/featurization/feature_utils.py
add_feature_metadata_suffix_to_run_name¶

Append a feature metadata suffix to the run name.

| PARAMETER | DESCRIPTION |
|---|---|
| `run_name` (`str`) | Base run name. |
| `subcfg` (`DictConfig`) | Configuration with 'name' and 'version' keys. |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | Run name with appended metadata suffix. |

Source code in src/featurization/feature_utils.py
harmonize_to_imputation_dict¶

harmonize_to_imputation_dict(
    data_array: ndarray,
    metadata: dict[str, Any],
    split_key_fixed: str,
    cfg: DictConfig,
    destandardize: bool = True,
) -> dict[str, Any]

Convert a raw data array to the imputation dictionary format.

Optionally destandardizes the data and packages it in the same format as imputation model outputs.

| PARAMETER | DESCRIPTION |
|---|---|
| `data_array` (`ndarray`) | Input data array. |
| `metadata` (`dict[str, Any]`) | Metadata dictionary with preprocessing stats. |
| `split_key_fixed` (`str`) | Split key ('gt' or 'raw'). |
| `cfg` (`DictConfig`) | Configuration dictionary. |
| `destandardize` (`bool`) | Whether to destandardize the data, by default True. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary conforming to the imputation output format with 'imputation_dict' and 'metadata' keys. |

Source code in src/featurization/feature_utils.py
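The destandardization step referenced here and in `compute_features_from_dict` is the usual inversion of z-scoring. A minimal sketch under the assumption that the stored preprocessing stats are a per-signal mean and standard deviation:

```python
# Assumed destandardization: invert z-scoring with the stored mean/std
# so that handcrafted features are computed in physical units (e.g., mm)
# rather than standardized units.

def destandardize(values, mean, std):
    return [v * std + mean for v in values]

print(destandardize([-1.0, 0.0, 1.0], mean=5.0, std=2.0))  # [3.0, 5.0, 7.0]
```

Computing amplitude features on destandardized traces keeps them comparable across models that were trained on differently standardized inputs.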
get_original_data_per_split_key¶

get_original_data_per_split_key(
    model_dict: dict[str, Any],
    cfg: DictConfig,
    split_key: str,
) -> dict[str, Any]

Extract and format original data for a specific baseline type.

| PARAMETER | DESCRIPTION |
|---|---|
| `model_dict` (`dict[str, Any]`) | Model dictionary containing data_input. |
| `cfg` (`DictConfig`) | Configuration dictionary. |
| `split_key` (`str`) | Baseline type ('BASELINE_DenoisedGT' or 'BASELINE_OutlierRemovedRaw'). |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary keyed by split with harmonized imputation format. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If split_key is not recognized. |

Source code in src/featurization/feature_utils.py
name_imputation_sources_for_featurization¶

Generate featurization run names from imputation source names.

| PARAMETER | DESCRIPTION |
|---|---|
| `sources` (`list`) | List of imputation source names. |
| `cfg` (`DictConfig`) | Configuration dictionary. |

| RETURNS | DESCRIPTION |
|---|---|
| `list` | List of featurization run names. |

Source code in src/featurization/feature_utils.py
get_original_data_to_results¶

Get baseline data (GT and raw) formatted as results dictionaries.

| PARAMETER | DESCRIPTION |
|---|---|
| `model_dict` (`dict`) | Model dictionary containing data_input. |
| `cfg` (`DictConfig`) | Configuration dictionary. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary keyed by baseline split keys with data and mlflow_run. |

Source code in src/featurization/feature_utils.py
get_imputed_results¶

Extract imputation results with metadata from the model dictionary.

| PARAMETER | DESCRIPTION |
|---|---|
| `model_dict` (`dict`) | Model dictionary with 'imputation', 'mlflow_run', and 'data_input'. |
| `cfg` (`DictConfig`) | Configuration dictionary. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with 'data', 'mlflow_run', and metadata per split. |

Source code in src/featurization/feature_utils.py
create_dict_for_featurization_from_imputation_results_and_original_data¶

create_dict_for_featurization_from_imputation_results_and_original_data(
    imputation_results: dict[str, Any], cfg: DictConfig
) -> dict[str, Any]

Create a unified dictionary for featurization from imputation and baseline data.

Combines imputation model results with the original baseline data (GT and raw) into a single dictionary structure for featurization.

| PARAMETER | DESCRIPTION |
|---|---|
| `imputation_results` (`dict[str, Any]`) | Dictionary keyed by model name containing imputation results. |
| `cfg` (`DictConfig`) | Configuration dictionary. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Unified dictionary with all sources ready for featurization. |

Source code in src/featurization/feature_utils.py
check_and_fix_df_schema
¶
Validate dataframe schema and raise error for Object type columns.
Polars Object type columns cause issues with filtering operations.
| PARAMETER | DESCRIPTION |
|---|---|
| `df_subject` | Subject dataframe to validate. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Validated dataframe (unchanged if no issues). |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If any column has Object dtype. |
See Also
https://github.com/pola-rs/polars/issues/18399
Source code in src/featurization/feature_utils.py
feature_log
¶
get_best_outlier_detection_run
¶
get_best_outlier_detection_run(
simple_outlier_name: str,
cfg: DictConfig,
id: Optional[str] = None,
) -> Optional[DataFrame]
Retrieve the best outlier detection run from MLflow.
| PARAMETER | DESCRIPTION |
|---|---|
| `simple_outlier_name` | Simplified outlier method name (e.g., 'MOMENT-gt-finetune', 'LOF'). TYPE: `str` |
| `cfg` | Configuration dictionary with PREFECT flow names. TYPE: `DictConfig` |
| `id` | Specific run ID to retrieve, by default None. TYPE: `Optional[str]` |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` or `None` | Single-row DataFrame with best run info, or None if not found. |
Source code in src/featurization/feature_log.py
get_best_outlier_run
¶
get_best_outlier_run(
mlflow_run: Series, source_name: str, cfg: DictConfig
) -> tuple[Optional[DataFrame], Optional[str]]
Get best outlier detection run associated with an imputation source.
| PARAMETER | DESCRIPTION |
|---|---|
| `mlflow_run` | MLflow run information for the imputation. TYPE: `Series` |
| `source_name` | Source name containing the outlier method (e.g., 'simple1.0__LOF__SAITS'). TYPE: `str` |
| `cfg` | Configuration dictionary. TYPE: `DictConfig` |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | (best_outlier_run, outlier_run_id), or (None, None) for pupil-gt. |
Source code in src/featurization/feature_log.py
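Source names such as 'simple1.0__LOF__SAITS' pack the anomaly mask, outlier method, and imputation model into double-underscore-separated fields. A minimal parsing sketch (the helper name and field order are assumptions, not the repository's actual code):

```python
# Assumed convention: "<mask>__<outlier_method>__<imputation_model>",
# e.g. "simple1.0__LOF__SAITS". "pupil-gt" carries no outlier run at all.
def parse_source_name(source_name: str):
    if source_name == "pupil-gt":
        return None  # ground truth: nothing to look up, mirrors (None, None)
    mask, outlier_method, imputation_model = source_name.split("__")
    return {"mask": mask, "outlier": outlier_method, "imputation": imputation_model}

parsed = parse_source_name("simple1.0__LOF__SAITS")
print(parsed["outlier"])  # LOF
```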
metrics_when_anomaly_detection_pupil_gt
¶
Log MLflow metrics for human-annotated ground truth outlier detection.
Sets perfect metrics (F1=1, FP=0) for ground truth data.
| PARAMETER | DESCRIPTION |
|---|---|
| `best_outlier_string` | Metric name suffix for logging. |
Source code in src/featurization/feature_log.py
featurization_mlflow_metrics_and_params
¶
featurization_mlflow_metrics_and_params(
mlflow_run: Optional[Series],
source_name: str,
cfg: DictConfig,
) -> None
Log featurization metrics and parameters to MLflow.
Logs imputation and outlier detection metrics from upstream runs.
| PARAMETER | DESCRIPTION |
|---|---|
| `mlflow_run` | MLflow run series from imputation, or None for baseline data. TYPE: `Optional[Series]` |
| `source_name` | Source name containing outlier and imputation method info. TYPE: `str` |
| `cfg` | Configuration dictionary. TYPE: `DictConfig` |
Source code in src/featurization/feature_log.py
export_features_to_mlflow
¶
Export features to local pickle and log to MLflow.
| PARAMETER | DESCRIPTION |
|---|---|
| `features` | Features dictionary to export. |
| `run_name` | Run name used for file naming. |
| `cfg` | Configuration dictionary. |
Source code in src/featurization/feature_log.py
log_features_to_mlflow
¶
log_features_to_mlflow(
run_name: str,
output_path: str,
mlflow_run: Optional[Series],
cfg: DictConfig,
) -> None
Log feature pickle file as MLflow artifact.
| PARAMETER | DESCRIPTION |
|---|---|
| `run_name` | Run name for logging. TYPE: `str` |
| `output_path` | Path to the pickle file. TYPE: `str` |
| `mlflow_run` | MLflow run information (currently unused). TYPE: `Optional[Series]` |
| `cfg` | Configuration dictionary (currently unused). TYPE: `DictConfig` |
Source code in src/featurization/feature_log.py
get_best_run_per_source
¶
get_best_run_per_source(
cfg: DictConfig,
experiment_name: str = "PLR_Featurization",
skip_embeddings: bool = True,
) -> dict[str, Series]
Get the latest MLflow run for each unique data source.
| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` | Configuration dictionary. TYPE: `DictConfig` |
| `experiment_name` | MLflow experiment name, by default 'PLR_Featurization'. TYPE: `str` |
| `skip_embeddings` | If True, exclude embedding sources, by default True. TYPE: `bool` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary keyed by source name containing the best run Series. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If no runs are found for the experiment or a specific source. |
Source code in src/featurization/feature_log.py
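Selecting the latest run per source can be sketched with plain Python over run records (the field names here are illustrative stand-ins for the MLflow run columns used in feature_log.py):

```python
# Hypothetical run records; the real code queries MLflow and gets a DataFrame.
runs = [
    {"source": "pupil-gt", "end_time": 100},
    {"source": "pupil-gt", "end_time": 250},
    {"source": "SAITS", "end_time": 180},
]

def latest_run_per_source(records):
    best = {}
    for run in records:
        prev = best.get(run["source"])
        if prev is None or run["end_time"] > prev["end_time"]:
            best[run["source"]] = run
    if not best:
        raise ValueError("No runs found")  # mirrors the documented ValueError
    return best

best = latest_run_per_source(runs)
print(best["pupil-gt"]["end_time"])  # 250
```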
get_mlflow_run_by_id
¶
get_mlflow_run_by_id(
run_id: str,
source: str,
data_source: Optional[str],
model_name: Optional[str],
cfg: DictConfig,
task_key: str = "OUTLIER_DETECTION",
) -> Optional[Series]
Retrieve a specific MLflow run by its ID.
| PARAMETER | DESCRIPTION |
|---|---|
| `run_id` | MLflow run ID to retrieve. TYPE: `str` |
| `source` | Source name for logging. TYPE: `str` |
| `data_source` | Data source identifier. TYPE: `Optional[str]` |
| `model_name` | Model name for logging. TYPE: `Optional[str]` |
| `cfg` | Configuration dictionary. TYPE: `DictConfig` |
| `task_key` | Task key for experiment name lookup, by default 'OUTLIER_DETECTION'. TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `Series` or `None` | Run information as a Series, or None if not found. |
Source code in src/featurization/feature_log.py
import_features_per_source
¶
import_features_per_source(
source: str,
run: Series,
cfg: DictConfig,
subdir: str = "features",
) -> dict
Import features from an MLflow run artifact.
| PARAMETER | DESCRIPTION |
|---|---|
| `source` | Data source name. TYPE: `str` |
| `run` | MLflow run information. TYPE: `Series` |
| `cfg` | Configuration dictionary. TYPE: `DictConfig` |
| `subdir` | Artifact subdirectory, by default 'features'. TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Features dictionary with `data` (dict with 'train' and 'test' `pl.DataFrame`s) and `mlflow_run_imputation` (MLflow run info). |

| RAISES | DESCRIPTION |
|---|---|
| `Exception` | If the artifact download fails. |
Source code in src/featurization/feature_log.py
import_features_from_best_runs
¶
Import features from multiple best runs and add MLflow metadata.
| PARAMETER | DESCRIPTION |
|---|---|
| `best_runs` | Dictionary keyed by source name with MLflow run Series. |
| `cfg` | Configuration dictionary. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Features dictionary with added mlflow_run_featurization and mlflow_run_outlier_detection information. |
Source code in src/featurization/feature_log.py
import_features_from_mlflow
¶
Import features from MLflow for all data sources.
Retrieves best runs and downloads feature artifacts.
| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` | Configuration dictionary. |
| `experiment_name` | MLflow experiment name, by default 'PLR_Featurization'. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Features dictionary keyed by source name. |
Source code in src/featurization/feature_log.py
Handcrafted Features¶
subflow_handcrafted_featurization
¶
flow_handcrafted_featurization
¶
flow_handcrafted_featurization(
cfg: DictConfig,
sources: dict,
experiment_name: str,
prev_experiment_name: str,
) -> None
Execute handcrafted featurization for all data sources.
Iterates through all imputation sources and feature configurations, running the featurization script for each combination.
| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` | Configuration dictionary with PLR_FEATURIZATION settings. TYPE: `DictConfig` |
| `sources` | Dictionary of data sources keyed by source name. TYPE: `dict` |
| `experiment_name` | MLflow experiment name for featurization. TYPE: `str` |
| `prev_experiment_name` | Previous experiment name (imputation). TYPE: `str` |
Source code in src/featurization/subflow_handcrafted_featurization.py
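The sweep described above (every imputation source crossed with every feature configuration) amounts to a nested loop; a sketch with illustrative source and config names:

```python
# Sketch of the source x feature-config sweep; all names are illustrative.
sources = {"pupil-gt": {}, "simple1.0__LOF__SAITS": {}}
feature_cfgs = {"amplitude_bins": {}, "pipr": {}}

planned_runs = []
for source_name in sources:
    for feature_name, feature_cfg in feature_cfgs.items():
        # In the real flow this would invoke the featurization script
        # and log one run per combination to the MLflow experiment.
        planned_runs.append(f"{source_name}__{feature_name}")

print(len(planned_runs))  # 4
```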
Embedding Features¶
subflow_embedding
¶
embedding_script
¶
embedding_script(
cfg: DictConfig,
source_name: str,
source_data: dict,
model_name: str,
embedding_cfg: DictConfig,
run_name: str,
pre_embedding_cfg: DictConfig,
)
Execute embedding extraction for a single source.
Dispatches to model-specific embedding functions.
| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` | Main configuration dictionary. TYPE: `DictConfig` |
| `source_name` | Name of the data source. TYPE: `str` |
| `source_data` | Source data dictionary. TYPE: `dict` |
| `model_name` | Embedding model name (e.g., 'MOMENT'). TYPE: `str` |
| `embedding_cfg` | Embedding-specific configuration. TYPE: `DictConfig` |
| `run_name` | MLflow run name. TYPE: `str` |
| `pre_embedding_cfg` | Pre-embedding/post-processing configuration (e.g., PCA). TYPE: `DictConfig` |

| RAISES | DESCRIPTION |
|---|---|
| `NotImplementedError` | If model_name is not supported. |
Source code in src/featurization/embedding/subflow_embedding.py
if_embedding_not_done
¶
Check if embedding run has already been completed.
| PARAMETER | DESCRIPTION |
|---|---|
| `run_name` | Name of the run to check. |
| `experiment_name` | MLflow experiment name. |
| `cfg` | Configuration dictionary (currently unused). |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if the embedding still needs to be computed, False if it is already done. |
Source code in src/featurization/embedding/subflow_embedding.py
flow_embedding
¶
Execute embedding extraction flow for all sources and configurations.
Iterates through sources, embedding models, and preprocessing methods, computing embeddings for each combination.
| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` | Configuration dictionary with PLR_EMBEDDING and EMBEDDING settings. |
| `sources` | Dictionary of data sources keyed by source name. |
| `experiment_name` | MLflow experiment name for featurization. |
| `prev_experiment_name` | Previous experiment name (imputation). |
Source code in src/featurization/embedding/subflow_embedding.py
moment_embedding
¶
log_embeddings_to_mlflow
¶
log_embeddings_to_mlflow(
embeddings: dict[str, Any],
run_name: str,
model_name: str,
source_name: str,
save_as_numpy: bool = True,
) -> None
Save embeddings to disk and log as MLflow artifacts.
| PARAMETER | DESCRIPTION |
|---|---|
| `embeddings` | Dictionary with 'data' containing DataFrames per split. TYPE: `dict[str, Any]` |
| `run_name` | Run name for file naming. TYPE: `str` |
| `model_name` | Model name for file naming. TYPE: `str` |
| `source_name` | Source name (currently unused). TYPE: `str` |
| `save_as_numpy` | If True, save per-split numpy arrays, by default True. TYPE: `bool` |
Source code in src/featurization/embedding/moment_embedding.py
get_dataframe_from_dict
¶
get_dataframe_from_dict(
split_dict_subject: dict[str, dict[str, ndarray]],
cfg: DictConfig,
drop_col_wildcard: str = "mask",
) -> DataFrame
Convert subject dictionary to metadata DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
| `split_dict_subject` | Subject dictionary with metadata and labels. TYPE: `dict[str, dict[str, ndarray]]` |
| `cfg` | Configuration dictionary (currently unused). TYPE: `DictConfig` |
| `drop_col_wildcard` | Wildcard for columns to drop, by default 'mask'. TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | DataFrame with metadata columns prefixed with 'metadata_'. |
Source code in src/featurization/embedding/moment_embedding.py
create_pseudo_embedding_std
¶
Create placeholder standard deviation columns for embeddings.
Creates zero-filled columns to match handcrafted feature format.
| PARAMETER | DESCRIPTION |
|---|---|
| `embeddings_out` | Embedding array with shape (n_samples, n_features). |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | DataFrame with columns 'embedding{i}_std' filled with zeros. |
Source code in src/featurization/embedding/moment_embedding.py
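The placeholder std columns can be sketched with numpy; the column naming follows the documented 'embedding{i}_std' pattern, while the dict-of-arrays layout is illustrative (the real function returns a DataFrame):

```python
import numpy as np

embeddings_out = np.ones((4, 3))  # (n_samples, n_features)
# One zero-filled std column per embedding dimension, so embeddings line up
# with the handcrafted features' value/std column layout.
pseudo_std = {
    f"embedding{i}_std": np.zeros(embeddings_out.shape[0])
    for i in range(embeddings_out.shape[1])
}
print(sorted(pseudo_std))  # ['embedding0_std', 'embedding1_std', 'embedding2_std']
```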
create_embeddings_df
¶
Create DataFrame from embedding array.
| PARAMETER | DESCRIPTION |
|---|---|
| `embeddings_out` | Embedding array with shape (n_samples, n_features). |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | DataFrame with columns 'embedding{i}_value'. |
Source code in src/featurization/embedding/moment_embedding.py
create_split_embedding_df
¶
create_split_embedding_df(
embeddings_out: ndarray,
subject_codes: list[str],
df_metadata: DataFrame,
) -> DataFrame
Create complete embedding DataFrame with metadata and codes.
| PARAMETER | DESCRIPTION |
|---|---|
| `embeddings_out` | Embedding array with shape (n_samples, n_features). TYPE: `ndarray` |
| `subject_codes` | List of subject identifiers. TYPE: `list[str]` |
| `df_metadata` | Metadata DataFrame for subjects. TYPE: `DataFrame` |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Combined DataFrame with subject_code, embeddings, std, and metadata. |
Source code in src/featurization/embedding/moment_embedding.py
get_subject_dict_for_df
¶
get_subject_dict_for_df(
embeddings_out: ndarray,
split_dict: dict[str, dict[str, ndarray]],
cfg: DictConfig,
) -> dict[str, dict[str, ndarray]]
Prepare subject dictionary for DataFrame conversion.
Extracts first timepoint from arrays for scalar metadata.
| PARAMETER | DESCRIPTION |
|---|---|
| `embeddings_out` | Embedding array for validation of the subject count. TYPE: `ndarray` |
| `split_dict` | Split dictionary with metadata arrays. TYPE: `dict[str, dict[str, ndarray]]` |
| `cfg` | Configuration dictionary (currently unused). TYPE: `DictConfig` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Subject dictionary with scalar values per category. |

| RAISES | DESCRIPTION |
|---|---|
| `AssertionError` | If the embedding and input subject counts don't match. |
Source code in src/featurization/embedding/moment_embedding.py
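Extracting the first timepoint to turn per-timepoint metadata arrays into scalars can be sketched as follows; the array shapes and the `metadata`/`age` keys are assumptions for illustration:

```python
import numpy as np

# Assumed layout: each metadata array is (n_subjects, n_timepoints), with
# the value constant over time, so the first timepoint is representative.
split_dict = {"metadata": {"age": np.array([[63.0, 63.0], [41.0, 41.0]])}}
embeddings_out = np.zeros((2, 8))

scalars = {k: v[:, 0] for k, v in split_dict["metadata"].items()}
# Mirrors the documented AssertionError on mismatched subject counts.
assert embeddings_out.shape[0] == scalars["age"].shape[0]
print(scalars["age"].tolist())  # [63.0, 41.0]
```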
get_subject_codes
¶
Extract subject codes from split dictionary.
| PARAMETER | DESCRIPTION |
|---|---|
| `split_dict` | Split dictionary with metadata containing a subject_code array. |

| RETURNS | DESCRIPTION |
|---|---|
| `list` | List of subject code strings. |
Source code in src/featurization/embedding/moment_embedding.py
combine_embeddings_with_metadata_for_df
¶
combine_embeddings_with_metadata_for_df(
embeddings_out: ndarray,
split_dict: dict[str, dict[str, ndarray]],
cfg: DictConfig,
) -> DataFrame
Combine embedding array with subject metadata into DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
| `embeddings_out` | Embedding array with shape (n_samples, n_features). TYPE: `ndarray` |
| `split_dict` | Split dictionary with metadata. TYPE: `dict[str, dict[str, ndarray]]` |
| `cfg` | Configuration dictionary. TYPE: `DictConfig` |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Combined DataFrame with embeddings and metadata. |
See Also
compute_features_from_dict : Similar function for handcrafted features.
Source code in src/featurization/embedding/moment_embedding.py
get_embeddings_per_split
¶
get_embeddings_per_split(
model: Module,
dataloader: DataLoader,
split_dict: dict[str, dict[str, ndarray]],
model_cfg: DictConfig,
cfg: DictConfig,
) -> DataFrame
Compute embeddings for all batches in a data split.
| PARAMETER | DESCRIPTION |
|---|---|
| `model` | MOMENT model for embedding extraction. TYPE: `Module` |
| `dataloader` | PyTorch dataloader for the split. TYPE: `DataLoader` |
| `split_dict` | Split dictionary with metadata. TYPE: `dict[str, dict[str, ndarray]]` |
| `model_cfg` | Model configuration. TYPE: `DictConfig` |
| `cfg` | Main configuration with DEVICE settings. TYPE: `DictConfig` |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | DataFrame with embeddings and metadata. |

| RAISES | DESCRIPTION |
|---|---|
| `AssertionError` | If embeddings are None (model initialization issue). |
Source code in src/featurization/embedding/moment_embedding.py
get_embeddings
¶
get_embeddings(
model: Module,
dataloaders: dict[str, DataLoader],
source_data: dict[str, Any],
model_cfg: DictConfig,
cfg: DictConfig,
) -> dict[str, Any]
Compute embeddings for all data splits.
| PARAMETER | DESCRIPTION |
|---|---|
| `model` | MOMENT model for embedding extraction. TYPE: `Module` |
| `dataloaders` | Dictionary of dataloaders keyed by split name. TYPE: `dict[str, DataLoader]` |
| `source_data` | Source data with 'df' and 'mlflow' keys. TYPE: `dict[str, Any]` |
| `model_cfg` | Model configuration. TYPE: `DictConfig` |
| `cfg` | Main configuration. TYPE: `DictConfig` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with 'data' (embeddings per split) and 'mlflow_run'. |
Source code in src/featurization/embedding/moment_embedding.py
import_moment_embedder
¶
Import MOMENT model configured for embedding extraction.
| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` | Main configuration with DEVICE settings. |
| `model_cfg` | Model configuration for MOMENT. |

| RETURNS | DESCRIPTION |
|---|---|
| `Module` | MOMENT model ready for embedding extraction. |
See Also
https://github.com/moment-timeseries-foundation-model/moment/blob/main/tutorials/representation_learning.ipynb
Source code in src/featurization/embedding/moment_embedding.py
moment_embedder
¶
moment_embedder(
source_data: dict[str, Any],
source_name: str,
model_cfg: DictConfig,
cfg: DictConfig,
run_name: str,
model_name: str,
pre_embedding_cfg: Optional[DictConfig],
) -> None
Extract MOMENT embeddings for a data source with MLflow tracking.
Imports model, computes embeddings, optionally applies PCA, and logs results to MLflow.
| PARAMETER | DESCRIPTION |
|---|---|
| `source_data` | Source data dictionary with 'df' and 'mlflow' keys. TYPE: `dict[str, Any]` |
| `source_name` | Name of the data source. TYPE: `str` |
| `model_cfg` | MOMENT model configuration. TYPE: `DictConfig` |
| `cfg` | Main configuration dictionary. TYPE: `DictConfig` |
| `run_name` | MLflow run name. TYPE: `str` |
| `model_name` | Model name for logging. TYPE: `str` |
| `pre_embedding_cfg` | Pre-embedding configuration (e.g., PCA). TYPE: `Optional[DictConfig]` |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If pre_embedding_cfg has an unknown preprocessing method. |
Source code in src/featurization/embedding/moment_embedding.py
dim_reduction
¶
cap_dimensionality_of_PCA
¶
cap_dimensionality_of_PCA(
train_pcs: ndarray, test_pcs: ndarray, max_dim: int = 96
) -> tuple[ndarray, ndarray]
Cap PCA dimensionality to a maximum number of components.
Prevents downstream issues with classifiers that have feature limits (e.g., TabPFN max 100 features).
| PARAMETER | DESCRIPTION |
|---|---|
| `train_pcs` | Training principal components with shape (n_samples, n_components). TYPE: `ndarray` |
| `test_pcs` | Test principal components with shape (n_samples, n_components). TYPE: `ndarray` |
| `max_dim` | Maximum number of dimensions to keep, by default 96. TYPE: `int` |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | (train_pcs, test_pcs) with components capped at max_dim. |
Source code in src/featurization/embedding/dim_reduction.py
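Capping amounts to a column slice; a numpy sketch (96 mirrors the documented default, chosen to stay below limits such as TabPFN's 100 features):

```python
import numpy as np

def cap_pca_dims(train_pcs: np.ndarray, test_pcs: np.ndarray, max_dim: int = 96):
    # Keep at most max_dim leading components. PCA orders components by
    # explained variance, so truncation drops the least informative ones.
    return train_pcs[:, :max_dim], test_pcs[:, :max_dim]

train = np.zeros((10, 120))
test = np.zeros((5, 120))
train_c, test_c = cap_pca_dims(train, test)
print(train_c.shape, test_c.shape)  # (10, 96) (5, 96)
```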
apply_PCA_for_embedding
¶
Apply PCA dimensionality reduction to embeddings.
Standardizes features, fits PCA on training data, transforms both train and test, and logs results to MLflow.
| PARAMETER | DESCRIPTION |
|---|---|
| `embeddings` | Dictionary with 'data' containing train/test DataFrames. |
| `pca_config` | Configuration with 'explained_variance' and 'max_dim'. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Embeddings with PCA-transformed features. |
Source code in src/featurization/embedding/dim_reduction.py
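The standardize-then-PCA step can be sketched with a plain numpy SVD; this is only to show the fit-on-train / transform-both pattern and the explained-variance cutoff, not the actual implementation (which presumably uses scikit-learn):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 8))
X_test = rng.normal(size=(10, 8))

# Standardize with training statistics only, so no test-set leakage.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
Z_train = (X_train - mu) / sigma
Z_test = (X_test - mu) / sigma

# PCA via SVD of the standardized training matrix.
_, s, vt = np.linalg.svd(Z_train, full_matrices=False)
explained = np.cumsum(s**2) / np.sum(s**2)
# Smallest number of components reaching e.g. 95% explained variance.
n_comp = int(np.searchsorted(explained, 0.95) + 1)

train_pcs = Z_train @ vt[:n_comp].T
test_pcs = Z_test @ vt[:n_comp].T
print(train_pcs.shape[1] == test_pcs.shape[1])  # True
```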
assign_features_back_to_full_df
¶
Replace embedding features with PCA-transformed components.
| PARAMETER | DESCRIPTION |
|---|---|
| `embeddings` | Original embeddings dictionary. |
| `train_pcs` | PCA-transformed training data. |
| `test_pcs` | PCA-transformed test data. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Embeddings with replaced feature columns. |
Source code in src/featurization/embedding/dim_reduction.py
apply_dimensionality_reduction_for_feature_sources
¶
Apply dimensionality reduction to all embedding feature sources.
| PARAMETER | DESCRIPTION |
|---|---|
| `features` | Features dictionary keyed by source name. |
| `cfg` | Configuration with DIM_REDUCTION settings. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Features with dimensionality-reduced embeddings. |
Source code in src/featurization/embedding/dim_reduction.py
torch_to_numpy
¶
Convert PyTorch tensor to numpy array.
| PARAMETER | DESCRIPTION |
|---|---|
| `_torch_tensor` | Input tensor (currently unused; placeholder). |

| RETURNS | DESCRIPTION |
|---|---|
| `ndarray` | Numpy array. |
Notes
Not implemented - placeholder function.
Source code in src/featurization/embedding/dim_reduction.py
get_df_features
¶
Extract feature columns (ending in '_value') from DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
| `df` | Input DataFrame with embedding columns. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | DataFrame with only feature value columns. |
Source code in src/featurization/embedding/dim_reduction.py
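Feature columns are identified purely by the '_value' suffix; the selection logic amounts to a column-name filter (a pure-Python sketch; the actual function selects columns from a Polars DataFrame):

```python
# Illustrative column names following the documented naming scheme.
columns = [
    "subject_code",
    "embedding0_value",
    "embedding0_std",
    "embedding1_value",
    "metadata_class_label",
]
# Keep only the feature value columns; std and metadata columns are dropped.
feature_cols = [c for c in columns if c.endswith("_value")]
print(feature_cols)  # ['embedding0_value', 'embedding1_value']
```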
get_feature_embedding_df
¶
get_feature_embedding_df(
df: DataFrame,
label_col: str = "metadata_class_label",
return_classes_as_int: bool = True,
) -> tuple[DataFrame, ndarray, DataFrame]
Extract features, labels, and metadata from embedding DataFrame.
Separates feature columns from metadata and encodes labels.
| PARAMETER | DESCRIPTION |
|---|---|
| `df` | Input DataFrame with embeddings and metadata. TYPE: `DataFrame` |
| `label_col` | Column name for class labels, by default 'metadata_class_label'. TYPE: `str` |
| `return_classes_as_int` | If True, encode string labels as integers, by default True. TYPE: `bool` |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | (features_df, labels_array, metadata_df), where features_df contains only _value columns, labels_array is encoded (0/1), and metadata_df contains the remaining columns. |

| RAISES | DESCRIPTION |
|---|---|
| `AssertionError` | If labels don't have exactly 2 unique values (binary classification). |
Source code in src/featurization/embedding/dim_reduction.py
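The binary label encoding can be sketched as follows; the class names are hypothetical, and the sorted-order mapping is an assumption (the real code may use something like sklearn's LabelEncoder, which behaves the same way):

```python
# Hypothetical binary class labels.
labels = ["control", "case", "control", "case"]

classes = sorted(set(labels))
assert len(classes) == 2  # mirrors the documented AssertionError
mapping = {name: i for i, name in enumerate(classes)}  # alphabetical -> 0/1
encoded = [mapping[label] for label in labels]
print(encoded)  # [1, 0, 1, 0]
```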
combine_cols_to_out
¶
combine_cols_to_out(
embedding_train: ndarray,
embedding_test: ndarray,
train_df_out: DataFrame,
test_df_out: DataFrame,
) -> tuple[DataFrame, DataFrame]
Combine reduced embeddings with metadata DataFrames.
| PARAMETER | DESCRIPTION |
|---|---|
| `embedding_train` | Reduced training embeddings. TYPE: `ndarray` |
| `embedding_test` | Reduced test embeddings. TYPE: `ndarray` |
| `train_df_out` | Training metadata. TYPE: `DataFrame` |
| `test_df_out` | Test metadata. TYPE: `DataFrame` |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | (df_train, df_test) combined DataFrames. |
Source code in src/featurization/embedding/dim_reduction.py
umap_wrapper
¶
Apply UMAP dimensionality reduction to embeddings.
| PARAMETER | DESCRIPTION |
|---|---|
| `embeddings` | Dictionary with 'train' and 'test' DataFrames. |
| `dim_cfg` | Configuration with n_neighbors, n_components, random_state, transform_seed, and supervised settings. |
| `source_name` | Source name for logging (currently unused). |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Embeddings with UMAP-reduced features. |
Notes
Uses default UMAP parameters; hyperparameter optimization would be needed for a fair assessment.
Source code in src/featurization/embedding/dim_reduction.py
embedding_dim_reduction_wrapper
¶
Apply dimensionality reduction to embeddings based on configuration.
Reduces high-dimensional embeddings (e.g., 1024) to lower dimensions for visualization or classification.
| PARAMETER | DESCRIPTION |
|---|---|
| `embeddings` | Dictionary with 'train' and 'test' DataFrames. |
| `dim_cfg` | Configuration with 'method' and method-specific parameters. |
| `source_name` | Source name for logging. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Embeddings with reduced dimensionality. |

| RAISES | DESCRIPTION |
|---|---|
| `NotImplementedError` | If dim_cfg['method'] is not supported. |
Source code in src/featurization/embedding/dim_reduction.py
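The dispatch-on-method pattern behind the wrapper can be sketched as follows; the 'PCA' and 'UMAP' branch bodies are placeholders, and the exact method strings are assumptions:

```python
def reduce_embeddings(embeddings: dict, dim_cfg: dict) -> dict:
    method = dim_cfg["method"]
    if method == "PCA":
        return embeddings  # placeholder: the PCA path would run here
    if method == "UMAP":
        return embeddings  # placeholder: the UMAP path would run here
    # Mirrors the documented NotImplementedError for unsupported methods.
    raise NotImplementedError(f"Unknown dim reduction method: {method}")

try:
    reduce_embeddings({}, {"method": "t-SNE"})
except NotImplementedError as err:
    print("raised:", err)
```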
Visualization¶
visualize_features
¶
visualize_features_of_all_sources
¶
Generate and export feature visualizations for all data sources.
Creates visualizations combining features from multiple sources and logs them as MLflow artifacts.
| PARAMETER | DESCRIPTION |
|---|---|
| `features` | Dictionary of features keyed by data source. |
| `mlflow_infos` | MLflow run information for artifact logging. |
| `cfg` | Configuration with VISUALIZATION settings. |