preprocess¶
Data preprocessing utilities.
Overview¶
Preprocessing functions for PLR data preparation.
preprocess_PLR¶
get_standardization_stats¶
Retrieve the mean and standard deviation used for standardization.

| PARAMETER | DESCRIPTION |
|---|---|
| `split` | Data split to compute statistics from (typically `'train'`). |
| `col_name` | Column name in the data dictionary to standardize. |
| `data_dicts_df` | Nested dictionary containing data arrays organized by split and column. |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | `(mean, std)`: the NaN-aware mean (`float`) and NaN-aware standard deviation (`float`) of the specified column. |
Source code in src/preprocess/preprocess_PLR.py
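The statistics are NaN-aware, which matters when missing samples are stored as NaN. A minimal sketch of the computation, assuming a hypothetical `{split: {'data': {col: array}}}` layout (the project's exact nesting may differ):

```python
import numpy as np

def get_stats(data_dicts_df: dict, split: str = "train", col_name: str = "pupil") -> tuple:
    """NaN-aware mean and std of one column in one split (illustrative layout)."""
    arr = data_dicts_df[split]["data"][col_name]
    return float(np.nanmean(arr)), float(np.nanstd(arr))

data = {"train": {"data": {"pupil": np.array([1.0, 3.0, np.nan])}}}
mean, std = get_stats(data)
# NaNs are ignored: mean = 2.0, std = 1.0
```

Plain `np.mean`/`np.std` would return NaN here, which is why the NaN-aware variants are used.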
standardize_the_data_dict¶
Apply standardization to all columns across all splits.
Transforms each data column using z-score normalization: X_standardized = (X - mean) / stdev.

| PARAMETER | DESCRIPTION |
|---|---|
| `mean` | Mean value for standardization. |
| `stdev` | Standard deviation for standardization. |
| `data_dicts_df` | Nested dictionary with structure `{split: {'data': {col_name: array}}}`. |
| `cfg` | Configuration dictionary (currently unused but kept for API consistency). |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Updated data dictionary with standardized values. |
Source code in src/preprocess/preprocess_PLR.py
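A compact sketch of the z-score pass over the nested dictionary, assuming the `{split: {'data': {col: array}}}` structure documented above (illustrative, not the project's exact code):

```python
import numpy as np

def standardize_dict(data_dicts_df: dict, mean: float, stdev: float) -> dict:
    """Apply X_standardized = (X - mean) / stdev to every column in every split."""
    for split_dict in data_dicts_df.values():
        cols = split_dict["data"]
        for name, arr in cols.items():
            cols[name] = (arr - mean) / stdev
    return data_dicts_df

d = {"train": {"data": {"x": np.array([2.0, 4.0])}}}
standardize_dict(d, mean=2.0, stdev=2.0)
# d["train"]["data"]["x"] → [0.0, 1.0]
```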
destandardize_the_data_dict_for_featurization¶
Destandardize data before feature extraction.
Reverses standardization to restore the original scale, which is required for computing physiologically meaningful handcrafted features.

| PARAMETER | DESCRIPTION |
|---|---|
| `split` | Data split identifier (`'train'`, `'val'`, or `'test'`). |
| `split_dict` | Dictionary containing data for a single split. |
| `preprocess_dict` | Dictionary containing a `'standardization'` sub-dict with `'standardized'`, `'mean'`, and `'stdev'` keys. |
| `cfg` | Configuration dictionary. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Deep copy of `split_dict` with destandardized data values. |
Source code in src/preprocess/preprocess_PLR.py
destandardize_the_split_dict¶
Destandardize all non-mask columns in a split dictionary.
Applies the inverse z-score transformation: X_original = X_standardized * stdev + mean.

| PARAMETER | DESCRIPTION |
|---|---|
| `data_dicts_df` | Dictionary containing a `'data'` sub-dict with column arrays. |
| `split` | Data split identifier (used for logging). |
| `stdev` | Standard deviation used in the original standardization. |
| `mean` | Mean used in the original standardization. |
| `cfg` | Configuration dictionary (currently unused). |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Updated dictionary with destandardized values. |

Notes
The `'mask'` column is skipped, as it contains boolean/integer flags rather than continuous values that were standardized.
Source code in src/preprocess/preprocess_PLR.py
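The mask-skipping inverse transform can be sketched as follows, again assuming an illustrative `{'data': {col: array}}` layout:

```python
import numpy as np

def destandardize_split(split_dict: dict, mean: float, stdev: float) -> dict:
    """X_original = X_standardized * stdev + mean, skipping the 'mask' column."""
    for name, arr in split_dict["data"].items():
        if name == "mask":  # boolean/integer flags, never standardized
            continue
        split_dict["data"][name] = arr * stdev + mean
    return split_dict

d = {"data": {"x": np.array([0.0, 1.0]), "mask": np.array([1, 0])}}
destandardize_split(d, mean=2.0, stdev=2.0)
# x → [2.0, 4.0]; mask is left untouched
```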
standardize_data_dicts¶
Standardize all data dictionaries using training set statistics.
Computes the mean and standard deviation from the training split and applies standardization across all splits. Stores the computed statistics in the preprocess sub-dictionary.

| PARAMETER | DESCRIPTION |
|---|---|
| `data_dicts` | Main data dictionary containing `'df'` with nested split data. |
| `cfg` | Configuration with `PREPROCESS.col_name` specifying which column to use for computing statistics. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Updated data dictionary with standardized values and added `'preprocess.standardization'` metadata. |
Source code in src/preprocess/preprocess_PLR.py
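The key point is that statistics are fit on the train split only and then applied to every split, which prevents leakage from val/test. A minimal end-to-end sketch, with a hypothetical dictionary layout and column name:

```python
import numpy as np

def standardize_all(data_dicts: dict, col_name: str = "pupil") -> dict:
    """Fit mean/std on the train split only, standardize every split, store stats."""
    train = data_dicts["df"]["train"]["data"][col_name]
    mean, std = float(np.nanmean(train)), float(np.nanstd(train))
    for split_dict in data_dicts["df"].values():
        cols = split_dict["data"]
        for name in cols:
            cols[name] = (cols[name] - mean) / std
    data_dicts["preprocess"] = {
        "standardization": {"standardized": True, "mean": mean, "stdev": std}
    }
    return data_dicts

d = {"df": {"train": {"data": {"pupil": np.array([0.0, 4.0])}},
            "val": {"data": {"pupil": np.array([2.0])}}}}
standardize_all(d)
# train → [-1, 1]; val → [0], both scaled with the train statistics
```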
standardize_recons_arrays¶
Standardize reconstruction arrays using stored statistics.

| PARAMETER | DESCRIPTION |
|---|---|
| `array_in` | Input array to standardize. |
| `stdz_dict` | Dictionary containing `'mean'` and `'stdev'` for standardization. |

| RETURNS | DESCRIPTION |
|---|---|
| `ndarray` | Standardized array (deep copy of the input). |
Source code in src/preprocess/preprocess_PLR.py
preprocess_data_dicts¶
Main preprocessing entry point for data dictionaries.
Applies the configured preprocessing steps (currently only standardization) to the data dictionaries.

| PARAMETER | DESCRIPTION |
|---|---|
| `data_dicts` | Main data dictionary containing `'df'` with nested split data. |
| `cfg` | Configuration with `PREPROCESS` settings, including the `'standardize'` flag. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Preprocessed data dictionary. |
Source code in src/preprocess/preprocess_PLR.py
preprocess_data¶
preprocess_PLR_data¶

preprocess_PLR_data(
    X: ndarray,
    preprocess_cfg: Union[Dict[str, Any], DictConfig],
    preprocess_dict: Optional[Dict[str, Any]] = None,
    data_filtering: str = "gt",
    split: str = "train",
) -> Tuple[ndarray, Dict[str, Any]]

Preprocess PLR data by applying standardization if configured.

| PARAMETER | DESCRIPTION |
|---|---|
| `X` | Input PLR data array to preprocess. |
| `preprocess_cfg` | Configuration dictionary containing preprocessing settings, including the `'standardize'` and `'use_gt_stats_for_raw'` flags. |
| `preprocess_dict` | Dictionary to store/retrieve precomputed statistics. Default is `None`. |
| `data_filtering` | Type of data filtering applied (`'gt'` for ground truth, `'raw'` for raw data). Default is `'gt'`. |
| `split` | Data split identifier (`'train'`, `'val'`, or `'test'`). Default is `'train'`. |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | `(X, preprocess_dict)`: the preprocessed data array (`np.ndarray`) and the updated preprocessing dictionary with computed statistics. |
Source code in src/preprocess/preprocess_data.py
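A simplified, hypothetical sketch of this control flow (the function name, config keys, and branching here are assumptions for illustration; the real implementation also handles `data_filtering` and the `'use_gt_stats_for_raw'` flag):

```python
import numpy as np

def preprocess_plr_sketch(X, cfg, stats=None, split="train"):
    """Simplified flow: skip if 'standardize' is off; fit on train, reuse elsewhere."""
    stats = {} if stats is None else stats
    if not cfg.get("standardize", False):
        return X, stats
    if split == "train" or "mean" not in stats:
        stats["mean"], stats["stdev"] = float(np.nanmean(X)), float(np.nanstd(X))
    return (X - stats["mean"]) / stats["stdev"], stats

X_train_z, stats = preprocess_plr_sketch(np.array([0.0, 2.0, 4.0]), {"standardize": True})
X_val_z, _ = preprocess_plr_sketch(np.array([2.0]), {"standardize": True}, stats, split="val")
# X_val_z is [0.0]: the val split is scaled with the train statistics
```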
if_use_precomputed¶

if_use_precomputed(
    preprocess_dict: Dict[str, Any],
    preprocess_cfg: Union[Dict[str, Any], DictConfig],
    split: str,
    data_filtering: str,
) -> Tuple[bool, Optional[float], Optional[float], str]

Determine whether to use precomputed standardization statistics.

| PARAMETER | DESCRIPTION |
|---|---|
| `preprocess_dict` | Dictionary containing previously computed statistics. |
| `preprocess_cfg` | Configuration dictionary with preprocessing settings. |
| `split` | Data split identifier (`'train'`, `'val'`, or `'test'`). |
| `data_filtering` | Type of data filtering (`'gt'` or `'raw'`). |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | `(use_precomputed, mean, std, filterkey)`: whether to use precomputed statistics (`bool`), the precomputed mean and standard deviation if available (`float` or `None`), and the key used to retrieve statistics from the dictionary (`str`). |
Source code in src/preprocess/preprocess_data.py
log_stats_msg¶

log_stats_msg(
    mean: float,
    std: float,
    split: str,
    data_filtering: str,
    call_from: str = "precomputed",
) -> None

Log standardization statistics message for debugging.

| PARAMETER | DESCRIPTION |
|---|---|
| `mean` | Mean value of the data. |
| `std` | Standard deviation of the data. |
| `split` | Data split identifier (`'train'`, `'val'`, or `'test'`). |
| `data_filtering` | Type of data filtering (`'gt'` or `'raw'`). |
| `call_from` | Source of the call, either `'precomputed'` or `'standardize'`. Default is `'precomputed'`. |

| RAISES | DESCRIPTION |
|---|---|
| `NotImplementedError` | If `call_from` is neither `'precomputed'` nor `'standardize'`. |
Source code in src/preprocess/preprocess_data.py
standardize_with_precomputed_stats¶

standardize_with_precomputed_stats(
    X: ndarray,
    preprocess_dict: Dict[str, Any],
    data_filtering: str,
    filterkey: str,
    split: str,
) -> ndarray

Standardize data using precomputed mean and standard deviation.

| PARAMETER | DESCRIPTION |
|---|---|
| `X` | Input data array to standardize. |
| `preprocess_dict` | Dictionary containing precomputed standardization statistics. |
| `data_filtering` | Type of data filtering (`'gt'` or `'raw'`). |
| `filterkey` | Key to access the correct statistics in `preprocess_dict`. |
| `split` | Data split identifier (`'train'`, `'val'`, or `'test'`). |

| RETURNS | DESCRIPTION |
|---|---|
| `ndarray` | Standardized data array with zero mean and unit variance. |
Source code in src/preprocess/preprocess_data.py
compute_stats_and_standardize¶

compute_stats_and_standardize(
    preprocess_dict: Dict[str, Any],
    X: ndarray,
    data_filtering: str,
    split: str,
) -> Tuple[Dict[str, Any], ndarray]

Compute standardization statistics and apply standardization to data.
Fits a StandardScaler to the data, transforms it, and stores the computed mean and standard deviation in the preprocess dictionary.

| PARAMETER | DESCRIPTION |
|---|---|
| `preprocess_dict` | Dictionary to store the computed standardization statistics. |
| `X` | Input data array to standardize. |
| `data_filtering` | Type of data filtering (`'gt'` or `'raw'`). |
| `split` | Data split identifier (`'train'`, `'val'`, or `'test'`). |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | `(preprocess_dict, X)`: the updated dictionary with the computed mean and std, and the standardized data array (`np.ndarray`). |
Source code in src/preprocess/preprocess_data.py
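The source fits a scikit-learn `StandardScaler`; for whole-array (scalar) statistics the fit-transform-store step is equivalent to the NumPy sketch below (the dictionary shape and `filterkey` argument are illustrative assumptions):

```python
import numpy as np

def compute_and_standardize(preprocess_dict, X, filterkey):
    """Fit mean/std on X (as a StandardScaler would), transform, and store stats."""
    mean, std = float(np.nanmean(X)), float(np.nanstd(X))
    preprocess_dict[filterkey] = {"mean": mean, "stdev": std}
    return preprocess_dict, (X - mean) / std

stats, Xz = compute_and_standardize({}, np.array([0.0, 4.0]), "gt")
# stats["gt"] == {"mean": 2.0, "stdev": 2.0}; Xz == [-1.0, 1.0]
```

Note that `StandardScaler` computes per-column statistics; a single scalar pair is assumed here for simplicity.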
print_stdz_stats¶
Print standardization statistics from a fitted scaler.
Logs the mean and scale values at INFO level for training ground truth, and at DEBUG level for other splits/filters to reduce log clutter.

| PARAMETER | DESCRIPTION |
|---|---|
| `scaler` | Fitted StandardScaler object containing `mean_` and `scale_` attributes. |
| `split` | Data split identifier (`'train'`, `'val'`, or `'test'`). |
| `data_filtering` | Type of data filtering (`'gt'` or `'raw'`). |
Source code in src/preprocess/preprocess_data.py
debug_triplet_stats¶
Log debug statistics for the data filtering triplet.
Computes and logs the mean, standard deviation, and NaN count for ground truth, ground truth with missing values, and raw data.

| PARAMETER | DESCRIPTION |
|---|---|
| `X_gt` | Ground truth data array. |
| `X_gt_missing` | Ground truth data with missing values (NaNs). |
| `X_raw` | Raw unprocessed data array. |
| `split` | Data split identifier (`'train'`, `'val'`, or `'test'`). |

| RETURNS | DESCRIPTION |
|---|---|
| `None` | No value is returned; statistics are written to the log. |
Source code in src/preprocess/preprocess_data.py
destandardize_for_imputation_metric¶

destandardize_for_imputation_metric(
    targets: ndarray,
    predictions: ndarray,
    stdz_dict: Dict[str, Any],
) -> Tuple[ndarray, ndarray]

Destandardize targets and predictions for computing imputation metrics.
Reverses the standardization transformation so that metrics are computed in the original data scale.

| PARAMETER | DESCRIPTION |
|---|---|
| `targets` | Ground truth target values (potentially standardized). |
| `predictions` | Model predictions (potentially standardized). |
| `stdz_dict` | Standardization dictionary containing the `'standardized'` boolean and the `'mean'` and `'stdev'` values. |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | `(targets, predictions)`: destandardized target and prediction arrays (`np.ndarray`). |
Source code in src/preprocess/preprocess_data.py
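The `'standardized'` flag gates the inverse transform, so data that was never standardized passes through unchanged. A hedged sketch of that behavior (function name and dictionary keys taken from the description above):

```python
import numpy as np

def destandardize_pair(targets, predictions, stdz_dict):
    """Invert z-scoring on both arrays, but only if standardization was applied."""
    if not stdz_dict.get("standardized", False):
        return targets, predictions
    m, s = stdz_dict["mean"], stdz_dict["stdev"]
    return targets * s + m, predictions * s + m

t, p = destandardize_pair(np.array([0.0, 1.0]), np.array([0.5]),
                          {"standardized": True, "mean": 2.0, "stdev": 2.0})
# t → [2.0, 4.0], p → [3.0]
```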
destandardize_dict¶
Destandardize the mean values in an imputation results dictionary.

| PARAMETER | DESCRIPTION |
|---|---|
| `imputation_dict` | Dictionary containing imputation results with a `'mean'` key. |
| `mean` | Mean value used for the original standardization. |
| `std` | Standard deviation used for the original standardization. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Updated imputation dictionary with destandardized mean values. |

Notes
TODO: Confidence intervals (CI) are not yet destandardized.
Source code in src/preprocess/preprocess_data.py
destandardize_numpy¶
Reverse standardization on a numpy array.
Applies the inverse transformation: X_original = X_standardized * std + mean.

| PARAMETER | DESCRIPTION |
|---|---|
| `X` | Standardized data array. |
| `mean` | Mean value used for the original standardization. |
| `std` | Standard deviation used for the original standardization. |

| RETURNS | DESCRIPTION |
|---|---|
| `ndarray` | Destandardized data array in the original scale. |
Source code in src/preprocess/preprocess_data.py
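The inverse transform is a one-liner, and composing it with the forward z-score is an exact round trip. A minimal sketch:

```python
import numpy as np

def destandardize(X, mean, std):
    """X_original = X_standardized * std + mean."""
    return X * std + mean

X = np.array([3.0, 7.0])
Xz = (X - 5.0) / 2.0                                 # forward z-score, mean=5, std=2
assert np.allclose(destandardize(Xz, 5.0, 2.0), X)   # round trip restores X
```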
destandardize_for_imputation_metrics¶

destandardize_for_imputation_metrics(
    targets: ndarray,
    predictions: ndarray,
    preprocess_dict: Dict[str, Any],
) -> Tuple[ndarray, ndarray]

Destandardize targets and predictions with automatic scale detection.
Detects whether predictions and targets are on different scales (one destandardized, one not) and corrects accordingly before returning both in the original scale.

| PARAMETER | DESCRIPTION |
|---|---|
| `targets` | Ground truth target values. |
| `predictions` | Model predictions. |
| `preprocess_dict` | Dictionary containing a `'standardization'` sub-dict with `'standardized'`, `'mean'`, and `'stdev'` keys. |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | `(targets, predictions)`: destandardized target and prediction arrays (`np.ndarray`). |

Notes
If predictions are more than 100x larger than targets in absolute mean, predictions are assumed to be already destandardized and only the targets are destandardized.
Source code in src/preprocess/preprocess_data.py
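The 100x heuristic from the Notes can be sketched as below; the exact comparison and edge-case handling in the source may differ, so treat this as an illustration only:

```python
import numpy as np

def destandardize_with_scale_check(targets, predictions, mean, stdev):
    """If predictions are >100x larger than targets in absolute mean, assume
    they are already destandardized and transform only the targets."""
    t_mag = abs(float(np.nanmean(targets)))
    p_mag = abs(float(np.nanmean(predictions)))
    already_destandardized = t_mag > 0 and p_mag / t_mag > 100
    targets = targets * stdev + mean
    if not already_destandardized:
        predictions = predictions * stdev + mean
    return targets, predictions

t, p = destandardize_with_scale_check(
    np.array([0.5]), np.array([450.0]), mean=100.0, stdev=10.0)
# t → [105.0]; p stays [450.0] because 450 / 0.5 > 100
```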
preprocess_utils¶
compute_stats_per_split¶
Compute and log basic statistics for a data split.
Calculates the mean and standard deviation using NaN-aware functions and logs the results for debugging purposes.

| PARAMETER | DESCRIPTION |
|---|---|
| `X` | Data array for which to compute statistics. |
| `split_name` | Name of the data split (e.g., `'train_gt'`, `'val_missing'`), used for logging context. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary containing `'mean'` and `'std'` statistics. |

Notes
Train splits are expected to have near-perfect standardization (mean = 0, std = 1), while validation splits may deviate slightly. Missing-data splits differ from ground truth because masking is applied after standardization.
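A sketch of this debug helper, illustrating the sanity expectation from the Notes (the print format is an assumption; the source uses a logger):

```python
import numpy as np

def compute_stats_per_split_sketch(X, split_name):
    """NaN-aware mean/std for one split, reported for debugging."""
    stats = {"mean": float(np.nanmean(X)), "std": float(np.nanstd(X))}
    print(f"{split_name}: mean={stats['mean']:.3f} std={stats['std']:.3f}")
    return stats

s = compute_stats_per_split_sketch(np.array([-1.0, 1.0, np.nan]), "train_gt")
# a well-standardized train split reports mean ≈ 0 and std ≈ 1
```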