preprocess

Data preprocessing utilities.

Overview

Preprocessing functions for PLR data preparation.

preprocess_PLR

get_standardization_stats

get_standardization_stats(
    split: str, col_name: str, data_dicts_df: dict
)

Retrieve mean and standard deviation for standardization.

PARAMETER DESCRIPTION
split

Data split to compute statistics from (typically 'train').

TYPE: str

col_name

Column name in the data dictionary to standardize.

TYPE: str

data_dicts_df

Nested dictionary containing data arrays organized by split and column.

TYPE: dict

RETURNS DESCRIPTION
tuple

A tuple containing:

- mean : float
    NaN-aware mean of the specified column.
- std : float
    NaN-aware standard deviation of the specified column.

Source code in src/preprocess/preprocess_PLR.py
def get_standardization_stats(split: str, col_name: str, data_dicts_df: dict):
    """Retrieve mean and standard deviation for standardization.

    Parameters
    ----------
    split : str
        Data split to compute statistics from (typically 'train').
    col_name : str
        Column name in the data dictionary to standardize.
    data_dicts_df : dict
        Nested dictionary containing data arrays organized by split and column.

    Returns
    -------
    tuple
        A tuple containing:
        - mean : float
            NaN-aware mean of the specified column.
        - std : float
            NaN-aware standard deviation of the specified column.
    """
    logger.debug("Standardizing on split = {}, dict_key = {}".format(split, col_name))
    mean = np.nanmean(data_dicts_df[split]["data"][col_name])
    std = np.nanstd(data_dicts_df[split]["data"][col_name])
    return mean, std
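A minimal sketch of how these statistics are retrieved, assuming the nested `{split: {"data": {col_name: array}}}` layout described above (the dictionary contents here are purely illustrative):

```python
import numpy as np

# Illustrative nested dictionary in the {split: {"data": {col_name: array}}} layout
data_dicts_df = {
    "train": {"data": {"pupil": np.array([1.0, 2.0, np.nan, 3.0])}},
}

# NaN-aware statistics, as get_standardization_stats computes them
mean = np.nanmean(data_dicts_df["train"]["data"]["pupil"])
std = np.nanstd(data_dicts_df["train"]["data"]["pupil"])
```

Note that `np.nanstd` uses `ddof=0` by default, i.e. the population standard deviation over the non-NaN values.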

standardize_the_data_dict

standardize_the_data_dict(mean, stdev, data_dicts_df, cfg)

Apply standardization to all columns across all splits.

Transforms each data column using z-score normalization: X_standardized = (X - mean) / stdev.

PARAMETER DESCRIPTION
mean

Mean value for standardization.

TYPE: float

stdev

Standard deviation for standardization.

TYPE: float

data_dicts_df

Nested dictionary with structure {split: {'data': {col_name: array}}}.

TYPE: dict

cfg

Configuration dictionary (currently unused but kept for API consistency).

TYPE: DictConfig

RETURNS DESCRIPTION
dict

Updated data dictionary with standardized values.

Source code in src/preprocess/preprocess_PLR.py
def standardize_the_data_dict(mean, stdev, data_dicts_df, cfg):
    """Apply standardization to all columns across all splits.

    Transforms each data column using z-score normalization:
    X_standardized = (X - mean) / stdev.

    Parameters
    ----------
    mean : float
        Mean value for standardization.
    stdev : float
        Standard deviation for standardization.
    data_dicts_df : dict
        Nested dictionary with structure {split: {'data': {col_name: array}}}.
    cfg : DictConfig
        Configuration dictionary (currently unused but kept for API consistency).

    Returns
    -------
    dict
        Updated data dictionary with standardized values.
    """
    for split in data_dicts_df.keys():
        for col_name in data_dicts_df[split]["data"].keys():
            logger.debug("Standardizing column: {}".format(col_name))
            array_tmp = data_dicts_df[split]["data"][col_name]
            array_tmp = (array_tmp - mean) / stdev
            data_dicts_df[split]["data"][col_name] = array_tmp

    return data_dicts_df
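The loop above can be sketched with made-up train statistics; every split is transformed with the single `(mean, stdev)` pair, which in practice comes from the training split:

```python
import numpy as np

mean, stdev = 2.0, 1.0  # illustrative training-split statistics
data_dicts_df = {
    "train": {"data": {"pupil": np.array([1.0, 2.0, 3.0])}},
    "test": {"data": {"pupil": np.array([4.0, 5.0])}},
}

# Same z-score transformation as standardize_the_data_dict applies
for split in data_dicts_df:
    for col_name, arr in data_dicts_df[split]["data"].items():
        data_dicts_df[split]["data"][col_name] = (arr - mean) / stdev
```

Because the test split is scaled with train statistics, its standardized values need not have zero mean or unit variance; this is intentional and avoids leaking test-set information.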

destandardize_the_data_dict_for_featurization

destandardize_the_data_dict_for_featurization(
    split, split_dict, preprocess_dict, cfg
)

Destandardize data before feature extraction.

Reverses standardization to restore original scale, which is required for computing physiologically meaningful handcrafted features.

PARAMETER DESCRIPTION
split

Data split identifier ('train', 'val', or 'test').

TYPE: str

split_dict

Dictionary containing data for a single split.

TYPE: dict

preprocess_dict

Dictionary containing 'standardization' sub-dict with 'standardized', 'mean', and 'stdev' keys.

TYPE: dict

cfg

Configuration dictionary.

TYPE: DictConfig

RETURNS DESCRIPTION
dict

Deep copy of split_dict with destandardized data values.

Source code in src/preprocess/preprocess_PLR.py
def destandardize_the_data_dict_for_featurization(
    split, split_dict, preprocess_dict, cfg
):
    """Destandardize data before feature extraction.

    Reverses standardization to restore original scale, which is required
    for computing physiologically meaningful handcrafted features.

    Parameters
    ----------
    split : str
        Data split identifier ('train', 'val', or 'test').
    split_dict : dict
        Dictionary containing data for a single split.
    preprocess_dict : dict
        Dictionary containing 'standardization' sub-dict with 'standardized',
        'mean', and 'stdev' keys.
    cfg : DictConfig
        Configuration dictionary.

    Returns
    -------
    dict
        Deep copy of split_dict with destandardized data values.
    """
    if preprocess_dict["standardization"]["standardized"]:
        logger.info(
            "Destandardizing the data for featurization, split = {}".format(split)
        )
        mean = preprocess_dict["standardization"]["mean"]
        stdev = preprocess_dict["standardization"]["stdev"]
        dicts_out = deepcopy(split_dict)
        dicts_out = destandardize_the_split_dict(dicts_out, split, stdev, mean, cfg)
    else:
        logger.info("No standardization applied, so no destandardization needed")
        dicts_out = deepcopy(split_dict)
    return dicts_out

destandardize_the_split_dict

destandardize_the_split_dict(
    data_dicts_df, split, stdev, mean, cfg
)

Destandardize all non-mask columns in a split dictionary.

Applies inverse z-score transformation: X_original = X_standardized * stdev + mean.

PARAMETER DESCRIPTION
data_dicts_df

Dictionary containing 'data' sub-dict with column arrays.

TYPE: dict

split

Data split identifier (used for logging).

TYPE: str

stdev

Standard deviation used in original standardization.

TYPE: float

mean

Mean used in original standardization.

TYPE: float

cfg

Configuration dictionary (currently unused).

TYPE: DictConfig

RETURNS DESCRIPTION
dict

Updated dictionary with destandardized values.

Notes

The 'mask' column is skipped as it contains boolean/integer flags, not continuous values that were standardized.

Source code in src/preprocess/preprocess_PLR.py
def destandardize_the_split_dict(data_dicts_df, split, stdev, mean, cfg):
    """Destandardize all non-mask columns in a split dictionary.

    Applies inverse z-score transformation: X_original = X_standardized * stdev + mean.

    Parameters
    ----------
    data_dicts_df : dict
        Dictionary containing 'data' sub-dict with column arrays.
    split : str
        Data split identifier (used for logging).
    stdev : float
        Standard deviation used in original standardization.
    mean : float
        Mean used in original standardization.
    cfg : DictConfig
        Configuration dictionary (currently unused).

    Returns
    -------
    dict
        Updated dictionary with destandardized values.

    Notes
    -----
    The 'mask' column is skipped as it contains boolean/integer flags,
    not continuous values that were standardized.
    """
    for col_name in data_dicts_df["data"].keys():
        if col_name != "mask":
            # i.e. the inverse z-score transform
            logger.debug("DeStandardizing column: {}".format(col_name))
            array_tmp = data_dicts_df["data"][col_name]
            array_tmp = (array_tmp * stdev) + mean
            data_dicts_df["data"][col_name] = array_tmp
    return data_dicts_df
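A small sketch of the same logic with illustrative values, showing that the `mask` column passes through untouched:

```python
import numpy as np

mean, stdev = 2.0, 0.5  # illustrative statistics from the original standardization
split_dict = {
    "data": {
        "pupil": np.array([-2.0, 0.0, 2.0]),  # standardized signal
        "mask": np.array([1, 0, 1]),          # boolean/integer flags: never rescaled
    }
}

# Inverse z-score on every non-mask column, as in destandardize_the_split_dict
for col_name, arr in split_dict["data"].items():
    if col_name != "mask":
        split_dict["data"][col_name] = arr * stdev + mean
```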

standardize_data_dicts

standardize_data_dicts(data_dicts: dict, cfg: DictConfig)

Standardize all data dictionaries using training set statistics.

Computes mean and standard deviation from the training split and applies standardization across all splits. Stores computed statistics in the preprocess sub-dictionary.

PARAMETER DESCRIPTION
data_dicts

Main data dictionary containing 'df' with nested split data.

TYPE: dict

cfg

Configuration with PREPROCESS.col_name specifying which column to use for computing statistics.

TYPE: DictConfig

RETURNS DESCRIPTION
dict

Updated data dictionary with standardized values and added 'preprocess.standardization' metadata.

Source code in src/preprocess/preprocess_PLR.py
def standardize_data_dicts(data_dicts: dict, cfg: DictConfig):
    """Standardize all data dictionaries using training set statistics.

    Computes mean and standard deviation from the training split and applies
    standardization across all splits. Stores computed statistics in the
    preprocess sub-dictionary.

    Parameters
    ----------
    data_dicts : dict
        Main data dictionary containing 'df' with nested split data.
    cfg : DictConfig
        Configuration with PREPROCESS.col_name specifying which column
        to use for computing statistics.

    Returns
    -------
    dict
        Updated data dictionary with standardized values and added
        'preprocess.standardization' metadata.
    """
    mean, stdev = get_standardization_stats(
        split="train",
        col_name=cfg["PREPROCESS"]["col_name"],
        data_dicts_df=data_dicts["df"],
    )

    logger.info("Standardizing, mean = {}, stdev = {}".format(mean, stdev))
    data_dicts["df"] = standardize_the_data_dict(
        mean=mean, stdev=stdev, data_dicts_df=data_dicts["df"], cfg=cfg
    )

    if "preprocess" not in data_dicts:
        data_dicts["preprocess"] = {}
    data_dicts["preprocess"]["standardization"] = {
        "standardized": True,
        "mean": mean,
        "stdev": stdev,
    }

    return data_dicts
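The whole flow can be sketched end to end with a toy dictionary and config; key names follow the `cfg["PREPROCESS"]["col_name"]` access shown in the source, while the data values are made up:

```python
import numpy as np

cfg = {"PREPROCESS": {"col_name": "pupil"}}  # illustrative config
data_dicts = {"df": {"train": {"data": {"pupil": np.array([0.0, 2.0, 4.0])}}}}

col = cfg["PREPROCESS"]["col_name"]
train_col = data_dicts["df"]["train"]["data"][col]
mean = float(np.nanmean(train_col))
stdev = float(np.nanstd(train_col))

# Standardize every split with the train statistics (one split here)
for split in data_dicts["df"]:
    for name, arr in data_dicts["df"][split]["data"].items():
        data_dicts["df"][split]["data"][name] = (arr - mean) / stdev

# Record the statistics so the transform can be inverted later
data_dicts.setdefault("preprocess", {})["standardization"] = {
    "standardized": True, "mean": mean, "stdev": stdev,
}
```

Storing `mean` and `stdev` alongside the data is what later makes `destandardize_the_data_dict_for_featurization` possible.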

standardize_recons_arrays

standardize_recons_arrays(array_in, stdz_dict: dict)

Standardize reconstruction arrays using stored statistics.

PARAMETER DESCRIPTION
array_in

Input array to standardize.

TYPE: ndarray

stdz_dict

Dictionary containing 'mean' and 'stdev' for standardization.

TYPE: dict

RETURNS DESCRIPTION
ndarray

Standardized array (deep copy of input).

Source code in src/preprocess/preprocess_PLR.py
def standardize_recons_arrays(array_in, stdz_dict: dict):
    """Standardize reconstruction arrays using stored statistics.

    Parameters
    ----------
    array_in : np.ndarray
        Input array to standardize.
    stdz_dict : dict
        Dictionary containing 'mean' and 'stdev' for standardization.

    Returns
    -------
    np.ndarray
        Standardized array (deep copy of input).
    """
    array_out = deepcopy(array_in)
    array_out = array_out - stdz_dict["mean"]
    array_out = array_out / stdz_dict["stdev"]
    return array_out

preprocess_data_dicts

preprocess_data_dicts(data_dicts: dict, cfg: DictConfig)

Main preprocessing entry point for data dictionaries.

Applies configured preprocessing steps (currently only standardization) to the data dictionaries.

PARAMETER DESCRIPTION
data_dicts

Main data dictionary containing 'df' with nested split data.

TYPE: dict

cfg

Configuration with PREPROCESS settings, including 'standardize' flag.

TYPE: DictConfig

RETURNS DESCRIPTION
dict

Preprocessed data dictionary.

Source code in src/preprocess/preprocess_PLR.py
def preprocess_data_dicts(data_dicts: dict, cfg: DictConfig):
    """Main preprocessing entry point for data dictionaries.

    Applies configured preprocessing steps (currently only standardization)
    to the data dictionaries.

    Parameters
    ----------
    data_dicts : dict
        Main data dictionary containing 'df' with nested split data.
    cfg : DictConfig
        Configuration with PREPROCESS settings, including 'standardize' flag.

    Returns
    -------
    dict
        Preprocessed data dictionary.
    """
    if cfg["PREPROCESS"]["standardize"]:
        data_dicts = standardize_data_dicts(data_dicts=data_dicts, cfg=cfg)
    else:
        logger.info("No standardization applied")

    return data_dicts

preprocess_data

preprocess_PLR_data

preprocess_PLR_data(
    X: ndarray,
    preprocess_cfg: Union[Dict[str, Any], DictConfig],
    preprocess_dict: Optional[Dict[str, Any]] = None,
    data_filtering: str = "gt",
    split: str = "train",
) -> Tuple[ndarray, Dict[str, Any]]

Preprocess PLR data by applying standardization if configured.

PARAMETER DESCRIPTION
X

Input PLR data array to preprocess.

TYPE: ndarray

preprocess_cfg

Configuration dictionary containing preprocessing settings, including 'standardize' and 'use_gt_stats_for_raw' flags.

TYPE: dict

preprocess_dict

Dictionary to store/retrieve precomputed statistics. Default is None.

TYPE: dict DEFAULT: None

data_filtering

Type of data filtering applied ('gt' for ground truth, 'raw' for raw data). Default is 'gt'.

TYPE: str DEFAULT: 'gt'

split

Data split identifier ('train', 'val', or 'test'). Default is 'train'.

TYPE: str DEFAULT: 'train'

RETURNS DESCRIPTION
tuple

A tuple containing:

- X : np.ndarray
    The preprocessed data array.
- preprocess_dict : dict
    Updated preprocessing dictionary with computed statistics.

Source code in src/preprocess/preprocess_data.py
def preprocess_PLR_data(
    X: np.ndarray,
    preprocess_cfg: Union[Dict[str, Any], DictConfig],
    preprocess_dict: Optional[Dict[str, Any]] = None,
    data_filtering: str = "gt",
    split: str = "train",
) -> Tuple[np.ndarray, Dict[str, Any]]:
    """Preprocess PLR data by applying standardization if configured.

    Parameters
    ----------
    X : np.ndarray
        Input PLR data array to preprocess.
    preprocess_cfg : dict
        Configuration dictionary containing preprocessing settings,
        including 'standardize' and 'use_gt_stats_for_raw' flags.
    preprocess_dict : dict, optional
        Dictionary to store/retrieve precomputed statistics. Default is None.
    data_filtering : str, optional
        Type of data filtering applied ('gt' for ground truth, 'raw' for raw data).
        Default is 'gt'.
    split : str, optional
        Data split identifier ('train', 'val', or 'test'). Default is 'train'.

    Returns
    -------
    tuple
        A tuple containing:
        - X : np.ndarray
            The preprocessed data array.
        - preprocess_dict : dict
            Updated preprocessing dictionary with computed statistics.
    """
    if preprocess_dict is None:
        preprocess_dict = {}

    if preprocess_cfg["standardize"]:
        use_precomputed, mean, std, filterkey = if_use_precomputed(
            preprocess_dict, preprocess_cfg, split, data_filtering
        )
        if use_precomputed:
            X = standardize_with_precomputed_stats(
                X, preprocess_dict, data_filtering, filterkey, split
            )
        else:
            preprocess_dict, X = compute_stats_and_standardize(
                preprocess_dict, X, data_filtering, split
            )

    logger.debug(
        'Number of NaNs in the "{}" data: {}'.format(data_filtering, np.isnan(X).sum())
    )

    return X, preprocess_dict

if_use_precomputed

if_use_precomputed(
    preprocess_dict: Dict[str, Any],
    preprocess_cfg: Union[Dict[str, Any], DictConfig],
    split: str,
    data_filtering: str,
) -> Tuple[bool, Optional[float], Optional[float], str]

Determine whether to use precomputed standardization statistics.

PARAMETER DESCRIPTION
preprocess_dict

Dictionary containing previously computed statistics.

TYPE: dict

preprocess_cfg

Configuration dictionary with preprocessing settings.

TYPE: dict

split

Data split identifier ('train', 'val', or 'test').

TYPE: str

data_filtering

Type of data filtering ('gt' or 'raw').

TYPE: str

RETURNS DESCRIPTION
tuple

A tuple containing:

- use_precomputed : bool
    Whether to use precomputed statistics.
- mean : float or None
    Precomputed mean value if available.
- std : float or None
    Precomputed standard deviation if available.
- filterkey : str
    The key used to retrieve statistics from the dictionary.

Source code in src/preprocess/preprocess_data.py
def if_use_precomputed(
    preprocess_dict: Dict[str, Any],
    preprocess_cfg: Union[Dict[str, Any], DictConfig],
    split: str,
    data_filtering: str,
) -> Tuple[bool, Optional[float], Optional[float], str]:
    """Determine whether to use precomputed standardization statistics.

    Parameters
    ----------
    preprocess_dict : dict
        Dictionary containing previously computed statistics.
    preprocess_cfg : dict
        Configuration dictionary with preprocessing settings.
    split : str
        Data split identifier ('train', 'val', or 'test').
    data_filtering : str
        Type of data filtering ('gt' or 'raw').

    Returns
    -------
    tuple
        A tuple containing:
        - use_precomputed : bool
            Whether to use precomputed statistics.
        - mean : float or None
            Precomputed mean value if available.
        - std : float or None
            Precomputed standard deviation if available.
        - filterkey : str
            The key used to retrieve statistics from the dictionary.
    """
    mean, std, filterkey = None, None, "gt"
    if len(preprocess_dict) == 0:
        return False, mean, std, filterkey
    elif "standardize" in preprocess_dict:
        if preprocess_cfg["use_gt_stats_for_raw"]:
            logger.debug("Use mean&stdev from GT for raw data")
            if "gt" in preprocess_dict["standardize"]:
                mean = preprocess_dict["standardize"]["gt"]["mean"]
                std = preprocess_dict["standardize"]["gt"]["std"]
                log_stats_msg(mean, std, split, data_filtering, "precomputed")
                filterkey = "gt"
                return True, mean, std, filterkey
            else:
                return False, mean, std, filterkey
        else:
            raise NotImplementedError("Not implemented yet")

    # Fallback: dictionary is non-empty but holds no usable precomputed stats
    return False, mean, std, filterkey

log_stats_msg

log_stats_msg(
    mean: float,
    std: float,
    split: str,
    data_filtering: str,
    call_from: str = "precomputed",
) -> None

Log standardization statistics message for debugging.

PARAMETER DESCRIPTION
mean

Mean value of the data.

TYPE: float

std

Standard deviation of the data.

TYPE: float

split

Data split identifier ('train', 'val', or 'test').

TYPE: str

data_filtering

Type of data filtering ('gt' or 'raw').

TYPE: str

call_from

Source of the call, either 'precomputed' or 'standardize'. Default is 'precomputed'.

TYPE: str DEFAULT: 'precomputed'

RAISES DESCRIPTION
NotImplementedError

If call_from is neither 'precomputed' nor 'standardize'.

Source code in src/preprocess/preprocess_data.py
def log_stats_msg(
    mean: float,
    std: float,
    split: str,
    data_filtering: str,
    call_from: str = "precomputed",
) -> None:
    """Log standardization statistics message for debugging.

    Parameters
    ----------
    mean : float
        Mean value of the data.
    std : float
        Standard deviation of the data.
    split : str
        Data split identifier ('train', 'val', or 'test').
    data_filtering : str
        Type of data filtering ('gt' or 'raw').
    call_from : str, optional
        Source of the call, either 'precomputed' or 'standardize'.
        Default is 'precomputed'.

    Raises
    ------
    NotImplementedError
        If call_from is neither 'precomputed' nor 'standardize'.
    """
    if call_from == "precomputed":
        string = "Mean&Std already precomputed"
    elif call_from == "standardize":
        string = "STATS after standardization"
    else:
        raise NotImplementedError("Unknown call_from = {}".format(call_from))

    logger.debug(
        "{}: mean = {}, std = {}, split = {}, data_filtering = {}".format(
            string, mean, std, split, data_filtering
        )
    )

standardize_with_precomputed_stats

standardize_with_precomputed_stats(
    X: ndarray,
    preprocess_dict: Dict[str, Any],
    data_filtering: str,
    filterkey: str,
    split: str,
) -> ndarray

Standardize data using precomputed mean and standard deviation.

PARAMETER DESCRIPTION
X

Input data array to standardize.

TYPE: ndarray

preprocess_dict

Dictionary containing precomputed standardization statistics.

TYPE: dict

data_filtering

Type of data filtering ('gt' or 'raw').

TYPE: str

filterkey

Key to access the correct statistics in preprocess_dict.

TYPE: str

split

Data split identifier ('train', 'val', or 'test').

TYPE: str

RETURNS DESCRIPTION
ndarray

Standardized data array with zero mean and unit variance.

Source code in src/preprocess/preprocess_data.py
def standardize_with_precomputed_stats(
    X: np.ndarray,
    preprocess_dict: Dict[str, Any],
    data_filtering: str,
    filterkey: str,
    split: str,
) -> np.ndarray:
    """Standardize data using precomputed mean and standard deviation.

    Parameters
    ----------
    X : np.ndarray
        Input data array to standardize.
    preprocess_dict : dict
        Dictionary containing precomputed standardization statistics.
    data_filtering : str
        Type of data filtering ('gt' or 'raw').
    filterkey : str
        Key to access the correct statistics in preprocess_dict.
    split : str
        Data split identifier ('train', 'val', or 'test').

    Returns
    -------
    np.ndarray
        Standardized data array with zero mean and unit variance.
    """
    X = (X - preprocess_dict["standardize"][filterkey]["mean"]) / preprocess_dict[
        "standardize"
    ][filterkey]["std"]
    logger.debug(
        "Data has been standardized, mean = {}, std = {}".format(
            np.nanmean(X), np.nanstd(X)
        )
    )
    return X

compute_stats_and_standardize

compute_stats_and_standardize(
    preprocess_dict: Dict[str, Any],
    X: ndarray,
    data_filtering: str,
    split: str,
) -> Tuple[Dict[str, Any], ndarray]

Compute standardization statistics and apply standardization to data.

Fits a StandardScaler to the data, transforms it, and stores the computed mean and standard deviation in the preprocess dictionary.

PARAMETER DESCRIPTION
preprocess_dict

Dictionary to store the computed standardization statistics.

TYPE: dict

X

Input data array to standardize.

TYPE: ndarray

data_filtering

Type of data filtering ('gt' or 'raw').

TYPE: str

split

Data split identifier ('train', 'val', or 'test').

TYPE: str

RETURNS DESCRIPTION
tuple

A tuple containing:

- preprocess_dict : dict
    Updated dictionary with computed mean and std.
- X : np.ndarray
    Standardized data array.

Source code in src/preprocess/preprocess_data.py
def compute_stats_and_standardize(
    preprocess_dict: Dict[str, Any],
    X: np.ndarray,
    data_filtering: str,
    split: str,
) -> Tuple[Dict[str, Any], np.ndarray]:
    """Compute standardization statistics and apply standardization to data.

    Fits a StandardScaler to the data, transforms it, and stores the
    computed mean and standard deviation in the preprocess dictionary.

    Parameters
    ----------
    preprocess_dict : dict
        Dictionary to store the computed standardization statistics.
    X : np.ndarray
        Input data array to standardize.
    data_filtering : str
        Type of data filtering ('gt' or 'raw').
    split : str
        Data split identifier ('train', 'val', or 'test').

    Returns
    -------
    tuple
        A tuple containing:
        - preprocess_dict : dict
            Updated dictionary with computed mean and std.
        - X : np.ndarray
            Standardized data array.
    """
    no_samples = X.shape[0] * X.shape[1]
    scaler = StandardScaler()
    scaler.fit(X.reshape(no_samples, -1))
    X = scaler.transform(X.reshape(no_samples, -1)).reshape(X.shape)
    print_stdz_stats(scaler, split, data_filtering)

    if "standardize" not in preprocess_dict:
        preprocess_dict["standardize"] = {}

    if data_filtering not in preprocess_dict["standardize"]:
        preprocess_dict["standardize"][data_filtering] = {}

    preprocess_dict["standardize"][data_filtering]["mean"] = float(scaler.mean_)
    preprocess_dict["standardize"][data_filtering]["std"] = float(scaler.scale_)

    log_stats_msg(np.nanmean(X), np.nanstd(X), split, data_filtering, "standardize")

    return preprocess_dict, X
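The `reshape(no_samples, -1)` step flattens a 2D `(n_series, n_timepoints)` array into a single `(n, 1)` "feature" column, so the fitted `StandardScaler` holds one global mean/std over all elements (which is why `float(scaler.mean_)` works). A NumPy-only sketch of the equivalent computation, with illustrative data:

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])  # illustrative (n_series, n_timepoints) array
n = X.shape[0] * X.shape[1]

# One (n, 1) column -> one global mean/std, as StandardScaler would compute
flat = X.reshape(n, -1)
mean, std = float(flat.mean()), float(flat.std())
X_std = (X - mean) / std
```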

print_stdz_stats

print_stdz_stats(
    scaler: StandardScaler, split: str, data_filtering: str
) -> None

Print standardization statistics from a fitted scaler.

Logs the mean and scale values at INFO level for training ground truth, and at DEBUG level for other splits/filters to reduce log clutter.

PARAMETER DESCRIPTION
scaler

Fitted StandardScaler object containing mean_ and scale_ attributes.

TYPE: StandardScaler

split

Data split identifier ('train', 'val', or 'test').

TYPE: str

data_filtering

Type of data filtering ('gt' or 'raw').

TYPE: str

Source code in src/preprocess/preprocess_data.py
def print_stdz_stats(scaler: StandardScaler, split: str, data_filtering: str) -> None:
    """Print standardization statistics from a fitted scaler.

    Logs the mean and scale values at INFO level for training ground truth,
    and at DEBUG level for other splits/filters to reduce log clutter.

    Parameters
    ----------
    scaler : sklearn.preprocessing.StandardScaler
        Fitted StandardScaler object containing mean_ and scale_ attributes.
    split : str
        Data split identifier ('train', 'val', or 'test').
    data_filtering : str
        Type of data filtering ('gt' or 'raw').
    """
    if split == "train" and data_filtering == "gt":
        # Print only once the standardized stats to reduce clutter
        logger.info(
            "Standardized (split = {}, split_key = {}), mean = {}, std = {}".format(
                split, data_filtering, scaler.mean_, scaler.scale_
            )
        )
    else:
        logger.debug(
            "Standardized, mean = {}, std = {}".format(scaler.mean_, scaler.scale_)
        )

debug_triplet_stats

debug_triplet_stats(
    X_gt: ndarray,
    X_gt_missing: ndarray,
    X_raw: ndarray,
    split: str,
) -> None

Log debug statistics for the data filtering triplet.

Computes and logs mean, standard deviation, and NaN count for ground truth, ground truth with missing values, and raw data.

PARAMETER DESCRIPTION
X_gt

Ground truth data array.

TYPE: ndarray

X_gt_missing

Ground truth data with missing values (NaNs).

TYPE: ndarray

X_raw

Raw unprocessed data array.

TYPE: ndarray

split

Data split identifier ('train', 'val', or 'test').

TYPE: str

RETURNS DESCRIPTION
None

Source code in src/preprocess/preprocess_data.py
def debug_triplet_stats(
    X_gt: np.ndarray,
    X_gt_missing: np.ndarray,
    X_raw: np.ndarray,
    split: str,
) -> None:
    """Log debug statistics for the data filtering triplet.

    Computes and logs mean, standard deviation, and NaN count for
    ground truth, ground truth with missing values, and raw data.

    Parameters
    ----------
    X_gt : np.ndarray
        Ground truth data array.
    X_gt_missing : np.ndarray
        Ground truth data with missing values (NaNs).
    X_raw : np.ndarray
        Raw unprocessed data array.
    split : str
        Data split identifier ('train', 'val', or 'test').

    Returns
    -------
    None
    """

    def stats_per_split(X: np.ndarray, split: str) -> Dict[str, float]:
        logger.debug(
            "{}: mean = {}, std = {}, no_NaN = {}".format(
                split, np.nanmean(X), np.nanstd(X), np.isnan(X).sum()
            )
        )
        return {"mean": np.nanmean(X), "std": np.nanstd(X), "no_NaN": np.isnan(X).sum()}

    logger.debug("DEBUG FOR THE 'FILTERING TRIPLET', split = {}:".format(split))
    stats_per_split(X_gt, "GT")
    stats_per_split(X_gt_missing, "GT_MISSING")
    stats_per_split(X_raw, "RAW")

    return None

destandardize_for_imputation_metric

destandardize_for_imputation_metric(
    targets: ndarray,
    predictions: ndarray,
    stdz_dict: Dict[str, Any],
) -> Tuple[ndarray, ndarray]

Destandardize targets and predictions for computing imputation metrics.

Reverses the standardization transformation to compute metrics in the original data scale.

PARAMETER DESCRIPTION
targets

Ground truth target values (potentially standardized).

TYPE: ndarray

predictions

Model predictions (potentially standardized).

TYPE: ndarray

stdz_dict

Standardization dictionary containing 'standardized' boolean, 'mean', and 'stdev' values.

TYPE: dict

RETURNS DESCRIPTION
tuple

A tuple containing:

- targets : np.ndarray
    Destandardized target values.
- predictions : np.ndarray
    Destandardized prediction values.

Source code in src/preprocess/preprocess_data.py
def destandardize_for_imputation_metric(
    targets: np.ndarray,
    predictions: np.ndarray,
    stdz_dict: Dict[str, Any],
) -> Tuple[np.ndarray, np.ndarray]:
    """Destandardize targets and predictions for computing imputation metrics.

    Reverses the standardization transformation to compute metrics in
    the original data scale.

    Parameters
    ----------
    targets : np.ndarray
        Ground truth target values (potentially standardized).
    predictions : np.ndarray
        Model predictions (potentially standardized).
    stdz_dict : dict
        Standardization dictionary containing 'standardized' boolean,
        'mean', and 'stdev' values.

    Returns
    -------
    tuple
        A tuple containing:
        - targets : np.ndarray
            Destandardized target values.
        - predictions : np.ndarray
            Destandardized prediction values.
    """
    if stdz_dict["standardized"]:
        targets = destandardize_numpy(targets, stdz_dict["mean"], stdz_dict["stdev"])
        predictions = destandardize_numpy(
            predictions, stdz_dict["mean"], stdz_dict["stdev"]
        )

    return targets, predictions

destandardize_dict

destandardize_dict(
    imputation_dict: Dict[str, Any], mean: float, std: float
) -> Dict[str, Any]

Destandardize the mean values in an imputation results dictionary.

PARAMETER DESCRIPTION
imputation_dict

Dictionary containing imputation results with a 'mean' key.

TYPE: dict

mean

Mean value used for original standardization.

TYPE: float

std

Standard deviation used for original standardization.

TYPE: float

RETURNS DESCRIPTION
dict

Updated imputation dictionary with destandardized mean values.

Notes

TODO: Confidence intervals (CI) are not yet destandardized.

Source code in src/preprocess/preprocess_data.py
def destandardize_dict(
    imputation_dict: Dict[str, Any],
    mean: float,
    std: float,
) -> Dict[str, Any]:
    """Destandardize the mean values in an imputation results dictionary.

    Parameters
    ----------
    imputation_dict : dict
        Dictionary containing imputation results with a 'mean' key.
    mean : float
        Mean value used for original standardization.
    std : float
        Standard deviation used for original standardization.

    Returns
    -------
    dict
        Updated imputation dictionary with destandardized mean values.

    Notes
    -----
    TODO: Confidence intervals (CI) are not yet destandardized.
    """
    logger.debug(
        "De-standardizing the imputation results with mean = {} and std = {}".format(
            mean, std
        )
    )
    imputation_dict["mean"] = imputation_dict["mean"] * std + mean
    # TODO! Also for the confidence intervals (CI)
    return imputation_dict
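A minimal sketch of the rescaling, with illustrative statistics; as the TODO notes, only the `'mean'` values are transformed, not the confidence intervals:

```python
import numpy as np

mean, std = 3.0, 2.0  # illustrative standardization statistics
imputation_dict = {"mean": np.array([-1.0, 0.0, 1.0])}  # standardized imputations

# Same inverse transform as destandardize_dict applies to the 'mean' key
imputation_dict["mean"] = imputation_dict["mean"] * std + mean
```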

destandardize_numpy

destandardize_numpy(
    X: ndarray, mean: float, std: float
) -> ndarray

Reverse standardization on a numpy array.

Applies the inverse transformation: X_original = X_standardized * std + mean.

PARAMETER DESCRIPTION
X

Standardized data array.

TYPE: ndarray

mean

Mean value used for original standardization.

TYPE: float

std

Standard deviation used for original standardization.

TYPE: float

RETURNS DESCRIPTION
ndarray

Destandardized data array in original scale.

Source code in src/preprocess/preprocess_data.py
def destandardize_numpy(X: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Reverse standardization on a numpy array.

    Applies the inverse transformation: X_original = X_standardized * std + mean.

    Parameters
    ----------
    X : np.ndarray
        Standardized data array.
    mean : float
        Mean value used for original standardization.
    std : float
        Standard deviation used for original standardization.

    Returns
    -------
    np.ndarray
        Destandardized data array in original scale.
    """
    logger.debug(
        "De-standardizing the numpy array with mean = {} and std = {}".format(
            mean, std
        )
    )
    return X * std + mean
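
A round-trip sketch showing that `X * std + mean` inverts z-score standardization when the same statistics are reused (the random data here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X_orig = rng.normal(10.0, 3.0, size=100)

# Standardize with NaN-aware stats, as get_standardization_stats does
mean, std = np.nanmean(X_orig), np.nanstd(X_orig)
X_std = (X_orig - mean) / std

# The inverse transform computed by destandardize_numpy
X_back = X_std * std + mean  # recovers X_orig up to floating-point error
```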

destandardize_for_imputation_metrics

destandardize_for_imputation_metrics(
    targets: ndarray,
    predictions: ndarray,
    preprocess_dict: Dict[str, Any],
) -> Tuple[ndarray, ndarray]

Destandardize targets and predictions with automatic scale detection.

Detects whether predictions and targets are on different scales (one destandardized, one not), corrects the mismatch, and returns both in the original scale.

PARAMETER DESCRIPTION
targets

Ground truth target values.

TYPE: ndarray

predictions

Model predictions.

TYPE: ndarray

preprocess_dict

Dictionary containing 'standardization' sub-dict with 'standardized', 'mean', and 'stdev' keys.

TYPE: dict

RETURNS DESCRIPTION
tuple

A tuple containing:

- targets : np.ndarray
  Destandardized target values.
- predictions : np.ndarray
  Destandardized prediction values.

Notes

If the absolute mean of the predictions exceeds that of the targets by more than a factor of 100, the function assumes the predictions were already destandardized and destandardizes only the targets.

Source code in src/preprocess/preprocess_data.py
def destandardize_for_imputation_metrics(
    targets: np.ndarray,
    predictions: np.ndarray,
    preprocess_dict: Dict[str, Any],
) -> Tuple[np.ndarray, np.ndarray]:
    """Destandardize targets and predictions with automatic scale detection.

    Detects if predictions and targets are on different scales (one
    destandardized, one not) and corrects accordingly before returning
    both in the original scale.

    Parameters
    ----------
    targets : np.ndarray
        Ground truth target values.
    predictions : np.ndarray
        Model predictions.
    preprocess_dict : dict
        Dictionary containing 'standardization' sub-dict with 'standardized',
        'mean', and 'stdev' keys.

    Returns
    -------
    tuple
        A tuple containing:
        - targets : np.ndarray
            Destandardized target values.
        - predictions : np.ndarray
            Destandardized prediction values.

    Notes
    -----
    If predictions are more than 100x larger than targets in absolute mean,
    assumes predictions were already destandardized and only destandardizes
    targets.
    """
    predictions_mean = np.nanmean(predictions)
    targets_mean = np.nanmean(targets)
    predictions_larger_ratio = abs(predictions_mean) / abs(targets_mean)
    if predictions_larger_ratio > 100:
        logger.debug(
            "Predictions are larger than targets by a factor of {}".format(
                predictions_larger_ratio
            )
        )
        logger.debug(
            "It seems that your predictions are inverse transformed (destandardized) and targets are not"
        )
        logger.debug(
            "Check if you have destandardized the predictions and targets correctly"
        )
        logger.debug("Destandardizing now the targets as well for you")
        targets = destandardize_numpy(
            targets,
            preprocess_dict["standardization"]["mean"],
            preprocess_dict["standardization"]["stdev"],
        )
    else:
        if preprocess_dict["standardization"]["standardized"]:
            targets = destandardize_numpy(
                targets,
                preprocess_dict["standardization"]["mean"],
                preprocess_dict["standardization"]["stdev"],
            )
            predictions = destandardize_numpy(
                predictions,
                preprocess_dict["standardization"]["mean"],
                preprocess_dict["standardization"]["stdev"],
            )

    return targets, predictions
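
A sketch of the 100x mean-ratio heuristic in isolation (the stats and arrays are hypothetical; the real function reads them from `preprocess_dict["standardization"]`):

```python
import numpy as np

# Hypothetical standardization stats (illustrative values)
mean, stdev = 500.0, 50.0

# Predictions already back in the original scale; targets still z-scored
predictions = np.array([500.0, 550.0, 450.0])
targets = np.array([0.1, 1.0, -0.5])

# Scale detection used by destandardize_for_imputation_metrics
ratio = abs(np.nanmean(predictions)) / abs(np.nanmean(targets))
if ratio > 100:
    # Only the targets need the inverse transform
    targets = targets * stdev + mean
```

Note that the heuristic compares means, so it can misfire when the original data is centered near zero; in that case both arrays should be kept on the same scale before calling the function.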

preprocess_utils

compute_stats_per_split

compute_stats_per_split(X, split_name)

Compute and log basic statistics for a data split.

Calculates mean and standard deviation using NaN-aware functions and logs the results for debugging purposes.

PARAMETER DESCRIPTION
X

Data array for which to compute statistics.

TYPE: ndarray

split_name

Name of the data split (e.g., 'train_gt', 'val_missing') used for logging context.

TYPE: str

RETURNS DESCRIPTION
dict

Dictionary containing 'mean' and 'std' statistics.

Notes

Train splits are expected to have near-perfect standardization (mean=0, std=1), while validation splits may deviate slightly. Missing data splits differ from ground truth due to masking applied after standardization.

Source code in src/preprocess/preprocess_utils.py
def compute_stats_per_split(X, split_name):
    """Compute and log basic statistics for a data split.

    Calculates mean and standard deviation using NaN-aware functions
    and logs the results for debugging purposes.

    Parameters
    ----------
    X : np.ndarray
        Data array for which to compute statistics.
    split_name : str
        Name of the data split (e.g., 'train_gt', 'val_missing')
        used for logging context.

    Returns
    -------
    dict
        Dictionary containing 'mean' and 'std' statistics.

    Notes
    -----
    Train splits are expected to have near-perfect standardization
    (mean=0, std=1), while validation splits may deviate slightly.
    Missing data splits differ from ground truth due to masking
    applied after standardization.
    """
    stats = {
        "mean": np.nanmean(X),
        "std": np.nanstd(X),
    }
    # You would expect the train split to have "perfect standardization" (mean=0, std=1), whereas the val split
    # is slightly off as the data is slightly different. Similarly the _missing is slightly different from the _gt
    # as the missingness masking is done after the standardization.
    logger.debug(
        "Data stats for the split_key {} | mean = {}, std = {}".format(
            split_name, stats["mean"], stats["std"]
        )
    )
    return stats
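
A quick sketch of the sanity check this enables: a standardized split with masked (NaN) entries should still report statistics close to mean=0, std=1 when computed with the NaN-aware functions (the data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(0.0, 1.0, size=10_000)  # standardized split
X[::50] = np.nan  # masking applied after standardization, as in _missing splits

# Same NaN-aware stats as compute_stats_per_split
stats = {"mean": np.nanmean(X), "std": np.nanstd(X)}
# stats stay near mean=0, std=1 despite the masked entries
```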