preprocess

Data preprocessing utilities.

Overview

Preprocessing functions for PLR data preparation.

preprocess_PLR

get_standardization_stats

get_standardization_stats(
    split: str, col_name: str, data_dicts_df: dict
)

Retrieve mean and standard deviation for standardization.

PARAMETER DESCRIPTION
split

Data split to compute statistics from (typically 'train').

TYPE: str

col_name

Column name in the data dictionary to standardize.

TYPE: str

data_dicts_df

Nested dictionary containing data arrays organized by split and column.

TYPE: dict

RETURNS DESCRIPTION
tuple

A tuple containing:

- mean : float
    NaN-aware mean of the specified column.
- std : float
    NaN-aware standard deviation of the specified column.

Source code in src/preprocess/preprocess_PLR.py
def get_standardization_stats(split: str, col_name: str, data_dicts_df: dict):
    """Retrieve mean and standard deviation for standardization.

    Parameters
    ----------
    split : str
        Data split to compute statistics from (typically 'train').
    col_name : str
        Column name in the data dictionary to standardize.
    data_dicts_df : dict
        Nested dictionary containing data arrays organized by split and column.

    Returns
    -------
    tuple
        A tuple containing:
        - mean : float
            NaN-aware mean of the specified column.
        - std : float
            NaN-aware standard deviation of the specified column.
    """
    logger.debug("Standardizing on split = {}, dict_key = {}".format(split, col_name))
    mean = np.nanmean(data_dicts_df[split]["data"][col_name])
    std = np.nanstd(data_dicts_df[split]["data"][col_name])
    return mean, std
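A minimal sketch of how these statistics are retrieved, assuming the nested `{split: {"data": {col_name: array}}}` layout described above (the dictionary contents here are purely illustrative):

```python
import numpy as np

# Illustrative nested dictionary in the {split: {"data": {col_name: array}}} layout
data_dicts_df = {
    "train": {"data": {"pupil": np.array([1.0, 2.0, np.nan, 3.0])}},
}

# NaN-aware statistics, as get_standardization_stats computes them
mean = np.nanmean(data_dicts_df["train"]["data"]["pupil"])
std = np.nanstd(data_dicts_df["train"]["data"]["pupil"])
```

Note that `np.nanstd` uses `ddof=0` by default, i.e. the population standard deviation over the non-NaN values.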

standardize_the_data_dict

standardize_the_data_dict(mean, stdev, data_dicts_df, cfg)

Apply standardization to all columns across all splits.

Transforms each data column using z-score normalization: X_standardized = (X - mean) / stdev.

PARAMETER DESCRIPTION
mean

Mean value for standardization.

TYPE: float

stdev

Standard deviation for standardization.

TYPE: float

data_dicts_df

Nested dictionary with structure {split: {'data': {col_name: array}}}.

TYPE: dict

cfg

Configuration dictionary (currently unused but kept for API consistency).

TYPE: DictConfig

RETURNS DESCRIPTION
dict

Updated data dictionary with standardized values.

Source code in src/preprocess/preprocess_PLR.py
def standardize_the_data_dict(mean, stdev, data_dicts_df, cfg):
    """Apply standardization to all columns across all splits.

    Transforms each data column using z-score normalization:
    X_standardized = (X - mean) / stdev.

    Parameters
    ----------
    mean : float
        Mean value for standardization.
    stdev : float
        Standard deviation for standardization.
    data_dicts_df : dict
        Nested dictionary with structure {split: {'data': {col_name: array}}}.
    cfg : DictConfig
        Configuration dictionary (currently unused but kept for API consistency).

    Returns
    -------
    dict
        Updated data dictionary with standardized values.
    """
    for split in data_dicts_df.keys():
        for col_name in data_dicts_df[split]["data"].keys():
            logger.debug("Standardizing column: {}".format(col_name))
            array_tmp = data_dicts_df[split]["data"][col_name]
            array_tmp = (array_tmp - mean) / stdev
            data_dicts_df[split]["data"][col_name] = array_tmp

    return data_dicts_df
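The loop above can be sketched with made-up train statistics; every split is transformed with the single `(mean, stdev)` pair, which in practice comes from the training split:

```python
import numpy as np

mean, stdev = 2.0, 1.0  # illustrative training-split statistics
data_dicts_df = {
    "train": {"data": {"pupil": np.array([1.0, 2.0, 3.0])}},
    "test": {"data": {"pupil": np.array([4.0, 5.0])}},
}

# Same z-score transformation as standardize_the_data_dict applies
for split in data_dicts_df:
    for col_name, arr in data_dicts_df[split]["data"].items():
        data_dicts_df[split]["data"][col_name] = (arr - mean) / stdev
```

Because the test split is scaled with train statistics, its standardized values need not have zero mean or unit variance; this is intentional and avoids leaking test-set information.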

destandardize_the_data_dict_for_featurization

destandardize_the_data_dict_for_featurization(
    split, split_dict, preprocess_dict, cfg
)

Destandardize data before feature extraction.

Reverses standardization to restore original scale, which is required for computing physiologically meaningful handcrafted features.

PARAMETER DESCRIPTION
split

Data split identifier ('train', 'val', or 'test').

TYPE: str

split_dict

Dictionary containing data for a single split.

TYPE: dict

preprocess_dict

Dictionary containing 'standardization' sub-dict with 'standardized', 'mean', and 'stdev' keys.

TYPE: dict

cfg

Configuration dictionary.

TYPE: DictConfig

RETURNS DESCRIPTION
dict

Deep copy of split_dict with destandardized data values.

Source code in src/preprocess/preprocess_PLR.py
def destandardize_the_data_dict_for_featurization(
    split, split_dict, preprocess_dict, cfg
):
    """Destandardize data before feature extraction.

    Reverses standardization to restore original scale, which is required
    for computing physiologically meaningful handcrafted features.

    Parameters
    ----------
    split : str
        Data split identifier ('train', 'val', or 'test').
    split_dict : dict
        Dictionary containing data for a single split.
    preprocess_dict : dict
        Dictionary containing 'standardization' sub-dict with 'standardized',
        'mean', and 'stdev' keys.
    cfg : DictConfig
        Configuration dictionary.

    Returns
    -------
    dict
        Deep copy of split_dict with destandardized data values.
    """
    if preprocess_dict["standardization"]["standardized"]:
        logger.info(
            "Destandardizing the data for featurization, split = {}".format(split)
        )
        mean = preprocess_dict["standardization"]["mean"]
        stdev = preprocess_dict["standardization"]["stdev"]
        dicts_out = deepcopy(split_dict)
        dicts_out = destandardize_the_split_dict(dicts_out, split, stdev, mean, cfg)
    else:
        logger.info("No standardization applied, so no destandardization needed")
        dicts_out = deepcopy(split_dict)
    return dicts_out

destandardize_the_split_dict

destandardize_the_split_dict(
    data_dicts_df, split, stdev, mean, cfg
)

Destandardize all non-mask columns in a split dictionary.

Applies inverse z-score transformation: X_original = X_standardized * stdev + mean.

PARAMETER DESCRIPTION
data_dicts_df

Dictionary containing 'data' sub-dict with column arrays.

TYPE: dict

split

Data split identifier (used for logging).

TYPE: str

stdev

Standard deviation used in original standardization.

TYPE: float

mean

Mean used in original standardization.

TYPE: float

cfg

Configuration dictionary (currently unused).

TYPE: DictConfig

RETURNS DESCRIPTION
dict

Updated dictionary with destandardized values.

Notes

The 'mask' column is skipped as it contains boolean/integer flags, not continuous values that were standardized.

Source code in src/preprocess/preprocess_PLR.py
def destandardize_the_split_dict(data_dicts_df, split, stdev, mean, cfg):
    """Destandardize all non-mask columns in a split dictionary.

    Applies inverse z-score transformation: X_original = X_standardized * stdev + mean.

    Parameters
    ----------
    data_dicts_df : dict
        Dictionary containing 'data' sub-dict with column arrays.
    split : str
        Data split identifier (used for logging).
    stdev : float
        Standard deviation used in original standardization.
    mean : float
        Mean used in original standardization.
    cfg : DictConfig
        Configuration dictionary (currently unused).

    Returns
    -------
    dict
        Updated dictionary with destandardized values.

    Notes
    -----
    The 'mask' column is skipped as it contains boolean/integer flags,
    not continuous values that were standardized.
    """
    for col_name in data_dicts_df["data"].keys():
        if col_name != "mask":
            # i.e. the inverse z-score transform
            logger.debug("DeStandardizing column: {}".format(col_name))
            array_tmp = data_dicts_df["data"][col_name]
            array_tmp = (array_tmp * stdev) + mean
            data_dicts_df["data"][col_name] = array_tmp
    return data_dicts_df
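A small sketch of the same logic with illustrative values, showing that the `mask` column passes through untouched:

```python
import numpy as np

mean, stdev = 2.0, 0.5  # illustrative statistics from the original standardization
split_dict = {
    "data": {
        "pupil": np.array([-2.0, 0.0, 2.0]),  # standardized signal
        "mask": np.array([1, 0, 1]),          # boolean/integer flags: never rescaled
    }
}

# Inverse z-score on every non-mask column, as in destandardize_the_split_dict
for col_name, arr in split_dict["data"].items():
    if col_name != "mask":
        split_dict["data"][col_name] = arr * stdev + mean
```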

standardize_data_dicts

standardize_data_dicts(data_dicts: dict, cfg: DictConfig)

Standardize all data dictionaries using training set statistics.

Computes mean and standard deviation from the training split and applies standardization across all splits. Stores computed statistics in the preprocess sub-dictionary.

PARAMETER DESCRIPTION
data_dicts

Main data dictionary containing 'df' with nested split data.

TYPE: dict

cfg

Configuration with PREPROCESS.col_name specifying which column to use for computing statistics.

TYPE: DictConfig

RETURNS DESCRIPTION
dict

Updated data dictionary with standardized values and added 'preprocess.standardization' metadata.

Source code in src/preprocess/preprocess_PLR.py
def standardize_data_dicts(data_dicts: dict, cfg: DictConfig):
    """Standardize all data dictionaries using training set statistics.

    Computes mean and standard deviation from the training split and applies
    standardization across all splits. Stores computed statistics in the
    preprocess sub-dictionary.

    Parameters
    ----------
    data_dicts : dict
        Main data dictionary containing 'df' with nested split data.
    cfg : DictConfig
        Configuration with PREPROCESS.col_name specifying which column
        to use for computing statistics.

    Returns
    -------
    dict
        Updated data dictionary with standardized values and added
        'preprocess.standardization' metadata.
    """
    mean, stdev = get_standardization_stats(
        split="train",
        col_name=cfg["PREPROCESS"]["col_name"],
        data_dicts_df=data_dicts["df"],
    )

    logger.info("Standardizing, mean = {}, stdev = {}".format(mean, stdev))
    data_dicts["df"] = standardize_the_data_dict(
        mean=mean, stdev=stdev, data_dicts_df=data_dicts["df"], cfg=cfg
    )

    if "preprocess" not in data_dicts:
        data_dicts["preprocess"] = {}
    data_dicts["preprocess"]["standardization"] = {
        "standardized": True,
        "mean": mean,
        "stdev": stdev,
    }

    return data_dicts
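The whole flow can be sketched end to end with a toy dictionary and config; key names follow the `cfg["PREPROCESS"]["col_name"]` access shown in the source, while the data values are made up:

```python
import numpy as np

cfg = {"PREPROCESS": {"col_name": "pupil"}}  # illustrative config
data_dicts = {"df": {"train": {"data": {"pupil": np.array([0.0, 2.0, 4.0])}}}}

col = cfg["PREPROCESS"]["col_name"]
train_col = data_dicts["df"]["train"]["data"][col]
mean = float(np.nanmean(train_col))
stdev = float(np.nanstd(train_col))

# Standardize every split with the train statistics (one split here)
for split in data_dicts["df"]:
    for name, arr in data_dicts["df"][split]["data"].items():
        data_dicts["df"][split]["data"][name] = (arr - mean) / stdev

# Record the statistics so the transform can be inverted later
data_dicts.setdefault("preprocess", {})["standardization"] = {
    "standardized": True, "mean": mean, "stdev": stdev,
}
```

Storing `mean` and `stdev` alongside the data is what later makes `destandardize_the_data_dict_for_featurization` possible.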

standardize_recons_arrays

standardize_recons_arrays(array_in, stdz_dict: dict)

Standardize reconstruction arrays using stored statistics.

PARAMETER DESCRIPTION
array_in

Input array to standardize.

TYPE: ndarray

stdz_dict

Dictionary containing 'mean' and 'stdev' for standardization.

TYPE: dict

RETURNS DESCRIPTION
ndarray

Standardized array (deep copy of input).

Source code in src/preprocess/preprocess_PLR.py
def standardize_recons_arrays(array_in, stdz_dict: dict):
    """Standardize reconstruction arrays using stored statistics.

    Parameters
    ----------
    array_in : np.ndarray
        Input array to standardize.
    stdz_dict : dict
        Dictionary containing 'mean' and 'stdev' for standardization.

    Returns
    -------
    np.ndarray
        Standardized array (deep copy of input).
    """
    array_out = deepcopy(array_in)
    array_out = array_out - stdz_dict["mean"]
    array_out = array_out / stdz_dict["stdev"]
    return array_out

preprocess_data_dicts

preprocess_data_dicts(data_dicts: dict, cfg: DictConfig)

Main preprocessing entry point for data dictionaries.

Applies configured preprocessing steps (currently only standardization) to the data dictionaries.

PARAMETER DESCRIPTION
data_dicts

Main data dictionary containing 'df' with nested split data.

TYPE: dict

cfg

Configuration with PREPROCESS settings, including 'standardize' flag.

TYPE: DictConfig

RETURNS DESCRIPTION
dict

Preprocessed data dictionary.

Source code in src/preprocess/preprocess_PLR.py
def preprocess_data_dicts(data_dicts: dict, cfg: DictConfig):
    """Main preprocessing entry point for data dictionaries.

    Applies configured preprocessing steps (currently only standardization)
    to the data dictionaries.

    Parameters
    ----------
    data_dicts : dict
        Main data dictionary containing 'df' with nested split data.
    cfg : DictConfig
        Configuration with PREPROCESS settings, including 'standardize' flag.

    Returns
    -------
    dict
        Preprocessed data dictionary.
    """
    if cfg["PREPROCESS"]["standardize"]:
        data_dicts = standardize_data_dicts(data_dicts=data_dicts, cfg=cfg)
    else:
        logger.info("No standardization applied")

    return data_dicts

preprocess_data

preprocess_PLR_data

preprocess_PLR_data(
    X: ndarray,
    preprocess_cfg: Union[Dict[str, Any], DictConfig],
    preprocess_dict: Optional[Dict[str, Any]] = None,
    data_filtering: str = "gt",
    split: str = "train",
) -> Tuple[ndarray, Dict[str, Any]]

Preprocess PLR data by applying standardization if configured.

PARAMETER DESCRIPTION
X

Input PLR data array to preprocess.

TYPE: ndarray

preprocess_cfg

Configuration dictionary containing preprocessing settings, including 'standardize' and 'use_gt_stats_for_raw' flags.

TYPE: dict

preprocess_dict

Dictionary to store/retrieve precomputed statistics. Default is None.

TYPE: dict DEFAULT: None

data_filtering

Type of data filtering applied ('gt' for ground truth, 'raw' for raw data). Default is 'gt'.

TYPE: str DEFAULT: 'gt'

split

Data split identifier ('train', 'val', or 'test'). Default is 'train'.

TYPE: str DEFAULT: 'train'

RETURNS DESCRIPTION
tuple

A tuple containing:

- X : np.ndarray
    The preprocessed data array.
- preprocess_dict : dict
    Updated preprocessing dictionary with computed statistics.

Source code in src/preprocess/preprocess_data.py
def preprocess_PLR_data(
    X: np.ndarray,
    preprocess_cfg: Union[Dict[str, Any], DictConfig],
    preprocess_dict: Optional[Dict[str, Any]] = None,
    data_filtering: str = "gt",
    split: str = "train",
) -> Tuple[np.ndarray, Dict[str, Any]]:
    """Preprocess PLR data by applying standardization if configured.

    Parameters
    ----------
    X : np.ndarray
        Input PLR data array to preprocess.
    preprocess_cfg : dict
        Configuration dictionary containing preprocessing settings,
        including 'standardize' and 'use_gt_stats_for_raw' flags.
    preprocess_dict : dict, optional
        Dictionary to store/retrieve precomputed statistics. Default is None.
    data_filtering : str, optional
        Type of data filtering applied ('gt' for ground truth, 'raw' for raw data).
        Default is 'gt'.
    split : str, optional
        Data split identifier ('train', 'val', or 'test'). Default is 'train'.

    Returns
    -------
    tuple
        A tuple containing:
        - X : np.ndarray
            The preprocessed data array.
        - preprocess_dict : dict
            Updated preprocessing dictionary with computed statistics.
    """
    if preprocess_dict is None:
        preprocess_dict = {}

    if preprocess_cfg["standardize"]:
        use_precomputed, mean, std, filterkey = if_use_precomputed(
            preprocess_dict, preprocess_cfg, split, data_filtering
        )
        if use_precomputed:
            X = standardize_with_precomputed_stats(
                X, preprocess_dict, data_filtering, filterkey, split
            )
        else:
            preprocess_dict, X = compute_stats_and_standardize(
                preprocess_dict, X, data_filtering, split
            )

    logger.debug(
        'Number of NaNs in the "{}" data: {}'.format(data_filtering, np.isnan(X).sum())
    )

    return X, preprocess_dict

if_use_precomputed

if_use_precomputed(
    preprocess_dict: Dict[str, Any],
    preprocess_cfg: Union[Dict[str, Any], DictConfig],
    split: str,
    data_filtering: str,
) -> Tuple[bool, Optional[float], Optional[float], str]

Determine whether to use precomputed standardization statistics.

PARAMETER DESCRIPTION
preprocess_dict

Dictionary containing previously computed statistics.

TYPE: dict

preprocess_cfg

Configuration dictionary with preprocessing settings.

TYPE: dict

split

Data split identifier ('train', 'val', or 'test').

TYPE: str

data_filtering

Type of data filtering ('gt' or 'raw').

TYPE: str

RETURNS DESCRIPTION
tuple

A tuple containing:

- use_precomputed : bool
    Whether to use precomputed statistics.
- mean : float or None
    Precomputed mean value if available.
- std : float or None
    Precomputed standard deviation if available.
- filterkey : str
    The key used to retrieve statistics from the dictionary.

Source code in src/preprocess/preprocess_data.py
def if_use_precomputed(
    preprocess_dict: Dict[str, Any],
    preprocess_cfg: Union[Dict[str, Any], DictConfig],
    split: str,
    data_filtering: str,
) -> Tuple[bool, Optional[float], Optional[float], str]:
    """Determine whether to use precomputed standardization statistics.

    Parameters
    ----------
    preprocess_dict : dict
        Dictionary containing previously computed statistics.
    preprocess_cfg : dict
        Configuration dictionary with preprocessing settings.
    split : str
        Data split identifier ('train', 'val', or 'test').
    data_filtering : str
        Type of data filtering ('gt' or 'raw').

    Returns
    -------
    tuple
        A tuple containing:
        - use_precomputed : bool
            Whether to use precomputed statistics.
        - mean : float or None
            Precomputed mean value if available.
        - std : float or None
            Precomputed standard deviation if available.
        - filterkey : str
            The key used to retrieve statistics from the dictionary.
    """
    mean, std, filterkey = None, None, "gt"
    if len(preprocess_dict) == 0:
        return False, mean, std, filterkey
    elif "standardize" in preprocess_dict:
        if preprocess_cfg["use_gt_stats_for_raw"]:
            logger.debug("Use mean&stdev from GT for raw data")
            if "gt" in preprocess_dict["standardize"]:
                mean = preprocess_dict["standardize"]["gt"]["mean"]
                std = preprocess_dict["standardize"]["gt"]["std"]
                log_stats_msg(mean, std, split, data_filtering, "precomputed")
                filterkey = "gt"
                return True, mean, std, filterkey
            else:
                return False, mean, std, filterkey
        else:
            raise NotImplementedError("Not implemented yet")

    # Fallback: dictionary is non-empty but holds no usable precomputed stats
    return False, mean, std, filterkey

log_stats_msg

log_stats_msg(
    mean: float,
    std: float,
    split: str,
    data_filtering: str,
    call_from: str = "precomputed",
) -> None

Log standardization statistics message for debugging.

PARAMETER DESCRIPTION
mean

Mean value of the data.

TYPE: float

std

Standard deviation of the data.

TYPE: float

split

Data split identifier ('train', 'val', or 'test').

TYPE: str

data_filtering

Type of data filtering ('gt' or 'raw').

TYPE: str

call_from

Source of the call, either 'precomputed' or 'standardize'. Default is 'precomputed'.

TYPE: str DEFAULT: 'precomputed'

RAISES DESCRIPTION
NotImplementedError

If call_from is neither 'precomputed' nor 'standardize'.

Source code in src/preprocess/preprocess_data.py
def log_stats_msg(
    mean: float,
    std: float,
    split: str,
    data_filtering: str,
    call_from: str = "precomputed",
) -> None:
    """Log standardization statistics message for debugging.

    Parameters
    ----------
    mean : float
        Mean value of the data.
    std : float
        Standard deviation of the data.
    split : str
        Data split identifier ('train', 'val', or 'test').
    data_filtering : str
        Type of data filtering ('gt' or 'raw').
    call_from : str, optional
        Source of the call, either 'precomputed' or 'standardize'.
        Default is 'precomputed'.

    Raises
    ------
    NotImplementedError
        If call_from is neither 'precomputed' nor 'standardize'.
    """
    if call_from == "precomputed":
        string = "Mean&Std already precomputed"
    elif call_from == "standardize":
        string = "STATS after standardization"
    else:
        raise NotImplementedError("Unknown call_from = {}".format(call_from))

    logger.debug(
        "{}: mean = {}, std = {}, split = {}, data_filtering = {}".format(
            string, mean, std, split, data_filtering
        )
    )

standardize_with_precomputed_stats

standardize_with_precomputed_stats(
    X: ndarray,
    preprocess_dict: Dict[str, Any],
    data_filtering: str,
    filterkey: str,
    split: str,
) -> ndarray

Standardize data using precomputed mean and standard deviation.

PARAMETER DESCRIPTION
X

Input data array to standardize.

TYPE: ndarray

preprocess_dict

Dictionary containing precomputed standardization statistics.

TYPE: dict

data_filtering

Type of data filtering ('gt' or 'raw').

TYPE: str

filterkey

Key to access the correct statistics in preprocess_dict.

TYPE: str

split

Data split identifier ('train', 'val', or 'test').

TYPE: str

RETURNS DESCRIPTION
ndarray

Standardized data array with zero mean and unit variance.

Source code in src/preprocess/preprocess_data.py
def standardize_with_precomputed_stats(
    X: np.ndarray,
    preprocess_dict: Dict[str, Any],
    data_filtering: str,
    filterkey: str,
    split: str,
) -> np.ndarray:
    """Standardize data using precomputed mean and standard deviation.

    Parameters
    ----------
    X : np.ndarray
        Input data array to standardize.
    preprocess_dict : dict
        Dictionary containing precomputed standardization statistics.
    data_filtering : str
        Type of data filtering ('gt' or 'raw').
    filterkey : str
        Key to access the correct statistics in preprocess_dict.
    split : str
        Data split identifier ('train', 'val', or 'test').

    Returns
    -------
    np.ndarray
        Standardized data array with zero mean and unit variance.
    """
    X = (X - preprocess_dict["standardize"][filterkey]["mean"]) / preprocess_dict[
        "standardize"
    ][filterkey]["std"]
    logger.debug(
        "Data has been standardized, mean = {}, std = {}".format(
            np.nanmean(X), np.nanstd(X)
        )
    )
    return X

compute_stats_and_standardize

compute_stats_and_standardize(
    preprocess_dict: Dict[str, Any],
    X: ndarray,
    data_filtering: str,
    split: str,
) -> Tuple[Dict[str, Any], ndarray]

Compute standardization statistics and apply standardization to data.

Fits a StandardScaler to the data, transforms it, and stores the computed mean and standard deviation in the preprocess dictionary.

PARAMETER DESCRIPTION
preprocess_dict

Dictionary to store the computed standardization statistics.

TYPE: dict

X

Input data array to standardize.

TYPE: ndarray

data_filtering

Type of data filtering ('gt' or 'raw').

TYPE: str

split

Data split identifier ('train', 'val', or 'test').

TYPE: str

RETURNS DESCRIPTION
tuple

A tuple containing:

- preprocess_dict : dict
    Updated dictionary with computed mean and std.
- X : np.ndarray
    Standardized data array.

Source code in src/preprocess/preprocess_data.py
def compute_stats_and_standardize(
    preprocess_dict: Dict[str, Any],
    X: np.ndarray,
    data_filtering: str,
    split: str,
) -> Tuple[Dict[str, Any], np.ndarray]:
    """Compute standardization statistics and apply standardization to data.

    Fits a StandardScaler to the data, transforms it, and stores the
    computed mean and standard deviation in the preprocess dictionary.

    Parameters
    ----------
    preprocess_dict : dict
        Dictionary to store the computed standardization statistics.
    X : np.ndarray
        Input data array to standardize.
    data_filtering : str
        Type of data filtering ('gt' or 'raw').
    split : str
        Data split identifier ('train', 'val', or 'test').

    Returns
    -------
    tuple
        A tuple containing:
        - preprocess_dict : dict
            Updated dictionary with computed mean and std.
        - X : np.ndarray
            Standardized data array.
    """
    no_samples = X.shape[0] * X.shape[1]
    scaler = StandardScaler()
    scaler.fit(X.reshape(no_samples, -1))
    X = scaler.transform(X.reshape(no_samples, -1)).reshape(X.shape)
    print_stdz_stats(scaler, split, data_filtering)

    if "standardize" not in preprocess_dict:
        preprocess_dict["standardize"] = {}

    if data_filtering not in preprocess_dict["standardize"]:
        preprocess_dict["standardize"][data_filtering] = {}

    preprocess_dict["standardize"][data_filtering]["mean"] = float(scaler.mean_)
    preprocess_dict["standardize"][data_filtering]["std"] = float(scaler.scale_)

    log_stats_msg(np.nanmean(X), np.nanstd(X), split, data_filtering, "standardize")

    return preprocess_dict, X
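The `reshape(no_samples, -1)` step flattens a 2D `(n_series, n_timepoints)` array into a single `(n, 1)` "feature" column, so the fitted `StandardScaler` holds one global mean/std over all elements (which is why `float(scaler.mean_)` works). A NumPy-only sketch of the equivalent computation, with illustrative data:

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])  # illustrative (n_series, n_timepoints) array
n = X.shape[0] * X.shape[1]

# One (n, 1) column -> one global mean/std, as StandardScaler would compute
flat = X.reshape(n, -1)
mean, std = float(flat.mean()), float(flat.std())
X_std = (X - mean) / std
```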

print_stdz_stats

print_stdz_stats(
    scaler: StandardScaler, split: str, data_filtering: str
) -> None

Print standardization statistics from a fitted scaler.

Logs the mean and scale values at INFO level for training ground truth, and at DEBUG level for other splits/filters to reduce log clutter.

PARAMETER DESCRIPTION
scaler

Fitted StandardScaler object containing mean_ and scale_ attributes.

TYPE: StandardScaler

split

Data split identifier ('train', 'val', or 'test').

TYPE: str

data_filtering

Type of data filtering ('gt' or 'raw').

TYPE: str

Source code in src/preprocess/preprocess_data.py
def print_stdz_stats(scaler: StandardScaler, split: str, data_filtering: str) -> None:
    """Print standardization statistics from a fitted scaler.

    Logs the mean and scale values at INFO level for training ground truth,
    and at DEBUG level for other splits/filters to reduce log clutter.

    Parameters
    ----------
    scaler : sklearn.preprocessing.StandardScaler
        Fitted StandardScaler object containing mean_ and scale_ attributes.
    split : str
        Data split identifier ('train', 'val', or 'test').
    data_filtering : str
        Type of data filtering ('gt' or 'raw').
    """
    if split == "train" and data_filtering == "gt":
        # Print only once the standardized stats to reduce clutter
        logger.info(
            "Standardized (split = {}, split_key = {}), mean = {}, std = {}".format(
                split, data_filtering, scaler.mean_, scaler.scale_
            )
        )
    else:
        logger.debug(
            "Standardized, mean = {}, std = {}".format(scaler.mean_, scaler.scale_)
        )

debug_triplet_stats

debug_triplet_stats(
    X_gt: ndarray,
    X_gt_missing: ndarray,
    X_raw: ndarray,
    split: str,
) -> None

Log debug statistics for the data filtering triplet.

Computes and logs mean, standard deviation, and NaN count for ground truth, ground truth with missing values, and raw data.

PARAMETER DESCRIPTION
X_gt

Ground truth data array.

TYPE: ndarray

X_gt_missing

Ground truth data with missing values (NaNs).

TYPE: ndarray

X_raw

Raw unprocessed data array.

TYPE: ndarray

split

Data split identifier ('train', 'val', or 'test').

TYPE: str

RETURNS DESCRIPTION
None

Source code in src/preprocess/preprocess_data.py
def debug_triplet_stats(
    X_gt: np.ndarray,
    X_gt_missing: np.ndarray,
    X_raw: np.ndarray,
    split: str,
) -> None:
    """Log debug statistics for the data filtering triplet.

    Computes and logs mean, standard deviation, and NaN count for
    ground truth, ground truth with missing values, and raw data.

    Parameters
    ----------
    X_gt : np.ndarray
        Ground truth data array.
    X_gt_missing : np.ndarray
        Ground truth data with missing values (NaNs).
    X_raw : np.ndarray
        Raw unprocessed data array.
    split : str
        Data split identifier ('train', 'val', or 'test').

    Returns
    -------
    None
    """

    def stats_per_split(X: np.ndarray, split: str) -> Dict[str, float]:
        logger.debug(
            "{}: mean = {}, std = {}, no_NaN = {}".format(
                split, np.nanmean(X), np.nanstd(X), np.isnan(X).sum()
            )
        )
        return {"mean": np.nanmean(X), "std": np.nanstd(X), "no_NaN": np.isnan(X).sum()}

    logger.debug("DEBUG FOR THE 'FILTERING TRIPLET', split = {}:".format(split))
    stats_per_split(X_gt, "GT")
    stats_per_split(X_gt_missing, "GT_MISSING")
    stats_per_split(X_raw, "RAW")

    return None

destandardize_for_imputation_metric

destandardize_for_imputation_metric(
    targets: ndarray,
    predictions: ndarray,
    stdz_dict: Dict[str, Any],
) -> Tuple[ndarray, ndarray]

Destandardize targets and predictions for computing imputation metrics.

Reverses the standardization transformation to compute metrics in the original data scale.

PARAMETER DESCRIPTION
targets

Ground truth target values (potentially standardized).

TYPE: ndarray

predictions

Model predictions (potentially standardized).

TYPE: ndarray

stdz_dict

Standardization dictionary containing 'standardized' boolean, 'mean', and 'stdev' values.

TYPE: dict

RETURNS DESCRIPTION
tuple

A tuple containing:

- targets : np.ndarray
    Destandardized target values.
- predictions : np.ndarray
    Destandardized prediction values.

Source code in src/preprocess/preprocess_data.py
def destandardize_for_imputation_metric(
    targets: np.ndarray,
    predictions: np.ndarray,
    stdz_dict: Dict[str, Any],
) -> Tuple[np.ndarray, np.ndarray]:
    """Destandardize targets and predictions for computing imputation metrics.

    Reverses the standardization transformation to compute metrics in
    the original data scale.

    Parameters
    ----------
    targets : np.ndarray
        Ground truth target values (potentially standardized).
    predictions : np.ndarray
        Model predictions (potentially standardized).
    stdz_dict : dict
        Standardization dictionary containing 'standardized' boolean,
        'mean', and 'stdev' values.

    Returns
    -------
    tuple
        A tuple containing:
        - targets : np.ndarray
            Destandardized target values.
        - predictions : np.ndarray
            Destandardized prediction values.
    """
    if stdz_dict["standardized"]:
        targets = destandardize_numpy(targets, stdz_dict["mean"], stdz_dict["stdev"])
        predictions = destandardize_numpy(
            predictions, stdz_dict["mean"], stdz_dict["stdev"]
        )

    return targets, predictions

destandardize_dict

destandardize_dict(
    imputation_dict: Dict[str, Any], mean: float, std: float
) -> Dict[str, Any]

Destandardize the mean values in an imputation results dictionary.

PARAMETER DESCRIPTION
imputation_dict

Dictionary containing imputation results with a 'mean' key.

TYPE: dict

mean

Mean value used for original standardization.

TYPE: float

std

Standard deviation used for original standardization.

TYPE: float

RETURNS DESCRIPTION
dict

Updated imputation dictionary with destandardized mean values.

Notes

TODO: Confidence intervals (CI) are not yet destandardized.

Source code in src/preprocess/preprocess_data.py
def destandardize_dict(
    imputation_dict: Dict[str, Any],
    mean: float,
    std: float,
) -> Dict[str, Any]:
    """Destandardize the mean values in an imputation results dictionary.

    Parameters
    ----------
    imputation_dict : dict
        Dictionary containing imputation results with a 'mean' key.
    mean : float
        Mean value used for original standardization.
    std : float
        Standard deviation used for original standardization.

    Returns
    -------
    dict
        Updated imputation dictionary with destandardized mean values.

    Notes
    -----
    TODO: Confidence intervals (CI) are not yet destandardized.
    """
    logger.debug(
        "De-standardizing the imputation results with mean = {} and std = {}".format(
            mean, std
        )
    )
    imputation_dict["mean"] = imputation_dict["mean"] * std + mean
    # TODO! Also for the confidence intervals (CI)
    return imputation_dict
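A minimal sketch of the rescaling, with illustrative statistics; as the TODO notes, only the `'mean'` values are transformed, not the confidence intervals:

```python
import numpy as np

mean, std = 3.0, 2.0  # illustrative standardization statistics
imputation_dict = {"mean": np.array([-1.0, 0.0, 1.0])}  # standardized imputations

# Same inverse transform as destandardize_dict applies to the 'mean' key
imputation_dict["mean"] = imputation_dict["mean"] * std + mean
```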

destandardize_numpy

destandardize_numpy(
    X: ndarray, mean: float, std: float
) -> ndarray

Reverse standardization on a numpy array.

Applies the inverse transformation: X_original = X_standardized * std + mean.

PARAMETER DESCRIPTION
X

Standardized data array.

TYPE: ndarray

mean

Mean value used for original standardization.

TYPE: float

std

Standard deviation used for original standardization.

TYPE: float

RETURNS DESCRIPTION
ndarray

Destandardized data array in original scale.

Source code in src/preprocess/preprocess_data.py
def destandardize_numpy(X: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Reverse standardization on a numpy array.

    Applies the inverse transformation: X_original = X_standardized * std + mean.

    Parameters
    ----------
    X : np.ndarray
        Standardized data array.
    mean : float
        Mean value used for original standardization.
    std : float
        Standard deviation used for original standardization.

    Returns
    -------
    np.ndarray
        Destandardized data array in original scale.
    """
    logger.debug(
        "De-standardizing the numpy array with mean = {} and std = {}".format(
            mean, std
        )
    )
    return X * std + mean
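
A round-trip sketch showing that `X * std + mean` inverts z-score standardization when the same statistics are reused (the random data here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X_orig = rng.normal(10.0, 3.0, size=100)

# Standardize with NaN-aware stats, as get_standardization_stats does
mean, std = np.nanmean(X_orig), np.nanstd(X_orig)
X_std = (X_orig - mean) / std

# The inverse transform computed by destandardize_numpy
X_back = X_std * std + mean  # recovers X_orig up to floating-point error
```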

destandardize_for_imputation_metrics

destandardize_for_imputation_metrics(
    targets: ndarray,
    predictions: ndarray,
    preprocess_dict: Dict[str, Any],
) -> Tuple[ndarray, ndarray]

Destandardize targets and predictions with automatic scale detection.

Detects whether predictions and targets are on different scales (one destandardized, one not), corrects the mismatch, and returns both in the original scale.

PARAMETER DESCRIPTION
targets

Ground truth target values.

TYPE: ndarray

predictions

Model predictions.

TYPE: ndarray

preprocess_dict

Dictionary containing 'standardization' sub-dict with 'standardized', 'mean', and 'stdev' keys.

TYPE: dict

RETURNS DESCRIPTION
tuple

A tuple containing:

- targets : np.ndarray
  Destandardized target values.
- predictions : np.ndarray
  Destandardized prediction values.

Notes

If the absolute mean of the predictions exceeds that of the targets by more than a factor of 100, the function assumes the predictions were already destandardized and destandardizes only the targets.

Source code in src/preprocess/preprocess_data.py
def destandardize_for_imputation_metrics(
    targets: np.ndarray,
    predictions: np.ndarray,
    preprocess_dict: Dict[str, Any],
) -> Tuple[np.ndarray, np.ndarray]:
    """Destandardize targets and predictions with automatic scale detection.

    Detects if predictions and targets are on different scales (one
    destandardized, one not) and corrects accordingly before returning
    both in the original scale.

    Parameters
    ----------
    targets : np.ndarray
        Ground truth target values.
    predictions : np.ndarray
        Model predictions.
    preprocess_dict : dict
        Dictionary containing 'standardization' sub-dict with 'standardized',
        'mean', and 'stdev' keys.

    Returns
    -------
    tuple
        A tuple containing:
        - targets : np.ndarray
            Destandardized target values.
        - predictions : np.ndarray
            Destandardized prediction values.

    Notes
    -----
    If predictions are more than 100x larger than targets in absolute mean,
    assumes predictions were already destandardized and only destandardizes
    targets.
    """
    predictions_mean = np.nanmean(predictions)
    targets_mean = np.nanmean(targets)
    predictions_larger_ratio = abs(predictions_mean) / abs(targets_mean)
    if predictions_larger_ratio > 100:
        logger.debug(
            "Predictions are larger than targets by a factor of {}".format(
                predictions_larger_ratio
            )
        )
        logger.debug(
            "It seems that your predictions are inverse transformed (destandardized) and targets are not"
        )
        logger.debug(
            "Check if you have destandardized the predictions and targets correctly"
        )
        logger.debug("Destandardizing now the targets as well for you")
        targets = destandardize_numpy(
            targets,
            preprocess_dict["standardization"]["mean"],
            preprocess_dict["standardization"]["stdev"],
        )
    else:
        if preprocess_dict["standardization"]["standardized"]:
            targets = destandardize_numpy(
                targets,
                preprocess_dict["standardization"]["mean"],
                preprocess_dict["standardization"]["stdev"],
            )
            predictions = destandardize_numpy(
                predictions,
                preprocess_dict["standardization"]["mean"],
                preprocess_dict["standardization"]["stdev"],
            )

    return targets, predictions
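
A sketch of the 100x mean-ratio heuristic in isolation (the stats and arrays are hypothetical; the real function reads them from `preprocess_dict["standardization"]`):

```python
import numpy as np

# Hypothetical standardization stats (illustrative values)
mean, stdev = 500.0, 50.0

# Predictions already back in the original scale; targets still z-scored
predictions = np.array([500.0, 550.0, 450.0])
targets = np.array([0.1, 1.0, -0.5])

# Scale detection used by destandardize_for_imputation_metrics
ratio = abs(np.nanmean(predictions)) / abs(np.nanmean(targets))
if ratio > 100:
    # Only the targets need the inverse transform
    targets = targets * stdev + mean
```

Note that the heuristic compares means, so it can misfire when the original data is centered near zero; in that case both arrays should be kept on the same scale before calling the function.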

preprocess_utils

compute_stats_per_split

compute_stats_per_split(X, split_name)

Compute and log basic statistics for a data split.

Calculates mean and standard deviation using NaN-aware functions and logs the results for debugging purposes.

PARAMETER DESCRIPTION
X

Data array for which to compute statistics.

TYPE: ndarray

split_name

Name of the data split (e.g., 'train_gt', 'val_missing') used for logging context.

TYPE: str

RETURNS DESCRIPTION
dict

Dictionary containing 'mean' and 'std' statistics.

Notes

Train splits are expected to have near-perfect standardization (mean=0, std=1), while validation splits may deviate slightly. Missing data splits differ from ground truth due to masking applied after standardization.

Source code in src/preprocess/preprocess_utils.py
def compute_stats_per_split(X, split_name):
    """Compute and log basic statistics for a data split.

    Calculates mean and standard deviation using NaN-aware functions
    and logs the results for debugging purposes.

    Parameters
    ----------
    X : np.ndarray
        Data array for which to compute statistics.
    split_name : str
        Name of the data split (e.g., 'train_gt', 'val_missing')
        used for logging context.

    Returns
    -------
    dict
        Dictionary containing 'mean' and 'std' statistics.

    Notes
    -----
    Train splits are expected to have near-perfect standardization
    (mean=0, std=1), while validation splits may deviate slightly.
    Missing data splits differ from ground truth due to masking
    applied after standardization.
    """
    stats = {
        "mean": np.nanmean(X),
        "std": np.nanstd(X),
    }
    # You would expect the train split to have "perfect standardization" (mean=0, std=1), whereas the val split
    # is slightly off as the data is slightly different. Similarly the _missing is slightly different from the _gt
    # as the missingness masking is done after the standardization.
    logger.debug(
        "Data stats for the split_key {} | mean = {}, std = {}".format(
            split_name, stats["mean"], stats["std"]
        )
    )
    return stats
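
A quick sketch of the sanity check this enables: a standardized split with masked (NaN) entries should still report statistics close to mean=0, std=1 when computed with the NaN-aware functions (the data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(0.0, 1.0, size=10_000)  # standardized split
X[::50] = np.nan  # masking applied after standardization, as in _missing splits

# Same NaN-aware stats as compute_stats_per_split
stats = {"mean": np.nanmean(X), "std": np.nanstd(X)}
# stats stay near mean=0, std=1 despite the masked entries
```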