imputation¶
Signal reconstruction and imputation methods.
Overview¶
This module provides 7 imputation methods:
- Ground Truth: Human-corrected signals
- Deep Learning: SAITS, CSDI, TimesNet
- Foundation Models: MOMENT
- Traditional: Linear interpolation, MissForest
Main Entry Point¶
flow_imputation¶

flow_imputation¶

Execute the PLR imputation flow with hyperparameter sweep.

Orchestrates the imputation pipeline by iterating over all combinations of data sources (from outlier detection) and hyperparameter configurations. Also handles ensembling of trained imputation models.

| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` | Full Hydra configuration containing PREFECT flow names, MLFLOW settings, and hyperparameter configurations for imputation models. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Results from the imputation pipeline (implicitly via MLflow logging). |
Notes

The flow performs the following steps:

1. Define hyperparameter groups from configuration
2. Download outlier detection outputs from MLflow
3. Define data sources (outlier detection + ground truth)
4. Run imputation for each source x hyperparameter combination
5. Recompute metrics for all submodels
6. Create ensemble models from submodels
Source code in src/imputation/flow_imputation.py
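The sweep structure described above can be sketched as follows. This is an illustrative sketch only (the names `enumerate_runs`, `sources`, and `hparam_groups` are assumptions, not the flow's actual internals): every data source is paired with every hyperparameter configuration, so `n_sources x n_configs` imputation runs are launched.

```python
from itertools import product

# Illustrative sketch of the source x hyperparameter sweep; each
# combination becomes one imputation run.
def enumerate_runs(sources: dict, hparam_groups: list) -> list:
    runs = []
    for (source_name, _source_data), hparams in product(sources.items(), hparam_groups):
        runs.append({"source": source_name, "hparams": hparams})
    return runs

# 2 sources x 2 hyperparameter groups -> 4 imputation runs
runs = enumerate_runs(
    sources={"outlier_detected": {}, "ground_truth": {}},
    hparam_groups=[{"lr": 1e-3}, {"lr": 1e-4}],
)
print(len(runs))  # 4
```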
Imputation Core¶
imputation_main¶

setup_PLR_worklow¶

setup_PLR_worklow(
    cfg: DictConfig, run_name: str
) -> Tuple[DictConfig, str, str, bool, Optional[Any], str]

Set up the PLR imputation workflow.

Configures the imputation pipeline:

1. Extract model name from config
2. Apply debug settings if enabled
3. Check for existing model runs

| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` | Full Hydra configuration. |
| `run_name` | MLflow run name. |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | (cfg, model_name, updated_name, train_ON, best_run, artifacts_dir): cfg is the updated configuration, model_name the imputation model name, updated_name the updated run name, train_ON whether to retrain, best_run the best existing run (if any), and artifacts_dir the directory for artifacts. |
Source code in src/imputation/imputation_main.py
mlflow_log_of_source_for_imputation¶

Log source data information to MLflow for imputation tracking.

Records the outlier detection run information that was used as input to the imputation step. This enables traceability of the preprocessing pipeline through MLflow.

| PARAMETER | DESCRIPTION |
|---|---|
| `source_data` | Source data dictionary containing an 'mlflow' key with run metadata and outlier detection results. If 'mlflow' is None, ground truth parameters are logged instead. |
| `cfg` | Full Hydra configuration containing OUTLIER_DETECTION settings. |

Notes

Logs either the outlier detection run ID and best metric, or None/0 for ground truth data where no upstream outlier detection was used.
Source code in src/imputation/imputation_main.py
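The traceability logic in the Notes above can be sketched as below. The parameter names (`outlier_run_id`, `outlier_best_metric`) are illustrative assumptions, and the dict is returned rather than passed to `mlflow.log_params()` so the branching is easy to test.

```python
# Sketch of the "run ID and best metric, or None/0" logging behavior;
# parameter names are assumptions of this sketch.
def source_params_for_logging(source_data: dict) -> dict:
    mlflow_meta = source_data.get("mlflow")
    if mlflow_meta is None:
        # ground truth input: no upstream outlier detection run was used
        return {"outlier_run_id": None, "outlier_best_metric": 0}
    return {
        "outlier_run_id": mlflow_meta["run_id"],
        "outlier_best_metric": mlflow_meta["best_metric"],
    }
```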
imputation_model_selector¶

imputation_model_selector(
    source_data: Dict[str, Any],
    cfg: DictConfig,
    model_name: str,
    run_name: str,
    artifacts_dir: str,
    experiment_name: str,
) -> Tuple[Optional[Any], Optional[Dict[str, Any]]]

Select and execute an imputation method.

Dispatches to the appropriate imputation implementation. Supports:

- Deep learning: SAITS, CSDI, TimesNet (via PyPOTS)
- Foundation models: MOMENT
- Traditional: MissForest

| PARAMETER | DESCRIPTION |
|---|---|
| `source_data` | Data from the outlier detection stage with signals and masks. |
| `cfg` | Full Hydra configuration. |
| `model_name` | Imputation method name. One of: 'SAITS', 'CSDI', 'TimesNet', 'MISSFOREST', 'MOMENT'. |
| `run_name` | MLflow run name. |
| `artifacts_dir` | Directory for saving artifacts. |
| `experiment_name` | MLflow experiment name. |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | (model, imputation_artifacts) where model is the trained imputation model and imputation_artifacts is a dict with imputed data and metrics. |

| RAISES | DESCRIPTION |
|---|---|
| `NotImplementedError` | If model_name is not supported. |
Source code in src/imputation/imputation_main.py
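The dispatch-and-raise pattern described above can be sketched as follows. The backend labels are placeholders standing in for the PyPOTS / MOMENT / MissForest training routines, not the project's actual implementations.

```python
# Minimal sketch of the model-name dispatch with a NotImplementedError
# fallback for unsupported names.
def select_imputer_backend(model_name: str) -> str:
    backends = {
        "SAITS": "pypots",
        "CSDI": "pypots",
        "TimesNet": "pypots",
        "MOMENT": "foundation",
        "MISSFOREST": "traditional",
    }
    if model_name not in backends:
        raise NotImplementedError(f"Unsupported imputation model: {model_name}")
    return backends[model_name]
```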
imputation_PLR_workflow¶

imputation_PLR_workflow(
    cfg: DictConfig,
    source_name: str,
    source_data: Dict[str, Any],
    run_name: str,
    experiment_name: str,
    _visualize: bool = False,
) -> Optional[Dict[str, Any]]

Execute the PLR imputation workflow.

Main entry point for training or loading imputation models on PLR data. Handles workflow setup, model training/loading, and artifact management.

| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` | Full Hydra configuration including MODELS, EXPERIMENT, and MLFLOW settings. |
| `source_name` | Name identifier for the data source (e.g., outlier detection method name). |
| `source_data` | Data dictionary from outlier detection containing signals, masks, and metadata. |
| `run_name` | MLflow run name for this imputation experiment. |
| `experiment_name` | MLflow experiment name to log results under. |
| `_visualize` | Whether to generate visualizations (currently unused). Default is False. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Imputation artifacts containing imputed signals, metrics, and MLflow info. Returns None if training is skipped and no pre-computed results are loaded. |
Notes
The workflow checks for existing trained models and can skip retraining
if train_ON is False based on configuration and existing MLflow runs.
Source code in src/imputation/imputation_main.py
imputation_utils¶

create_imputation_df¶

Create a Polars DataFrame from imputation artifacts for visualization.

Combines baseline PLR data with imputed data and exports to DuckDB for downstream analysis and visualization.

| PARAMETER | DESCRIPTION |
|---|---|
| `imputer_artifacts` | Dictionary containing imputation results per model, with 'mlflow' metadata for each model. |
| `data_df` | Original PLR data DataFrame with subject codes and time series. |
| `cfg` | Full Hydra configuration including DATA and MLFLOW settings. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Combined DataFrame with imputed values, model identifiers, and original data columns. |
Source code in src/imputation/imputation_utils.py
create_imputation_plot_df¶

Create a DataFrame containing imputation results per model, split, and subject.

Iterates through all combinations of models, splits, and split keys to construct a unified DataFrame suitable for plotting and analysis.

| PARAMETER | DESCRIPTION |
|---|---|
| `data_for_features` | Nested dictionary with structure {model: {split: {split_key: data}}}, containing imputed values and metadata. |
| `data_df` | Original PLR data DataFrame with subject information. |
| `cfg` | Configuration containing DATA.PLR_length for validation. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Combined DataFrame with all imputation results, reordered to match the initial column order. |

| RAISES | DESCRIPTION |
|---|---|
| `AssertionError` | If the row count is not a multiple of PLR_length. |
Source code in src/imputation/imputation_utils.py
concatenate_imputation_dfs¶

Concatenate imputation DataFrames with proper type casting.

Processes and combines DataFrames for imputation results, ensuring consistent column types and naming conventions.

| PARAMETER | DESCRIPTION |
|---|---|
| `df_list` | List of two Polars DataFrames [existing_df, new_df] to concatenate. The first may be empty; the second is processed before concatenation. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Vertically concatenated DataFrame with consistent column types. |

| RAISES | DESCRIPTION |
|---|---|
| `Exception` | If concatenation fails due to schema mismatches. |
Source code in src/imputation/imputation_utils.py
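The cast-then-stack pattern described above can be sketched as below. The project concatenates Polars DataFrames; this sketch uses pandas for portability, and the column names and schema are assumptions of the sketch.

```python
import pandas as pd

# Sketch: cast both frames to a shared schema, then stack vertically,
# handling an empty first frame gracefully.
def concat_imputation_frames(existing: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    schema = {"value": "float64", "model": "object"}  # assumed columns
    new = new.astype(schema)
    if existing.empty:
        return new
    return pd.concat([existing.astype(schema), new], ignore_index=True)
```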
create_subjects_df¶

Create a DataFrame for all subjects from imputation subplot data.

Combines time series data with subject metadata for all subjects in the subplot dictionary.

| PARAMETER | DESCRIPTION |
|---|---|
| `subplot_dict` | Dictionary containing 'data' with imputation arrays (mean, CI, etc.) and 'metadata' with subject codes. |
| `data_df` | Original PLR data DataFrame for looking up subject information. |
| `cfg` | Configuration (unused but kept for interface consistency). |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | (df_out, size_debug) where df_out is the combined DataFrame and size_debug is a dict with 'no_subjects' and 'no_timepoints'. |

| RAISES | DESCRIPTION |
|---|---|
| `AssertionError` | If timepoint counts don't match expected values. |
Source code in src/imputation/imputation_utils.py
get_subject_datadf¶

Extract time series data for a specific subject from the DataFrame.

| PARAMETER | DESCRIPTION |
|---|---|
| `data_df` | Full PLR data DataFrame containing all subjects. |
| `subject_code` | Unique identifier for the subject to extract. |
| `no_timepoints` | Expected number of timepoints for validation. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | DataFrame containing only the specified subject's time series. |

| RAISES | DESCRIPTION |
|---|---|
| `AssertionError` | If the number of rows doesn't match the expected number of timepoints. |
Source code in src/imputation/imputation_utils.py
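The extract-and-validate behavior described above can be sketched as follows (pandas is used for illustration, and the 'subject_code' column name is an assumption of this sketch):

```python
import pandas as pd

# Sketch: filter to one subject, then assert the row count matches the
# expected number of timepoints.
def subject_slice(data_df: pd.DataFrame, subject_code: str, no_timepoints: int) -> pd.DataFrame:
    out = data_df[data_df["subject_code"] == subject_code]
    assert len(out) == no_timepoints, (
        f"expected {no_timepoints} rows for {subject_code}, got {len(out)}"
    )
    return out
```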
add_ts_cols¶

add_ts_cols(
    subplot_dict: dict,
    df_out: DataFrame,
    idx: int,
    no_timepoints: int,
    add_as_list: bool = True,
)

Add time series columns from subplot data to a DataFrame.

Extracts and adds imputation data (mean, CI bounds, etc.) for a specific subject index to the output DataFrame.

| PARAMETER | DESCRIPTION |
|---|---|
| `subplot_dict` | Dictionary containing 'data' with arrays keyed by time series type. |
| `df_out` | Output DataFrame to add columns to. |
| `idx` | Subject index to extract data for. |
| `no_timepoints` | Expected number of timepoints for validation. |
| `add_as_list` | If True, convert arrays to lists before adding (helps with Polars compatibility issues). Default is True. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | DataFrame with added time series columns. |

| RAISES | DESCRIPTION |
|---|---|
| `AssertionError` | If the number of timepoints doesn't match the expected value. |
Source code in src/imputation/imputation_utils.py
add_loop_keys¶

Add model, split, and split_key identifiers as columns to a DataFrame.

| PARAMETER | DESCRIPTION |
|---|---|
| `model` | Name of the imputation model. |
| `split` | Data split identifier (e.g., 'train', 'test'). |
| `split_key` | Additional split key identifier. |
| `df_tmp` | DataFrame to add identifier columns to. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | DataFrame with added 'model', 'split', and 'split_key' columns. |
Source code in src/imputation/imputation_utils.py
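Tagging a result frame with its loop identifiers, as described above, amounts to adding three constant columns. A minimal sketch (pandas is used for illustration; the project works with Polars):

```python
import pandas as pd

# Sketch: attach the model/split/split_key identifiers as constant columns.
def tag_loop_keys(model: str, split: str, split_key: str, df_tmp: pd.DataFrame) -> pd.DataFrame:
    return df_tmp.assign(model=model, split=split, split_key=split_key)
```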
rename_ci_cols¶

Rename confidence interval columns to standard names.

Normalizes column names for imputation confidence intervals to the consistent 'ci_pos' and 'ci_neg' names.

| PARAMETER | DESCRIPTION |
|---|---|
| `df` | DataFrame with potentially inconsistent CI column names. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | DataFrame with standardized CI column names. |
Notes
This is a temporary fix for column naming inconsistencies that should be harmonized upstream.
Source code in src/imputation/imputation_utils.py
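The normalization described above can be sketched as a rename map applied only to columns that are present. The left-hand names are assumptions about the inconsistent upstream names; only the targets 'ci_pos' and 'ci_neg' come from the source.

```python
import pandas as pd

# Assumed source names mapped to the standard targets from the docstring.
CI_RENAMES = {"imputation_ci_pos": "ci_pos", "imputation_ci_neg": "ci_neg"}

def normalize_ci_cols(df: pd.DataFrame) -> pd.DataFrame:
    present = {old: new for old, new in CI_RENAMES.items() if old in df.columns}
    return df.rename(columns=present)
```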
get_mlflow_cfgs_from_imputation_artifacts¶

Extract MLflow configurations from imputation artifacts.

| PARAMETER | DESCRIPTION |
|---|---|
| `imputer_artifacts` | Dictionary of imputation results keyed by model name, each containing 'mlflow' metadata. |
| `cfg` | Configuration (unused but kept for interface consistency). |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary mapping model names to their MLflow configurations. |
Source code in src/imputation/imputation_utils.py
export_imputation_df¶

Export the imputation DataFrame to DuckDB and log to MLflow.

Creates a DuckDB database for each model's imputation results and logs it as an artifact to the corresponding MLflow run.

| PARAMETER | DESCRIPTION |
|---|---|
| `df` | Combined imputation DataFrame with a model column for filtering. |
| `mlflow_cfgs` | Dictionary mapping model names to MLflow configurations. |
| `cfg` | Configuration for export settings. |
Source code in src/imputation/imputation_utils.py
get_imputation_results_from_mlflow_for_features¶

Retrieve imputation results from MLflow for feature computation.

Fetches the best hyperparameter configurations and their corresponding imputation results from MLflow for use in downstream featurization.

| PARAMETER | DESCRIPTION |
|---|---|
| `experiment_name` | MLflow experiment name to search for imputation runs. |
| `cfg` | Configuration for MLflow and model settings. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary mapping model names to their imputation results from MLflow. |
Source code in src/imputation/imputation_utils.py
Model Training¶
impute_with_models¶

evaluate_pypots_model¶

Evaluate a PyPOTS model by imputing missing values in the dataset.

Runs the trained PyPOTS model on the provided dataset to impute missing values, handling both deterministic and probabilistic outputs.

| PARAMETER | DESCRIPTION |
|---|---|
| `model` | Trained PyPOTS imputation model (e.g., SAITS, CSDI, TimesNet). |
| `dataset_dict` | Dataset dictionary containing an 'X' array with NaN values to impute. |
| `split` | Data split name ('train' or 'test'). |
| `cfg` | Configuration (unused but kept for interface consistency). |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Imputation results containing: 'imputation_dict', a dict with 'imputation' (mean, CI bounds) and 'indicating_mask' (boolean mask of originally missing values); and 'timing', the elapsed time for imputation in seconds. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If the imputation output has an unexpected shape (not 3D or 4D). |
Notes
CSDI generates a 4D output with samples dimension, which is reduced to 3D by taking the first sample. Other models produce 3D output directly.
Source code in src/imputation/impute_with_models.py
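The shape handling from the Notes above can be sketched with plain numpy. That the extra samples axis comes first is an assumption of this sketch; the source only states that the first sample is kept.

```python
import numpy as np

# Sketch: reduce a 4D probabilistic output (samples axis assumed first)
# to the 3D (samples, timepoints, features) form other models produce.
def to_3d(imputed: np.ndarray) -> np.ndarray:
    if imputed.ndim == 4:
        return imputed[0]  # keep the first sample
    if imputed.ndim == 3:
        return imputed
    raise ValueError(f"Unexpected imputation shape: {imputed.shape}")
```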
log_imputed_artifacts¶

Save imputation results locally and log to MLflow.

| PARAMETER | DESCRIPTION |
|---|---|
| `imputation` | Imputation results dictionary to save. |
| `model_name` | Name of the imputation model for file naming. |
| `cfg` | Configuration (unused but kept for interface consistency). |
| `run_id` | MLflow run ID to log artifacts to. |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | Path to the saved artifacts file. |
Source code in src/imputation/impute_with_models.py
pypots_imputer_wrapper¶

Wrapper to impute data across all splits using a PyPOTS model.

Iterates over data splits and applies the trained PyPOTS model to impute missing values in each split.

| PARAMETER | DESCRIPTION |
|---|---|
| `model` | Trained PyPOTS imputation model. |
| `model_name` | Name of the model for logging. |
| `dataset_dicts` | Dictionary of datasets keyed by split name (e.g., 'train', 'test'). |
| `source_data` | Source data dictionary (unused but kept for interface consistency). |
| `cfg` | Configuration for imputation settings. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary mapping split names to their imputation results. |
Source code in src/imputation/impute_with_models.py
train_utils¶

if_results_file_found¶

Check if a results file exists at the specified path.

| PARAMETER | DESCRIPTION |
|---|---|
| `results_path` | Full path to the results file. |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if the file exists, False otherwise. |
Source code in src/imputation/train_utils.py
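A minimal stdlib equivalent of the existence check described above:

```python
from pathlib import Path

# Sketch: return whether a results file exists at the given path.
def results_file_found(results_path: str) -> bool:
    return Path(results_path).is_file()
```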
create_imputation_dict¶

create_imputation_dict(
    imputation_mean: ndarray,
    preprocess: Dict[str, Any],
    X_missing: ndarray,
    cfg: DictConfig,
    end_time: Optional[float] = None,
) -> Dict[str, Any]

Create a standardized imputation result dictionary.

Formats imputation results into the common structure used across all imputation methods, with optional destandardization.

| PARAMETER | DESCRIPTION |
|---|---|
| `imputation_mean` | Imputed values array, shape (samples, timepoints) or (samples, timepoints, features). |
| `preprocess` | Preprocessing dictionary containing 'standardization' with mean and stdev. |
| `X_missing` | Original array with NaN values indicating missing points. |
| `cfg` | Configuration with PREPROCESS.standardize flag. |
| `end_time` | Time taken for imputation in seconds. Default is None. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Standardized imputation dictionary containing: 'imputation_dict', a dict with 'imputation' (mean, CI bounds) and 'indicating_mask'; and 'timing', the elapsed time if provided. |
Notes
If input is 2D, a third dimension is added to match expected (samples, timepoints, features) format. Destandardization is applied if configured.
Source code in src/imputation/train_utils.py
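The shaping and destandardization steps from the Notes above can be sketched with numpy. The exact keys inside the 'standardization' dict ('mean', 'stdev') are assumptions of this sketch.

```python
import numpy as np

# Sketch: add a trailing feature axis to 2D input and, if configured,
# map standardized values back to the original scale.
def shape_and_destandardize(imputation_mean: np.ndarray, preprocess: dict, standardize: bool) -> np.ndarray:
    if imputation_mean.ndim == 2:
        # (samples, timepoints) -> (samples, timepoints, features)
        imputation_mean = imputation_mean[:, :, np.newaxis]
    if standardize:
        stats = preprocess["standardization"]
        imputation_mean = imputation_mean * stats["stdev"] + stats["mean"]
    return imputation_mean
```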
create_imputation_dict_from_moment¶

create_imputation_dict_from_moment(
    imputation_mean: ndarray,
    indicating_mask: ndarray,
    imputation_time: float,
) -> Dict[str, Any]

Create an imputation dictionary from MOMENT model outputs.

Formats MOMENT-specific outputs into the common imputation structure.

| PARAMETER | DESCRIPTION |
|---|---|
| `imputation_mean` | Imputed values array from the MOMENT model. |
| `indicating_mask` | Boolean mask indicating originally missing values. |
| `imputation_time` | Time taken for imputation in seconds. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Standardized imputation dictionary with imputation values, mask, and timing information. |
Source code in src/imputation/train_utils.py
imputation_per_split_of_dict¶

imputation_per_split_of_dict(
    data_dicts: Dict[str, Any],
    df: DataFrame,
    preprocess: Dict[str, Any],
    model: Any,
    split: str,
    cfg: DictConfig,
) -> Dict[str, Any]

Apply an imputation model to a single data split.

Transforms the input DataFrame using the trained model and creates a standardized imputation result dictionary.

| PARAMETER | DESCRIPTION |
|---|---|
| `data_dicts` | Data dictionaries (unused but kept for interface consistency). |
| `df` | DataFrame with missing values (NaN) to impute. |
| `preprocess` | Preprocessing dictionary with standardization statistics. |
| `model` | Trained imputation model with a transform() method. |
| `split` | Split name for logging. |
| `cfg` | Configuration for imputation settings. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Imputation result dictionary with imputed values and timing. |
Source code in src/imputation/train_utils.py
train_torch_utils¶

create_torch_dataloader¶

create_torch_dataloader(
    data_dict_df: dict,
    task: str,
    model_cfg: DictConfig,
    split: str,
    cfg: DictConfig,
    model_name: str = None,
)

Create a PyTorch DataLoader for a specific data split.

Creates a TensorDataset from numpy arrays and wraps it in a DataLoader with the specified configuration.

| PARAMETER | DESCRIPTION |
|---|---|
| `data_dict_df` | Data dictionary containing arrays per split. |
| `task` | Task type ('imputation' or 'outlier_detection'). |
| `model_cfg` | Model configuration with TORCH.DATASET and TORCH.DATALOADER settings. |
| `split` | Split name ('train', 'test', 'outlier_train', 'outlier_test'). |
| `cfg` | Full Hydra configuration. |
| `model_name` | Model name for dataset creation. Default is None. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataLoader` | PyTorch DataLoader configured for the specified split. |

| RAISES | DESCRIPTION |
|---|---|
| `NotImplementedError` | If dataset_type is 'class' (not yet implemented). |
| `ValueError` | If dataset_type is unknown. |
Source code in src/imputation/train_torch_utils.py
create_torch_dataloaders¶

create_torch_dataloaders(
    task: str,
    model_name: str,
    data_dict_df: dict,
    model_cfg: DictConfig,
    cfg: DictConfig,
    create_outlier_dataloaders: bool = True,
)

Create PyTorch DataLoaders for all required data splits.

Creates train and test dataloaders, with optional outlier-specific dataloaders for anomaly detection tasks.

| PARAMETER | DESCRIPTION |
|---|---|
| `task` | Task type ('imputation' or 'outlier_detection'). |
| `model_name` | Model name for dataset creation. |
| `data_dict_df` | Data dictionary containing arrays per split. |
| `model_cfg` | Model configuration with TORCH settings. |
| `cfg` | Full Hydra configuration. |
| `create_outlier_dataloaders` | Whether to create outlier_train and outlier_test dataloaders. Default is True. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary mapping split names to DataLoaders. Contains 'train' and 'test', plus 'outlier_train' and 'outlier_test' if requested. |
Source code in src/imputation/train_torch_utils.py
MissForest¶
missforest_main¶

missforest_create_imputation_dicts¶

Create imputation dictionaries from MissForest model outputs.

Transforms MissForest imputation results into the standardized format used by PyPOTS models for downstream processing compatibility.

| PARAMETER | DESCRIPTION |
|---|---|
| `model` | Trained MissForest model. |
| `df_dict` | Dictionary of DataFrames keyed by split name ('train', 'test'). |
| `source_data` | Source data containing 'df' with data dictionaries per split and 'preprocess' with standardization statistics. |
| `cfg` | Configuration for imputation settings. |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary mapping split names to imputation results in a PyPOTS-compatible format. |
Source code in src/imputation/missforest_main.py
check_df¶

Validate and convert a DataFrame for MissForest compatibility.

Logs the number of NaN values and ensures all columns are float type to prevent type errors during MissForest fitting.

| PARAMETER | DESCRIPTION |
|---|---|
| `df` | Input DataFrame with potential NaN values. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | DataFrame with all columns cast to float type. |
Source code in src/imputation/missforest_main.py
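The validation described above can be sketched as below (a sketch, not the project's implementation; the function name is changed to make that clear):

```python
import pandas as pd

# Sketch: report the NaN count, then force float dtypes so MissForest
# fitting does not hit type errors on mixed-type columns.
def ensure_float_df(df: pd.DataFrame) -> pd.DataFrame:
    n_nan = int(df.isna().sum().sum())
    print(f"DataFrame contains {n_nan} NaN values")
    return df.astype(float)
```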
missforest_fit_script¶

Fit a MissForest model on training data.

| PARAMETER | DESCRIPTION |
|---|---|
| `train_df` | Training DataFrame with NaN values to learn imputation patterns from. |
| `cfg` | Configuration containing MODELS.MISSFOREST.MODEL parameters. |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | (model, results) where model is the fitted MissForest instance and results is a dict with 'train' timing in seconds. |
Source code in src/imputation/missforest_main.py
get_dataframes_from_dict_for_missforest¶

Convert source data dictionaries to DataFrames for MissForest.

Extracts arrays from source data, applies masks by setting masked values to NaN, and converts to pandas DataFrames.

| PARAMETER | DESCRIPTION |
|---|---|
| `source_data` | Source data containing 'df' with 'train' and 'test' splits, each having 'data' with 'X' arrays and 'mask' arrays. |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | (df_train, df_test) as pandas DataFrames with NaN values where the mask indicates missing data. |

| RAISES | DESCRIPTION |
|---|---|
| `AssertionError` | If input arrays contain unexpected NaN values or masking fails. |
Source code in src/imputation/missforest_main.py
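The masking step described above can be sketched for a single split. The dict layout ('data'/'X' and 'mask') follows the parameter description; that True in the mask marks a missing position is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Sketch: set masked positions to NaN, validate the input had none, and
# convert the array to a pandas DataFrame for MissForest.
def masked_dataframe(split: dict) -> pd.DataFrame:
    X = split["data"]["X"].astype(float).copy()
    mask = split["mask"].astype(bool)
    assert not np.isnan(X).any(), "input array should not already contain NaNs"
    X[mask] = np.nan
    return pd.DataFrame(X)
```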
missforest_main¶

missforest_main(
    source_data: dict,
    model_cfg: DictConfig,
    cfg: DictConfig,
    model_name: str = None,
    run_name: str = None,
)

See e.g.:

- El Badisy et al. (2024), https://doi.org/10.1186/s12874-024-02305-3
- Albu et al. (2024), https://arxiv.org/abs/2407.03379, for "missForestPredict"
- Original paper: Stekhoven and Bühlmann (2012), https://doi.org/10.1093/bioinformatics/btr597
- Python implementation: https://github.com/yuenshingyan/MissForest / https://pypi.org/project/MissForest/
Source code in src/imputation/missforest_main.py
Artifacts¶
imputation_log_artifacts¶

pypots_model_logger¶

pypots_model_logger(
    model_obj: Any,
    model_name: str,
    model_info: dict[str, Any],
    artifacts_dir: str,
) -> None

Log a PyPOTS model and its training artifacts to MLflow.

Copies the saved PyPOTS model file and training directory (including TensorBoard logs) to MLflow artifacts.

| PARAMETER | DESCRIPTION |
|---|---|
| `model_obj` | Trained PyPOTS model instance with a saving_path attribute. |
| `model_name` | Name of the model for file naming. |
| `model_info` | Model information dictionary containing 'num_params'. |
| `artifacts_dir` | Directory for artifacts (unused but kept for interface consistency). |

| RAISES | DESCRIPTION |
|---|---|
| `FileNotFoundError` | If the model file is not found at the expected path. |
Source code in src/imputation/imputation_log_artifacts.py
generic_pickled_model_logger¶

Save a model as a pickle and log to MLflow.

Generic model logger for models that don't have specialized saving methods.

| PARAMETER | DESCRIPTION |
|---|---|
| `model_obj` | Trained model instance to pickle. |
| `model_name` | Name of the model for file naming. |
| `artifacts_dir` | Directory to save the pickle file. |
Source code in src/imputation/imputation_log_artifacts.py
log_imputer_model¶

log_imputer_model(
    model_obj: Any,
    model_name: str,
    artifacts: dict[str, Any],
    artifacts_dir: str,
) -> None

Log an imputation model to MLflow using the appropriate method.

Dispatches to the correct logging method based on model type (PyPOTS, MissForest, MOMENT, etc.).

| PARAMETER | DESCRIPTION |
|---|---|
| `model_obj` | Trained imputation model instance. |
| `model_name` | Name of the model. |
| `artifacts` | Artifacts dictionary containing 'model_artifacts' with 'model_info'. |
| `artifacts_dir` | Directory for saving artifacts. |
Notes
PyPOTS models use their specialized save format. MissForest uses pickle. MOMENT models are not currently logged (only results are logged).
Source code in src/imputation/imputation_log_artifacts.py
log_the_imputation_results¶

log_the_imputation_results(
    imputation_artifacts: dict[str, Any],
    model_name: str,
    artifacts_dir: str,
    cfg: DictConfig,
    run_name: str,
) -> None

Save imputation results locally and log to MLflow.

| PARAMETER | DESCRIPTION |
|---|---|
| `imputation_artifacts` | Dictionary containing imputation results to save. |
| `model_name` | Name of the model for file naming. |
| `artifacts_dir` | Directory to save the pickle file. |
| `cfg` | Configuration (unused but kept for interface consistency). |
| `run_name` | Run name (unused but kept for interface consistency). |
Source code in src/imputation/imputation_log_artifacts.py
save_and_log_imputer_artifacts¶

save_and_log_imputer_artifacts(
    model: Any,
    imputation_artifacts: dict[str, Any],
    artifacts_dir: str,
    cfg: DictConfig,
    model_name: str,
    run_name: str,
) -> None

Save and log all imputation artifacts to MLflow.

Orchestrates the logging of model, results, and Hydra configuration artifacts to the associated MLflow run.

| PARAMETER | DESCRIPTION |
|---|---|
| `model` | Trained imputation model instance. |
| `imputation_artifacts` | Dictionary containing 'model_artifacts' with MLflow info and results. |
| `artifacts_dir` | Directory for saving artifacts locally. |
| `cfg` | Full Hydra configuration. |
| `model_name` | Name of the imputation model. |
| `run_name` | MLflow run name. |
Notes
Ends any active MLflow run, then starts a new run context to log artifacts. The run is ended after all artifacts are logged.