# summarization

Results summarization and analysis.

## Overview

Functions for summarizing and analyzing experiment results.
## flow_summarization

### get_summarization_data

Collect summarization data from all pipeline stages.

Gathers results from outlier detection, imputation, featurization, and classification experiments into a unified dictionary.
| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` | Configuration containing PREFECT.FLOW_NAMES for each stage.<br>TYPE: `DictConfig` |
| `experiment_name` | Name of the summary experiment.<br>TYPE: `str` |
| `summary_exp_name` | MLflow experiment name for summaries.<br>TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary with keys `'outlier_detection'`, `'imputation'`, `'featurization'`, and `'classification'`, each containing that stage's summarization data. |
Source code in src/summarization/flow_summarization.py
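A minimal sketch of the shape of the returned dictionary, with the stage keys taken from the description above (the per-stage payloads here are hypothetical placeholders):

```python
# Hypothetical illustration of the dictionary returned by
# get_summarization_data: one entry per pipeline stage.
summarization_data = {
    "outlier_detection": {"data_df": None, "mlflow_runs": None},
    "imputation": {"data_df": None, "mlflow_runs": None},
    "featurization": {"data_df": None, "mlflow_runs": None},
    "classification": {"data_df": None, "mlflow_runs": None},
}

# Downstream analysis can then iterate over the stages uniformly.
for stage, stage_data in summarization_data.items():
    print(stage, sorted(stage_data))
```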
### flow_summarization

Main summarization flow for the PLR pipeline.

Orchestrates the collection, analysis, and export of results from all pipeline stages. Initializes MLflow tracking and coordinates data import and analysis tasks.
| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` | Configuration dictionary containing:<br>- PREFECT.FLOW_NAMES: Experiment names for each stage<br>- SUMMARIZATION: Import/export settings<br>TYPE: `DictConfig` |
Source code in src/summarization/flow_summarization.py
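A hypothetical minimal configuration showing only the two top-level fields named above; the flow names and SUMMARIZATION keys below are illustrative placeholders, not the project's actual values (real configs are OmegaConf `DictConfig` objects, shown here as a plain dict):

```python
# Hypothetical sketch of the config fields flow_summarization reads.
# All concrete names and values here are placeholders.
cfg = {
    "PREFECT": {
        "FLOW_NAMES": {
            "outlier_detection": "plr-outlier-detection",
            "imputation": "plr-imputation",
            "featurization": "plr-featurization",
            "classification": "plr-classification",
        }
    },
    "SUMMARIZATION": {
        # Placeholder import/export settings
        "import_from_mlflow": False,
        "export_to_mlflow": True,
    },
}

print(sorted(cfg["PREFECT"]["FLOW_NAMES"]))
```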
## summarize_classification

### import_cls_artifacts

Import classification artifacts from MLflow runs.

Downloads bootstrap metrics and baseline model results for each classification run.
| PARAMETER | DESCRIPTION |
|---|---|
| `mlflow_runs` | DataFrame containing MLflow run metadata with columns `'run_id'`, `'tags.mlflow.runName'`, and `'params.model_name'`.<br>TYPE: `DataFrame` |
| `cfg` | Configuration dictionary.<br>TYPE: `DictConfig` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary mapping run names to their metrics:<br>- `'bootstrap'`: Bootstrap evaluation results<br>- `'baseline'`: Baseline model results (if available) |
Source code in src/summarization/summarize_classification.py
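A sketch of the traversal pattern this implies, using a toy runs DataFrame with the column names from the parameter description (the run and model names are made up, and the metric payloads are placeholders rather than real downloaded artifacts):

```python
import pandas as pd

# Toy stand-in for the MLflow runs DataFrame described above.
mlflow_runs = pd.DataFrame(
    {
        "run_id": ["abc123", "def456"],
        "tags.mlflow.runName": ["run-a", "run-b"],
        "params.model_name": ["logreg", "xgboost"],
    }
)

# import_cls_artifacts-style traversal: one metrics dict per run name.
metrics_per_run = {}
for _, row in mlflow_runs.iterrows():
    metrics_per_run[row["tags.mlflow.runName"]] = {
        "bootstrap": None,  # would hold the downloaded bootstrap metrics
        "baseline": None,   # present only if a baseline model was logged
    }

print(sorted(metrics_per_run))  # ['run-a', 'run-b']
```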
### get_classification_summary_data

Get summary data for a classification experiment.

Retrieves all MLflow runs from the classification experiment and imports their artifacts (bootstrap metrics, baseline results).
| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` | Configuration dictionary.<br>TYPE: `DictConfig` |
| `experiment_name` | Name of the classification experiment in MLflow.<br>TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary containing:<br>- `'data_df'`: Placeholder DataFrame<br>- `'mlflow_runs'`: DataFrame of all classification runs<br>- `'artifacts_dict_summary'`: Metrics dictionary per run |
Source code in src/summarization/summarize_classification.py
## summarization_data_wrangling

### export_summary_db_to_mlflow

```python
export_summary_db_to_mlflow(
    data: Dict[str, Any],
    db_path: str,
    artifact_path: str,
    summary_experiment_name: str,
    experiment_name: str,
    cfg: DictConfig,
) -> None
```

Export summary database and artifacts to MLflow.

Logs the DuckDB database file and artifacts pickle to MLflow, along with metadata about the number of unique runs.
| PARAMETER | DESCRIPTION |
|---|---|
| `data` | Summary data containing 'data_df' and 'artifacts_dict_summary'.<br>TYPE: `Dict[str, Any]` |
| `db_path` | Path to the DuckDB database file.<br>TYPE: `str` |
| `artifact_path` | Path to the artifacts pickle file.<br>TYPE: `str` |
| `summary_experiment_name` | Name of the MLflow experiment for summaries.<br>TYPE: `str` |
| `experiment_name` | Name of the source experiment being summarized.<br>TYPE: `str` |
| `cfg` | Configuration dictionary.<br>TYPE: `DictConfig` |
Source code in src/summarization/summarization_data_wrangling.py
### import_summary_db_from_mlflow

```python
import_summary_db_from_mlflow(
    experiment_name: str,
    summary_exp_name: str,
    cfg: DictConfig,
) -> Dict[str, DataFrame]
```

Import summary database from MLflow artifacts.

Downloads the most recent DuckDB database file from MLflow and reads it into a dictionary of DataFrames.
| PARAMETER | DESCRIPTION |
|---|---|
| `experiment_name` | Name of the source experiment to import summaries for.<br>TYPE: `str` |
| `summary_exp_name` | Name of the MLflow summary experiment.<br>TYPE: `str` |
| `cfg` | Configuration dictionary.<br>TYPE: `DictConfig` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary containing 'data_df' and 'mlflow_runs' DataFrames. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If no runs are found in the MLflow experiment. |
Source code in src/summarization/summarization_data_wrangling.py
### import_summary_dataframe_from_duckdb

Read summary DataFrames from a DuckDB database file.
| PARAMETER | DESCRIPTION |
|---|---|
| `db_path` | Path to the DuckDB database file.<br>TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary containing:<br>- `'data_df'`: Main data DataFrame<br>- `'mlflow_runs'`: MLflow run metadata DataFrame |
Source code in src/summarization/summarization_data_wrangling.py
### export_summary_dataframe_to_duckdb

```python
export_summary_dataframe_to_duckdb(
    db_path: str,
    data: Dict[str, Any],
    debug_DuckDBWrite: bool = False,
) -> str
```

Export summary DataFrames to a DuckDB database file.

Creates tables for data_df and mlflow_runs in the database. Overwrites any existing database at the specified path.
| PARAMETER | DESCRIPTION |
|---|---|
| `db_path` | Path where the DuckDB database will be created.<br>TYPE: `str` |
| `data` | Dictionary containing 'data_df' and 'mlflow_runs' DataFrames.<br>TYPE: `Dict[str, Any]` |
| `debug_DuckDBWrite` | If True, reads back the database to verify write success. Default is False.<br>TYPE: `bool` |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | Path to the created database file. |
Source code in src/summarization/summarization_data_wrangling.py
### export_summarization_flow_data

```python
export_summarization_flow_data(
    data: Dict[str, Any],
    experiment_name: str,
    summary_experiment_name: str,
    cfg: DictConfig,
) -> None
```

Export complete summarization flow data to disk and MLflow.

Saves the data to a DuckDB database and an artifacts pickle file, then logs both to MLflow for reproducibility.
| PARAMETER | DESCRIPTION |
|---|---|
| `data` | Summary data dictionary containing DataFrames and artifacts.<br>TYPE: `Dict[str, Any]` |
| `experiment_name` | Name of the source experiment being summarized.<br>TYPE: `str` |
| `summary_experiment_name` | Name of the MLflow summary experiment.<br>TYPE: `str` |
| `cfg` | Configuration dictionary.<br>TYPE: `DictConfig` |
Source code in src/summarization/summarization_data_wrangling.py
### flatten_data_per_split

Flatten multi-dimensional data arrays into a single DataFrame.

Converts nested data arrays from (subjects x timepoints) format into a flat format suitable for DataFrame storage.
| PARAMETER | DESCRIPTION |
|---|---|
| `split` | Data split identifier (unused, kept for API consistency). |
| `split_data_dict` | Dictionary with a 'data' key containing {variable: array} mappings.<br>TYPE: `dict` |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | DataFrame with flattened data, one column per variable. |
Source code in src/summarization/summarization_data_wrangling.py
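The flattening step can be sketched as follows; the variable names and array contents are hypothetical examples of (subjects x timepoints) data, not values from the project:

```python
import numpy as np
import pandas as pd

# Hypothetical input: one (subjects x timepoints) array per variable.
split_data_dict = {
    "data": {
        "pupil_size": np.arange(6).reshape(2, 3),       # 2 subjects x 3 timepoints
        "velocity": np.arange(6).reshape(2, 3) * 0.1,
    }
}

# Flatten each array into a single column, one column per variable.
flat = pd.DataFrame(
    {var: arr.ravel() for var, arr in split_data_dict["data"].items()}
)
print(flat.shape)  # (6, 2)
```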
### create_dataframe_from_single_source

Create a combined DataFrame from a single data source.

Flattens data across all splits and combines them into a single DataFrame with source identification.
| PARAMETER | DESCRIPTION |
|---|---|
| `source_data` | Dictionary containing 'df' with nested split data.<br>TYPE: `dict` |
| `source_name` | Identifier for the data source (e.g., method name).<br>TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Combined DataFrame with all splits and a source_name column. |
Source code in src/summarization/summarization_data_wrangling.py
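The stacking-with-identification step can be sketched like this; the split names, column, and source name ("LOCF") are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical per-split DataFrames (already flattened).
splits = {
    "train": pd.DataFrame({"pupil_size": [1.0, 2.0]}),
    "test": pd.DataFrame({"pupil_size": [3.0]}),
}

# Stack all splits, tagging each row with its split and source.
frames = [df.assign(split=split) for split, df in splits.items()]
combined = pd.concat(frames, ignore_index=True)
combined["source_name"] = "LOCF"  # hypothetical imputation method name

print(len(combined))  # 3
```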
### get_artifacts_dict

```python
get_artifacts_dict(
    mlflow_run: Optional[Series], experiment_name: str
) -> Optional[Dict[str, Any]]
```

Load the artifacts dictionary from an MLflow run.

Downloads and loads the pickled artifacts (metrics, predictions) from the specified MLflow run.
| PARAMETER | DESCRIPTION |
|---|---|
| `mlflow_run` | MLflow run metadata as a pandas Series; None for ground-truth sources.<br>TYPE: `Optional[Series]` |
| `experiment_name` | Name of the experiment, used to determine the artifact subdirectory.<br>TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict or None` | Loaded artifacts dictionary, or None if mlflow_run is None. |
Notes
For imputation tasks, 'source_data' is removed from artifacts to save RAM.
Source code in src/summarization/summarization_data_wrangling.py
### concatenate_dataframes_from_disk

Concatenate multiple CSV files into a single DataFrame.

Reads temporary CSV files from disk and combines them into one DataFrame. Used to reduce memory usage during processing.
| PARAMETER | DESCRIPTION |
|---|---|
| `df_sources_tmp_files` | List of paths to temporary CSV files.<br>TYPE: `list` |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Combined DataFrame from all source files. |
Source code in src/summarization/summarization_data_wrangling.py
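A minimal sketch of the temporary-CSV concatenation pattern, with toy single-row chunks standing in for per-source DataFrames:

```python
import os
import tempfile

import pandas as pd

# Write each source chunk to its own temporary CSV, keeping only one
# chunk in memory at a time.
tmp_dir = tempfile.mkdtemp()
df_sources_tmp_files = []
for i, chunk in enumerate([pd.DataFrame({"x": [1]}), pd.DataFrame({"x": [2]})]):
    path = os.path.join(tmp_dir, f"source_{i}.csv")
    chunk.to_csv(path, index=False)
    df_sources_tmp_files.append(path)

# Read the files back and combine them into one DataFrame.
combined = pd.concat(
    (pd.read_csv(p) for p in df_sources_tmp_files), ignore_index=True
)
print(combined["x"].tolist())  # [1, 2]
```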
### get_data_from_sources

```python
get_data_from_sources(
    sources: Dict[str, Dict[str, Any]],
    experiment_name: str,
    cfg: DictConfig,
) -> Dict[str, Any]
```

Extract and combine data from multiple experiment sources.

Processes each source's data into a DataFrame, saves it to temporary files to manage memory, and collects MLflow run metadata and artifacts.
| PARAMETER | DESCRIPTION |
|---|---|
| `sources` | Dictionary mapping source names to their data dictionaries.<br>TYPE: `Dict[str, Dict[str, Any]]` |
| `experiment_name` | Name of the experiment being summarized.<br>TYPE: `str` |
| `cfg` | Configuration dictionary.<br>TYPE: `DictConfig` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary containing:<br>- `'data_df'`: Combined DataFrame from all sources<br>- `'mlflow_runs'`: DataFrame of MLflow run metadata<br>- `'artifacts_dict_summary'`: Dictionary of artifacts per source |
Source code in src/summarization/summarization_data_wrangling.py
### get_detaframe_from_features

Create a summary DataFrame from featurization results.

Combines feature data from multiple sources and splits into a single DataFrame with featurization type annotations.
| PARAMETER | DESCRIPTION |
|---|---|
| `features` | Dictionary mapping feature source names to their data and metadata.<br>TYPE: `dict` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Dictionary containing:<br>- `'data_df'`: Combined features DataFrame<br>- `'mlflow_runs'`: MLflow run metadata<br>- `'artifacts_dict_summary'`: Empty dict (placeholder) |
Source code in src/summarization/summarization_data_wrangling.py
### get_summarization_flow_data

```python
get_summarization_flow_data(
    cfg: DictConfig,
    experiment_name: str,
    summary_exp_name: str,
) -> dict
```

Get or generate summarization data for an experiment.

Either imports existing summaries from DuckDB/MLflow or generates new summaries by processing all experiment sources.
| PARAMETER | DESCRIPTION |
|---|---|
| `cfg` | Configuration with SUMMARIZATION settings.<br>TYPE: `DictConfig` |
| `experiment_name` | Name of the experiment to summarize.<br>TYPE: `str` |
| `summary_exp_name` | Name of the summary experiment in MLflow.<br>TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Summary data dictionary containing DataFrames and artifacts. |
Source code in src/summarization/summarization_data_wrangling.py
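The import-or-generate decision can be sketched as below; the config key `import_from_mlflow` and the callables are hypothetical placeholders for the actual import and generation paths:

```python
# Hypothetical sketch of the cache-or-generate pattern described above.
def get_or_generate(cfg, import_fn, generate_fn):
    # The key name here is a placeholder for the real SUMMARIZATION setting.
    if cfg.get("SUMMARIZATION", {}).get("import_from_mlflow", False):
        return import_fn()   # reuse summaries already logged to MLflow
    return generate_fn()     # otherwise process all experiment sources

result = get_or_generate(
    {"SUMMARIZATION": {"import_from_mlflow": False}},
    import_fn=lambda: "imported",
    generate_fn=lambda: "generated",
)
print(result)  # generated
```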
## summary_analysis_main

### summary_analysis_main

Execute the main summary analysis on collected flow results.

Placeholder for analysis logic that processes summarized data from all pipeline stages (outlier detection, imputation, featurization, classification).
| PARAMETER | DESCRIPTION |
|---|---|
| `flow_results` | Dictionary containing summarization data from each pipeline stage.<br>TYPE: `dict` |
| `cfg` | Configuration dictionary.<br>TYPE: `DictConfig` |
Notes

Currently a stub; the analysis logic is yet to be implemented.