meganorm.utils package
Submodules
meganorm.utils.EEGlab module
- class meganorm.utils.EEGlab.RawEEGLAB(input_fname, eog=(), preload=False, *, uint16_codec=None, montage_units='auto', verbose=None)[source]
Bases: BaseRaw
Raw object from EEGLAB .set file.
- Parameters:
input_fname (path-like) – Path to the .set file. If the data is stored in a separate .fdt file, it is expected to be in the same folder as the .set file.
eog (list | tuple | 'auto') – Names or indices of channels that should be designated EOG channels. If ‘auto’, the channel names containing EOG or EYE are used. Defaults to empty tuple.
preload (bool or str (default False)) – Preload data into memory for data manipulation and faster indexing. If True, the data will be preloaded into memory (fast, requires large amount of memory). If preload is a string, preload is the file name of a memory-mapped file which is used to store the data on the hard drive (slower, requires less memory). Note that preload=False will be effective only if the data is stored in a separate binary file.
uint16_codec (str | None) – If your set file contains non-ascii characters, sometimes reading it may fail and give rise to error message stating that “buffer is too small”. uint16_codec allows to specify what codec (for example: ‘latin1’ or ‘utf-8’) should be used when reading character arrays and can therefore help you solve this problem.
montage_units (str) – Units that channel positions are represented in. Defaults to “mm” (millimeters), but can be any prefix + “m” combination (including just “m” for meters).
Added in version 1.3.
verbose (bool | str | int | None) – Control verbosity of the logging output. If None, use the default verbosity level. See the logging documentation and mne.verbose() for details. Should only be passed as a keyword argument.
See also
mne.io.Raw – Documentation of attributes and methods.
Notes
Added in version 0.11.0.
- Attributes:
annotations – Annotations for marking segments of data.
ch_names – Channel names.
compensation_grade – The current gradient compensation grade.
duration – Duration of the data in seconds.
filenames – The filenames used.
first_samp – The first data sample.
first_time – The first time point (including first_samp but not meas_date).
last_samp – The last data sample.
n_times – Number of time points.
proj – Whether or not projections are active.
times – Time points.
Methods
add_channels(add_list[, force_update_info]) – Append new channels to the instance.
add_events(events[, stim_channel, replace]) – Add events to stim channel.
add_proj(projs[, remove_existing, verbose]) – Add SSP projection vectors.
add_reference_channels(ref_channels) – Add reference channels to data that consists of all zeros.
anonymize([daysback, keep_his, verbose]) – Anonymize measurement information in place.
append(raws[, preload]) – Concatenate raw instances as if they were continuous.
apply_function(fun[, picks, dtype, n_jobs, ...]) – Apply a function to a subset of channels.
apply_gradient_compensation(grade[, verbose]) – Apply CTF gradient compensation.
apply_hilbert([picks, envelope, n_jobs, ...]) – Compute analytic signal or envelope for a subset of channels/vertices.
apply_proj([verbose]) – Apply the signal space projection (SSP) operators to the data.
close() – Clean up the object.
compute_psd([method, fmin, fmax, tmin, ...]) – Perform spectral analysis on sensor data.
compute_tfr(method, freqs, *[, tmin, tmax, ...]) – Compute a time-frequency representation of sensor data.
copy() – Return copy of Raw instance.
crop([tmin, tmax, include_tmax, verbose]) – Crop raw data file.
crop_by_annotations([annotations, verbose]) – Get crops of raw data file for selected annotations.
del_proj([idx]) – Remove SSP projection vector.
describe([data_frame]) – Describe channels (name, type, descriptive statistics).
drop_channels(ch_names[, on_missing]) – Drop channel(s).
export(fname[, fmt, physical_range, ...]) – Export Raw to external formats.
filter(l_freq, h_freq[, picks, ...]) – Filter a subset of channels/vertices.
get_channel_types([picks, unique, only_data_chs]) – Get a list of channel type for each channel.
get_data([picks, start, stop, ...]) – Get data in the given range.
get_montage() – Get a DigMontage from instance.
interpolate_bads([reset_bads, mode, origin, ...]) – Interpolate bad MEG and EEG channels.
load_bad_channels([bad_file, force, verbose]) – Mark channels as bad from a text file.
load_data([verbose]) – Load raw data.
notch_filter(freqs[, picks, filter_length, ...]) – Notch filter a subset of channels.
pick(picks[, exclude, verbose]) – Pick a subset of channels.
pick_channels(ch_names[, ordered, verbose])
pick_types([meg, eeg, stim, eog, ecg, emg, ...])
plot([events, duration, start, n_channels, ...]) – Plot raw data.
plot_projs_topomap([ch_type, sensors, ...]) – Plot SSP vector.
plot_psd([fmin, fmax, tmin, tmax, picks, ...])
plot_psd_topo([tmin, tmax, fmin, fmax, ...])
plot_psd_topomap([bands, tmin, tmax, ...])
plot_sensors([kind, ch_type, title, ...]) – Plot sensor positions.
rename_channels(mapping[, allow_duplicates, ...]) – Rename channels.
reorder_channels(ch_names) – Reorder channels.
resample(sfreq, *[, npad, window, ...]) – Resample all channels.
rescale(scalings, *[, verbose]) – Rescale channels.
save(fname[, picks, tmin, tmax, ...]) – Save raw data to file.
savgol_filter(h_freq[, verbose]) – Filter the data using Savitzky-Golay polynomial method.
set_annotations(annotations[, emit_warning, ...]) – Setter for annotations.
set_channel_types(mapping, *[, ...]) – Specify the sensor types of channels.
set_eeg_reference([ref_channels, ...]) – Specify which reference to use for EEG data.
set_meas_date(meas_date) – Set the measurement start date.
set_montage(montage[, match_case, ...]) – Set EEG/sEEG/ECoG/DBS/fNIRS channel positions and digitization points.
time_as_index(times[, use_rounding, origin]) – Convert time to indices.
to_data_frame([picks, index, scalings, ...]) – Export data in tabular structure as a pandas DataFrame.
- meganorm.utils.EEGlab.read_raw_eeglab(input_fname, eog=(), preload=False, uint16_codec=None, montage_units='auto', verbose=None) RawEEGLAB[source]
Read an EEGLAB .set file.
- Parameters:
input_fname (path-like) – Path to the .set file. If the data is stored in a separate .fdt file, it is expected to be in the same folder as the .set file.
eog (list | tuple | 'auto') – Names or indices of channels that should be designated EOG channels. If ‘auto’, the channel names containing EOG or EYE are used. Defaults to empty tuple.
preload (bool or str (default False)) – Preload data into memory for data manipulation and faster indexing. If True, the data will be preloaded into memory (fast, requires large amount of memory). If preload is a string, preload is the file name of a memory-mapped file which is used to store the data on the hard drive (slower, requires less memory). Note that preload=False will be effective only if the data is stored in a separate binary file.
uint16_codec (str | None) – If your set file contains non-ascii characters, sometimes reading it may fail and give rise to error message stating that “buffer is too small”. uint16_codec allows to specify what codec (for example: ‘latin1’ or ‘utf-8’) should be used when reading character arrays and can therefore help you solve this problem.
montage_units (str) – Units that channel positions are represented in. Defaults to “mm” (millimeters), but can be any prefix + “m” combination (including just “m” for meters).
Added in version 1.3.
Changed in version 1.6: Support for 'auto' was added and is the new default.
verbose (bool | str | int | None) – Control verbosity of the logging output. If None, use the default verbosity level. See the logging documentation and mne.verbose() for details. Should only be passed as a keyword argument.
- Returns:
raw – A Raw object containing EEGLAB .set data. See mne.io.Raw for documentation of attributes and methods.
- Return type:
instance of RawEEGLAB
See also
mne.io.Raw – Documentation of attributes and methods of RawEEGLAB.
Notes
Added in version 0.11.0.
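The eog=‘auto’ rule described in the parameters can be pictured with a small stand-alone sketch. This is not the library's code: the helper name guess_eog_channels is hypothetical, and matching case-insensitively is an assumption.

```python
def guess_eog_channels(ch_names, substrings=("EOG", "EYE")):
    """Return indices of channels whose name contains any of the substrings.

    Hypothetical helper illustrating the eog='auto' rule. Upper-casing the
    names (a case-insensitive match) is an assumption, not documented here.
    """
    return [
        idx
        for idx, name in enumerate(ch_names)
        if any(sub in name.upper() for sub in substrings)
    ]


channels = ["Fp1", "Fp2", "EOG1", "LeftEye", "Cz"]
eog_idx = guess_eog_channels(channels)
print(eog_idx)  # [2, 3]
```

The returned indices could then be passed as the eog argument instead of ‘auto’ when explicit control over the selection is preferred.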
meganorm.utils.IO module
- meganorm.utils.IO.factorize_columns(df: DataFrame, columns: list)[source]
Factorizes specified columns in the DataFrame. For the ‘diagnosis’ column, it assigns 0 to ‘control’ and factorizes the rest.
- Parameters:
df (pandas.DataFrame) – The DataFrame containing the columns to be factorized.
columns (list) – List of column names to be factorized.
- Returns:
df – DataFrame with factorized columns.
- Return type:
pandas.DataFrame
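As a rough, dependency-free illustration of the encoding described for the ‘diagnosis’ column, the sketch below reserves code 0 for ‘control’ and numbers the remaining labels. The helper name factorize_with_reference is hypothetical, and numbering the other labels in order of first appearance is an assumption.

```python
def factorize_with_reference(values, reference="control"):
    """Map labels to integer codes, reserving 0 for the reference label.

    Assumption: non-reference labels get 1, 2, ... in order of first
    appearance; the real function may order them differently.
    """
    codes = {reference: 0}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


diagnosis = ["adhd", "control", "asd", "adhd", "control"]
encoded, codes = factorize_with_reference(diagnosis)
print(encoded)  # [1, 0, 2, 1, 0]
```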
- meganorm.utils.IO.make_config(path=None)[source]
Create a configuration dictionary for a neuroimaging preprocessing pipeline.
This function generates configuration settings for preprocessing, feature extraction, spectral analysis, and other relevant parameters used in processing EEG/MEG data. Optionally, it saves the generated configuration to a JSON file in the specified path.
- Parameters:
path (str, optional) – The directory path where the configuration file should be saved. If not provided, the configuration is not saved to a file.
- Returns:
config – The configuration dictionary containing settings for preprocessing, feature extraction, and analysis.
- Return type:
dict
Notes
The generated configuration includes settings for ICA preprocessing, spectral estimation, and feature extraction for EEG/MEG data.
Default values are provided for the majority of settings.
If path is provided, a .json file containing the configuration will be saved.
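The optional-save behaviour in these notes might look roughly like the following sketch. Every key, value, and the file name configs.json is a hypothetical placeholder, since the actual contents of the configuration are not listed here.

```python
import json
import os
import tempfile


def make_config_sketch(path=None):
    """Build a settings dict and optionally write it to <path>/configs.json.

    All keys, values, and the file name are hypothetical placeholders.
    """
    config = {
        "which_sensor": "eeg",   # hypothetical key
        "ica_n_components": 30,  # hypothetical ICA setting
        "psd_method": "welch",   # hypothetical spectral setting
    }
    if path is not None:
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "configs.json"), "w") as f:
            json.dump(config, f, indent=4)
    return config


# Round trip: save into a temporary folder and load the JSON back.
with tempfile.TemporaryDirectory() as tmp:
    cfg = make_config_sketch(tmp)
    with open(os.path.join(tmp, "configs.json")) as f:
        reloaded = json.load(f)
```

Returning the dictionary in both cases keeps the function usable in memory-only workflows while still leaving a reproducible record on disk when a path is given.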
- meganorm.utils.IO.make_demo_file_bids(file_dir: str, save_dir: str, id_col: int, age_col: int, *columns) None[source]
Convert demographic data from different source formats into a single format so it can be used in later stages.
- Parameters:
file_dir (str) – Path to the input demographic file (supports CSV, TSV, or XLSX).
save_dir (str) – Path where the BIDS-formatted demographic file will be saved (as TSV).
id_col (int) – Column index containing the participant ID.
age_col (int) – Column index containing participant age.
*extra_columns (dict) –
Additional column definitions. While the participant ID and age columns are given as positional arguments, extra columns (e.g., sex and eyes condition) can be revised and converted to a single format across datasets by passing one dict per column. Each dict can contain:
- ‘col_name’: str, required name for the output column. This does not necessarily match the column name before being passed to this function.
- ‘col_id’: int, index of the column that the revision should be applied to.
- ‘single_value’: value to assign to all rows if no col_id and mapping are given. This can be helpful when all subjects in a dataset have the same property, e.g., the eyes-open condition.
- ‘mapping’: dict, if single_value is not defined, a value mapping can be passed to map the initial values to the target values.
- Return type:
None
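To make the extra-column dicts more concrete, here is a minimal sketch of how one definition could be resolved into output values. The helper build_extra_column is hypothetical and works on plain tuples rather than a DataFrame; leaving unmapped values unchanged is an assumption.

```python
def build_extra_column(rows, spec):
    """Return the output values for one extra-column definition.

    rows: list of row tuples; spec: dict with the keys described above.
    Hypothetical helper; the real function may resolve the keys differently.
    """
    if "single_value" in spec and "col_id" not in spec:
        # Same value for every participant, e.g. an eyes-open-only dataset.
        return [spec["single_value"]] * len(rows)
    raw = [row[spec["col_id"]] for row in rows]
    mapping = spec.get("mapping")
    # Values missing from the mapping are left unchanged (an assumption).
    return [mapping.get(v, v) for v in raw] if mapping else raw


rows = [("sub-01", 23, "M"), ("sub-02", 31, "F")]
sex_spec = {"col_name": "sex", "col_id": 2, "mapping": {"M": 0, "F": 1}}
eyes_spec = {"col_name": "eyes", "single_value": "open"}

print(build_extra_column(rows, sex_spec))   # [0, 1]
print(build_extra_column(rows, eyes_spec))  # ['open', 'open']
```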
- meganorm.utils.IO.merge_datasets_with_glob(datasets)[source]
Merges file paths across multiple datasets using glob pattern matching.
This function walks through the provided datasets’ base directories to find subject folders and file paths matching a specified task and file ending. It creates a dictionary mapping each subject to a glob pattern that can be used to aggregate files across multiple runs or sessions.
- Parameters:
datasets (dict) –
Dictionary where each key is a dataset name, and each value is a dictionary with the following keys:
”base_dir” (str): Base directory containing subject subdirectories.
”task” (str): Task keyword to search for in filenames.
”ending” (str): File ending (e.g., ‘.nii.gz’) to filter relevant files.
- Returns:
A dictionary mapping subject IDs to a glob-style path string that aggregates all matching files for that subject. Only subjects with at least one matched file are included.
- Return type:
dict
Notes
This function is designed to assist in scenarios where each subject may have multiple files (e.g., different runs or sessions), and the goal is to create a single pattern that can be used to load all related files for a subject.
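The idea of one glob pattern per subject can be sketched as follows. The function name subject_glob_patterns, the assumed folder layout (subject folders directly under base_dir), and the exact pattern shape are illustrative assumptions rather than the package's implementation.

```python
import glob
import os
import tempfile


def subject_glob_patterns(datasets):
    """Map each subject folder to one glob pattern covering its matching files.

    Assumption: subject folders sit directly under base_dir and the pattern
    is '<subject_dir>/**/*<task>*<ending>'; the real layout may differ.
    """
    patterns = {}
    for info in datasets.values():
        for subject in sorted(os.listdir(info["base_dir"])):
            subject_dir = os.path.join(info["base_dir"], subject)
            if not os.path.isdir(subject_dir):
                continue
            pattern = os.path.join(
                subject_dir, "**", f"*{info['task']}*{info['ending']}"
            )
            # Keep only subjects with at least one matching file.
            if glob.glob(pattern, recursive=True):
                patterns[subject] = pattern
    return patterns


with tempfile.TemporaryDirectory() as base:
    for sub in ("sub-01", "sub-02"):
        os.makedirs(os.path.join(base, sub, "eeg"))
    for run in (1, 2):
        name = f"sub-01_task-rest_run-{run}_eeg.set"
        open(os.path.join(base, "sub-01", "eeg", name), "w").close()
    demo = {"demo": {"base_dir": base, "task": "rest", "ending": ".set"}}
    found = subject_glob_patterns(demo)
    n_files = len(glob.glob(found["sub-01"], recursive=True))
```

In this toy layout only sub-01 has matching files, so sub-02 is left out of the result, and the single pattern for sub-01 expands to both runs.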
- meganorm.utils.IO.merge_fidp_demo(datasets_paths: list, features_dir: str, dataset_names: list, drop_columns: list = ['eyes'])[source]
Merge demographic metadata and extracted features into a single DataFrame.
This function loads demographic data and feature data, assigns a site label to each participant if missing, removes unnecessary columns, and merges demographic information with corresponding extracted features.
- Parameters:
datasets_paths (list) – List of paths to the dataset directories containing demographic files (‘participants_bids.tsv’).
features_dir (str) – Path to the directory containing the extracted features (‘all_features.csv’).
dataset_names (list of str) – List of dataset names corresponding to each dataset path. Used to populate missing ‘site’ information if necessary.
drop_columns (list of str, optional) – Columns to drop from the demographic data before merging. Default is [“eyes”].
- Returns:
data (pandas.DataFrame) – Merged DataFrame containing both demographic information and feature data, with participants indexed as strings.
- Raises:
FileNotFoundError – If the ‘participants_bids.tsv’ file is missing in any of the dataset paths or the ‘all_features.csv’ file is missing in the provided features directory.
- meganorm.utils.IO.normalize_column(df, column='age', normalizer=100)[source]
Normalizes a specified column in the DataFrame by dividing its values by the given normalizer.
- Parameters:
df (pandas.DataFrame) – The DataFrame containing the column to be normalized.
column (str, optional) – The column to be normalized (default is “age”).
normalizer (float or None, optional) – The value by which the column will be divided. If None, the column will not be normalized.
- Returns:
df – DataFrame with the normalized column.
- Return type:
pandas.DataFrame
- Raises:
KeyError – If the specified column does not exist in the DataFrame.
ValueError – If the normalizer is not a positive numeric value.
- meganorm.utils.IO.separate_eyes_open_close_eeglab(input_base_path, output_base_path, annotation_description_open, annotation_description_close, trim_before=5, trim_after=5)[source]
- meganorm.utils.IO.separate_patient_data(df, diagnosis: list)[source]
Separates patients’ data from control data based on the diagnosis column.
- Parameters:
df (pandas.DataFrame) – The DataFrame containing the patient data.
diagnosis (list of str) – A list of diagnosis values used to separate patients’ data.
- Returns:
df (pandas.DataFrame) – The DataFrame containing only control data (after dropping the ‘diagnosis’ column).
df_patient (pandas.DataFrame) – The DataFrame containing the patient data.
- Raises:
KeyError – If the ‘diagnosis’ column is not found in the DataFrame.
- meganorm.utils.IO.storeFooofModels(path, subjId, fooofModels, psds, freqs) None[source]
Stores the periodic and aperiodic results from FOOOF analysis in a pickle file.
This function saves the FOOOF models, the power spectral densities (PSDs), and the associated frequency data for a given subject into a .pickle file. The data is appended to the file for each subject.
- Parameters:
path (str) – Directory path where the results will be saved.
subjId (str) – The subject ID for which the results are saved.
fooofModels (object) – The FOOOF model object containing the periodic and aperiodic components.
psds (ndarray) – Power Spectral Densities (PSDs) calculated for the subject.
freqs (ndarray) – Frequency values corresponding to the PSDs.
- Returns:
This function does not return any value; it writes the results to a file.
- Return type:
None
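Appending per-subject results to a single pickle file, as described above, can be done by writing one pickled record after another and reading until the end of the file. The sketch below shows that pattern with plain dictionaries; the file name and record layout are hypothetical, and real FOOOF model objects are replaced by simple lists.

```python
import os
import pickle
import tempfile


def append_record(file_path, subj_id, payload):
    """Append one {subj_id: payload} record to a pickle file."""
    with open(file_path, "ab") as f:  # 'ab' appends instead of overwriting
        pickle.dump({subj_id: payload}, f)


def load_records(file_path):
    """Read every appended record back into a single dict."""
    merged = {}
    with open(file_path, "rb") as f:
        while True:
            try:
                merged.update(pickle.load(f))
            except EOFError:  # no more records in the file
                break
    return merged


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "fooof_results.pickle")  # hypothetical file name
    append_record(out, "sub-01", {"freqs": [1.0, 2.0], "psds": [0.5, 0.4]})
    append_record(out, "sub-02", {"freqs": [1.0, 2.0], "psds": [0.7, 0.6]})
    records = load_records(out)
```

A file written this way cannot be read with a single pickle.load call, because that returns only the first record; the loop in load_records is what recovers all subjects.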
meganorm.utils.freesurfer module
- meganorm.utils.freesurfer.check_log_for_success(results_directory, subject_ids)[source]
Check the log file for the success message.
- Parameters:
results_directory (str) – Path for the freesurfer results.
subject_ids (list) – List of subject IDs.
- Returns:
List of failed subject IDs.
- Return type:
List
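A log check of this kind could be sketched as below. The log location (<results>/<subject>/scripts/recon-all.log) and the success text “finished without error” are assumptions about a typical recon-all run, and find_failed_subjects is a hypothetical name.

```python
import os
import tempfile

SUCCESS_MARKER = "finished without error"  # assumed success message


def find_failed_subjects(results_directory, subject_ids):
    """Return subjects whose recon-all log is missing or lacks the marker.

    Assumption: each log lives at <results>/<subject>/scripts/recon-all.log.
    """
    failed = []
    for subject in subject_ids:
        log_path = os.path.join(
            results_directory, subject, "scripts", "recon-all.log"
        )
        try:
            with open(log_path, errors="ignore") as f:
                ok = SUCCESS_MARKER in f.read()
        except FileNotFoundError:
            ok = False
        if not ok:
            failed.append(subject)
    return failed


with tempfile.TemporaryDirectory() as results:
    logs = {
        "sub-01": "recon-all -s sub-01 finished without error",
        "sub-02": "recon-all -s sub-02 exited with ERRORS",
    }
    for subject, text in logs.items():
        scripts = os.path.join(results, subject, "scripts")
        os.makedirs(scripts)
        with open(os.path.join(scripts, "recon-all.log"), "w") as f:
            f.write(text)
    failed = find_failed_subjects(results, ["sub-01", "sub-02", "sub-03"])
    print(failed)  # ['sub-02', 'sub-03']
```

Treating a missing log as a failure means subjects that never started are reported together with those that crashed.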
- meganorm.utils.freesurfer.create_slurm_script(subjects_directory, subject_id, results_dir, processing_directory, freesurfer_path, nodes=1, ntasks=1, cpus_per_task=1, mem='16G', time='48:00:00', i_option=True, file_postfix='.nii')[source]
Create a Slurm batch script for running recon-all with given parameters.
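For orientation, a script of this kind might be assembled as in the sketch below. It covers only a subset of the parameters in the signature, the function name slurm_script_text is hypothetical, and the reading of i_option as a switch for recon-all's -i input flag is an assumption.

```python
def slurm_script_text(subject_id, input_file, results_dir, nodes=1, ntasks=1,
                      cpus_per_task=1, mem="16G", time="48:00:00",
                      i_option=True):
    """Compose a Slurm batch script that runs recon-all for one subject."""
    command = ["recon-all", "-s", subject_id]
    if i_option:
        # Assumption: i_option toggles recon-all's -i <input volume> flag.
        command += ["-i", input_file]
    command += ["-sd", results_dir, "-all"]
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={subject_id}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --mem={mem}",
        f"#SBATCH --time={time}",
        "",
        " ".join(command),
    ]
    return "\n".join(lines) + "\n"


script = slurm_script_text("sub-01", "/data/sub-01.nii", "/results")
print(script)
```

Writing the returned text to a file and submitting it with sbatch would be the remaining steps; those are left out so the sketch stays free of side effects.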
- meganorm.utils.freesurfer.freesurfer_QC(results_directory)[source]
Performs Euler number based quality control on the results of Freesurfer.
- Parameters:
results_directory (str) – The path to the Freesurfer results directory.
- Returns:
qc_passed_samples (list) – List of passed QC subjects.
qc_failed_samples (list) – List of failed QC subjects.
missing_samples (list) – List of missing subjects.
- meganorm.utils.freesurfer.list_subject_ids(directory, save_path=None)[source]
Retrieves all folders in the given directory as subject IDs and, if save_path is given, stores them in a text file.
- Parameters:
directory (str) – Path to data directory.
save_path (str) – If specified, path to the text file to save the subject IDs, e.g. "/home/subjects.txt". Defaults to None.
- Returns:
List of subject IDs.
- Return type:
list
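A minimal sketch of this behaviour, assuming one ID per line in the saved file and alphabetical ordering (neither is stated above), could look like this; list_subject_folders is a hypothetical name.

```python
import os
import tempfile


def list_subject_folders(directory, save_path=None):
    """Return sorted sub-folder names and optionally write them one per line."""
    subject_ids = sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )
    if save_path is not None:
        with open(save_path, "w") as f:
            f.write("\n".join(subject_ids))
    return subject_ids


with tempfile.TemporaryDirectory() as data_dir:
    os.makedirs(os.path.join(data_dir, "sub-02"))
    os.makedirs(os.path.join(data_dir, "sub-01"))
    # A regular file in the data directory is ignored.
    open(os.path.join(data_dir, "README.txt"), "w").close()
    txt = os.path.join(data_dir, "subjects.txt")
    ids = list_subject_folders(data_dir, save_path=txt)
    with open(txt) as f:
        saved = f.read()
```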
- meganorm.utils.freesurfer.prepare_mri_data(mri_directory)[source]
This function is written to prepare the BTNRH MRI data for recon-all processing.
- Parameters:
mri_directory (str) – Directory to MRI data
- meganorm.utils.freesurfer.rerun_failed_subs(failed_subjetcs, subjects_directory, results_directory, processing_directory, freesurfer_path, file_postfix='.nii')[source]
Re-runs Freesurfer recon-all for failed subjects.
- Parameters:
failed_subjetcs (list) – List of failed subject IDs.
subjects_directory (str) – Path to data.
results_directory (str) – Path to save the results.
processing_directory (str) – Path to save the bash script.
freesurfer_path (str) – Path to freesurfer.
- meganorm.utils.freesurfer.run_parallel_reconall(subjects_directory, results_directory, processing_directory, freesurfer_path, file_postfix='.nii')[source]
Runs Freesurfer recon-all in parallel on a Slurm cluster.
- Parameters:
subjects_directory (str) – Path to data.
results_directory (str) – Path to save the results.
processing_directory (str) – Path to save the bash script.
freesurfer_path (str) – Path to freesurfer.
file_postfix (str) – file postfix for nifti files (could be different from one dataset to another).
- Returns:
A list of subject IDs.
meganorm.utils.nm module
- meganorm.utils.nm.abnormal_probability(processing_dir: str, nm_processing_dir: str, n_permutation: int = 1000, site_id: int = None, healthy_data_prefix: str = '', patient_data_prefix: str = '')[source]
Computes the abnormality probability index for both control and patient groups based on z-scores from normative modeling. Then calculates the AUC between these two groups and estimates the statistical significance of AUC values using permutation testing. Finally, it applies false discovery rate (FDR) correction to the p-values.
- Parameters:
processing_dir (str) – Path to the directory containing z-score files.
nm_processing_dir (str) – Path to normative modeling directory containing batch info.
n_permutation (int, optional) – Number of permutations for statistical testing (default is 1000).
site_id (int, optional) – If provided, filters both healthy and patient data by this site ID.
healthy_data_prefix (str, optional) – Prefix used for healthy subject files (e.g., ‘control’).
patient_data_prefix (str, optional) – Prefix used for patient subject files (e.g., ‘patient’).
- Returns:
p_val (np.ndarray) – Adjusted p-values for each biomarker based on FDR correction.
auc (np.ndarray) – AUC values comparing abnormal probability between groups.
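The statistical steps named in the description (AUC, permutation testing, FDR correction) can be illustrated with a small pure-Python sketch. It does not reproduce how the abnormal probability index itself is computed; the one-sided permutation test and the choice of the Benjamini-Hochberg procedure for FDR are assumptions, and all function names are hypothetical.

```python
import random


def auc_score(controls, patients):
    """Probability that a random patient scores above a random control."""
    wins = sum(
        1.0 if p > c else 0.5 if p == c else 0.0
        for p in patients for c in controls
    )
    return wins / (len(patients) * len(controls))


def permutation_p_value(controls, patients, n_permutation=1000, seed=0):
    """One-sided p-value: share of label shuffles with AUC >= observed."""
    rng = random.Random(seed)
    observed = auc_score(controls, patients)
    pooled = list(controls) + list(patients)
    n_c = len(controls)
    hits = 0
    for _ in range(n_permutation):
        rng.shuffle(pooled)
        if auc_score(pooled[:n_c], pooled[n_c:]) >= observed:
            hits += 1
    return (hits + 1) / (n_permutation + 1)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest p-value to smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


controls = [0.1, 0.2, 0.3, 0.4, 0.5]
patients = [0.6, 0.7, 0.8, 0.9, 1.0]
auc = auc_score(controls, patients)
p_val = permutation_p_value(controls, patients)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Adding one to both the hit count and the number of permutations keeps the estimated p-value strictly above zero, which is a common convention for permutation tests.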
- meganorm.utils.nm.aggregate_metrics_across_runs(path: str, method_name: str, biomarker_names: list, valcovfile_path: str, valrespfile_path: str, valbefile: str, metrics: list = ['skewness', 'kurtosis', 'W', 'MACE', 'SMSE'], num_runs: int = 10, quantiles: list = [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99], outputsuffix: str = 'estimate', zscore_clipping_value: float = 8.0)[source]
Aggregates statistical metrics across multiple runs for given biomarkers.
This function evaluates and aggregates 4 statistical metrics, namely skewness, kurtosis, mean absolute centiles error (MACE), and W, for a set of biomarkers across multiple runs. The resulting data can be used later for plotting. See also: plot_metrics().
- Parameters:
path (str) – The directory path containing the individual run folders.
method_name (str) – The name of the method folder within each run’s directory. Since different HBR configurations can be saved in each run directory, method_name should be specified.
biomarker_names (list of str) – A list of biomarker names for which metrics are to be calculated.
valcovfile_path (str) – The file path to the validation covariates file.
valrespfile_path (str) – The file path to the validation response file.
valbefile (str) – The file path to the validation batch-effects file.
metrics (list of str, optional) – A list of metrics to compute for each biomarker. Options include “skewness”, “kurtosis”, “W”, “MACE”, and “SMSE”. Default is [“skewness”, “kurtosis”, “W”, “MACE”, “SMSE”].
num_runs (int, optional) – The number of runs to aggregate metrics across. Default is 10.
quantiles (list of float, optional) – A list of quantiles to use for MACE evaluation. Default is [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99]. This tells the function to calculate MACE for these centiles.
outputsuffix (str, optional) – The suffix to append to output files (e.g., for naming model outputs). Default is “estimate”.
zscore_clipping_value (float, optional) – The maximum absolute z-score. Any z-score whose magnitude exceeds this threshold is clipped to it. Default is 8.0. Clipping is applied because kurtosis is highly sensitive to extreme values, and a deviation of |z| > 8 is practically as extreme as |z| = 8.
- Returns:
data – A dictionary where keys are the metric names (e.g., “skewness”, “kurtosis”, “W”, “MACE”) and values are dictionaries with biomarker names as keys and lists of aggregated metric values across runs as values.
- Return type:
dict
Notes
The function performs z-score clipping to limit extreme values, applies the skewness and kurtosis calculations, evaluates MACE using the provided validation data, and computes the W statistic for the test data.
Example
data = aggregate_metrics_across_runs(
    path='/path/to/runs',
    method_name='method_A',
    biomarker_names=['biomarker_1', 'biomarker_2'],
    valcovfile_path='/path/to/valcovfile',
    valrespfile_path='/path/to/valrespfile',
    valbefile='/path/to/valbefile',
    metrics=['MACE', 'W'],
    num_runs=5,
)
- meganorm.utils.nm.cal_stats_for_INOCs(q_path: str, features: list, site_id: int, sex_id: int, age: float, num_of_datasets: int, num_points: int = 100) dict[source]
Calculates population statistics (centiles of variation) given a subject's age, sex, and site.
- Parameters:
q_path (str) – Path to the pickled file containing ‘quantiles’, ‘synthetic_X’, and ‘batch_effects’. This is the output of ‘estimate_centiles()’ function.
features (list of str) – List of biomarker feature names.
site_id (int) – Index representing the participant’s site. If None, averages across all sites.
sex_id (int) – Index representing the participant’s sex.
age (float) – Age of the participant.
num_of_datasets (int) – Number of datasets used to generate quantiles.
num_points (int, optional) – Number of points for synthetic X axis (default is 100).
- Returns:
Dictionary mapping each feature to a list of statistics across quantiles at the given age.
- Return type:
dict
- meganorm.utils.nm.calculate_PNOCs(quantiles_path, gender_ids, frequency_band_model_ids, quantile_id=2, site_id=None, point_num=100, sex_batch_ind=0, site_batch_ind=1, num_of_sexs=2, num_of_datasets=None, age_slices=None)[source]
Prepares the data required for the plot_PNOCs function.
This function slices the covariate into multiple bins and calculates the mean and standard deviation of each frequency band across the population for both sexes.
- Parameters:
quantiles_path (str) – Path to a pickle file containing the keys: ‘quantiles’, ‘synthetic_X’, and ‘batch_effects’.
gender_ids (dict) – Dictionary mapping gender labels (e.g., {“male”: 0, “female”: 1}) to their batch indices.
frequency_band_model_ids (dict) – Dictionary mapping frequency band names (e.g., {“alpha”: 0, “beta”: 1}) to model indices.
quantile_id (int, optional) – Index of the quantile to use from the loaded quantiles array (default is 2). This number corresponds to the ith element of the computed percentiles. If the computed percentiles were [0.05, 0.25, 0.5, 0.75, 0.95], then ‘quantile_id=2’ corresponds to 0.5.
site_id (int, optional) – Site ID to condition the P-NOCs on. If None, PNOCs from all sites are averaged (default is None).
point_num (int, optional) – Number of synthetic data points used in deriving quantiles (default is 100).
sex_batch_ind (int, optional) – Index in the batch array corresponding to sex (default is 0).
site_batch_ind (int, optional) – Index in the batch array corresponding to site (default is 1).
num_of_sexs (int, optional) – Number of sex categories (default is 2).
num_of_datasets (int, optional) – Number of datasets used in data aggregation (required if site_id is None).
age_slices (array-like of int, optional) – Array of starting ages to define age bins. If None, defaults to np.arange(5, 80, 5).
- Returns:
oscilogram (dict) – Nested dictionary with structure: oscilogram[gender][frequency_band] = list of [mean, std] values for each age slice.
age_slices (numpy.ndarray) – Array of age slice start values used for binning.
Notes
- The input pickle file must contain:
‘quantiles’: array of shape (n_samples, n_quantiles, n_models)
‘synthetic_X’: array of age values of shape (n_samples, 1)
‘batch_effects’: array of shape (n_samples, n_batch_dims)
- meganorm.utils.nm.estimate_centiles(processing_dir, bio_num, quantiles=[0.05, 0.25, 0.5, 0.75, 0.95], batch_sizes=[2, 6], age_range=(0, 100), point_num=100, outputsuffix='estimate', save=True)[source]
Estimate centile curves using a normative model for synthetic subjects across batch combinations.
- Parameters:
processing_dir (str) – Path to the normative modeling output directory (Models, log, and batch files).
bio_num (int) – Number of biomarkers or target variables (i.e., number of models to load).
quantiles (list of float, optional) – List of quantiles to estimate (default is [0.05, 0.25, 0.5, 0.75, 0.95]).
batch_sizes (list of int, optional) – List indicating number of levels for each batch variable. Example: [2, 2] for two binary batch variables (e.g., sex and site).
age_range (tuple of float, optional) – Age range over which to generate synthetic samples (default is (0, 100)).
point_num (int, optional) – Number of age points per batch combination (default is 100).
outputsuffix (str, optional) – Suffix used when loading model output files (default is ‘estimate’).
save (bool, optional) – If True, saves the estimated quantiles and synthetic inputs to disk (default is True).
- Returns:
q – Estimated quantile array of shape (N, Q, B) where: - N is the number of synthetic points, - Q is the number of quantiles, - B is the number of biomarkers.
- Return type:
np.ndarray
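How the number of synthetic points N relates to point_num and batch_sizes can be seen in the sketch below. It assumes that every combination of batch levels receives the same evenly spaced age grid with both endpoints included; the ordering of rows and the name synthetic_grid are hypothetical.

```python
from itertools import product


def synthetic_grid(batch_sizes=(2, 6), age_range=(0, 100), point_num=100):
    """Return (age, batch_levels) rows for every batch combination."""
    start, stop = age_range
    step = (stop - start) / (point_num - 1)
    ages = [start + i * step for i in range(point_num)]
    rows = []
    for combo in product(*(range(n) for n in batch_sizes)):
        rows.extend((age, combo) for age in ages)
    return rows


grid = synthetic_grid()
print(len(grid))  # 1200 = 100 ages x (2 x 6) batch combinations
```

Under these assumptions the default arguments give N = 1200, so the returned array would have shape (1200, 5, bio_num) for the five default quantiles.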
- meganorm.utils.nm.evaluate_mace(model_path, X_path, y_path, be_path, save_path=None, model_id=0, quantiles=[0.05, 0.25, 0.5, 0.75, 0.95], plot=False, outputsuffix='ms')[source]
Evaluate model calibration using the Mean Absolute Calibration Error (MACE) metric.
This function computes MACE by comparing model-predicted quantiles with the empirical distribution of outcomes across batch groups. Optionally, it plots a reliability diagram to visually assess calibration performance.
- Parameters:
model_path (str) – Path to the directory containing the saved model and its metadata.
X_path (str) – Path to the test covariates (.pkl file), expected as a pandas DataFrame.
y_path (str) – Path to the true test responses (.pkl file), expected as a pandas DataFrame.
be_path (str) – Path to the batch effect file (.pkl file), with each column as a batch dimension.
save_path (str, optional) – Directory to save the reliability diagram if plot is True. Required when plotting.
model_id (int, optional) – Index of the model (biomarker) to evaluate. Corresponds to index X in ‘NM_0_X_<suffix>.pkl’.
quantiles (list of float, optional) – Quantiles to use for computing calibration (default: [0.05, 0.25, 0.5, 0.75, 0.95]).
plot (bool, optional) – Whether to generate and save a reliability diagram (default: False).
outputsuffix (str, optional) – Suffix of the saved model filename (default: “ms”).
- Returns:
Mean absolute calibration error (MACE) across all batches and batch IDs.
- Return type:
float
Notes
This function assumes all inputs are pickled files in the expected format.
Empirical quantiles are computed within each batch group and compared to the target quantiles.
Plotting requires matplotlib and seaborn.
- Input file formats:
X_path: shape (n_samples, n_features)
y_path: shape (n_samples, n_outputs)
be_path: shape (n_samples, n_batch_dims)
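The calibration comparison can be sketched for a single batch group as follows: for each target quantile, count the share of observations at or below the predicted quantile and average the absolute gaps. The helper mace_single_group is hypothetical, and averaging across batch groups, as the function above does, is left out.

```python
def mace_single_group(y_true, predicted_quantiles, quantiles):
    """Mean absolute gap between nominal and empirical coverage.

    predicted_quantiles[i][j] is the model's j-th quantile for sample i.
    """
    gaps = []
    for j, q in enumerate(quantiles):
        below = sum(
            y <= pred[j] for y, pred in zip(y_true, predicted_quantiles)
        )
        gaps.append(abs(q - below / len(y_true)))
    return sum(gaps) / len(gaps)


quantiles = [0.25, 0.5, 0.75]
y = [1, 2, 3, 4]

# Well calibrated: 25%, 50%, 75% of y fall below the predicted quantiles.
good = [[1.5, 2.5, 3.5]] * len(y)
# Overconfident: every predicted quantile collapses onto the median.
bad = [[2.5, 2.5, 2.5]] * len(y)

mace_good = mace_single_group(y, good, quantiles)
mace_bad = mace_single_group(y, bad, quantiles)
```

A value of zero means the empirical coverage matches every target quantile exactly; in the toy data the collapsed predictions miss the outer quantiles by 0.25 each.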
- meganorm.utils.nm.hbr_data_split(data, save_path, covariates=['age'], batch_effects=None, train_split=0.5, validation_split=None, drop_nans=False, random_seed='23d', prefix='', stratification_columns=['site', 'sex'])[source]
Splits a given DataFrame into training, validation, and test sets for normative modeling, while considering stratification based on specified categorical columns. The data is saved as pickled files for normative modeling (PCNToolkit requires paths to the files).
- Parameters:
data (pd.DataFrame) – A Pandas DataFrame containing the data to be split. Created using functions like “load_camcan_data”.
save_path (str) – Path where the resulting training, validation, and test sets will be saved as pickled files.
covariates (list of str, optional, default=["age"]) – List of covariates to be used in the analysis (default is [“age”]).
batch_effects (list of str, optional, default=None) – List of batch effects to be accounted for in the HBR model. Default is None.
train_split (float, optional, default=0.5) – Proportion of the data to be used for training (default is 0.5).
validation_split (float, optional, default=None) – Proportion of the training data to be used for validation (default is None, meaning no validation set is created).
drop_nans (bool, optional, default=False) – If True, rows with missing values are dropped (default is False).
random_seed (int or str, optional, default="23d") – Seed for random number generation to ensure reproducibility (default is 23d).
prefix (str, optional, default="") – Prefix to be added to the filenames when saving the pickled data (default is “”).
stratification_columns (list of str, optional, default=["site", "sex"]) – List of categorical columns used for stratification during splitting (default is [“site”, “sex”]).
- Returns:
A list of biomarker names (columns in the target y DataFrame), which represent the dependent variables for the HBR normative modeling.
- Return type:
list of str
Notes
- The function performs the following steps:
Drops any rows with missing values if drop_nans=True.
Creates a new column “combination” based on the specified stratification columns.
Splits the data into training, validation (optional), and test sets while preserving the stratification.
Saves the resulting splits (x_train, y_train, b_train, etc.) as pickled files in the specified save_path.
Saves the random seed used for splitting into a separate pickled file.
Returns the names of the biomarkers (columns in y_train).
Example
biomarker_names = hbr_data_split(
    data=df,
    save_path="./data_split/",
    covariates=["age", "sex"],
    batch_effects=["site"],
    train_split=0.7,
    validation_split=0.2,
    random_seed=42,
)
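The stratification step in the notes (a “combination” key built from the stratification columns, then a split that preserves it) can be illustrated without any external library. This is a stand-in for whatever splitter the function uses internally: stratified_split is a hypothetical name, rows are plain dictionaries, and only a train/test split is shown.

```python
import random
from collections import defaultdict


def stratified_split(rows, stratification_columns=("site", "sex"),
                     train_split=0.5, random_seed=42):
    """Split row dicts into train/test while balancing each combination."""
    rng = random.Random(random_seed)
    groups = defaultdict(list)
    for row in rows:
        # The "combination" key joins the stratification values, e.g. "0_1".
        key = "_".join(str(row[c]) for c in stratification_columns)
        groups[key].append(row)
    train, test = [], []
    for key in sorted(groups):
        members = groups[key][:]
        rng.shuffle(members)
        cut = round(len(members) * train_split)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# 16 toy participants: 2 sites x 2 sexes, 4 participants per combination.
rows = [
    {"id": i, "site": i % 2, "sex": (i // 2) % 2, "age": 20 + i}
    for i in range(16)
]
train, test = stratified_split(rows)
```

Because each site-by-sex combination is split separately, both halves keep the same proportions of sites and sexes as the full data set.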
- meganorm.utils.nm.prepare_prediction_data(data: DataFrame, save_path: str, covariates: list[str] = ['age'], batch_effects: list[str] = None, drop_nans: bool = False, prefix: str = '') None[source]
Prepares and saves test data (covariates, batch effects, and targets) for normative model prediction.
- Parameters:
data (pd.DataFrame) – Input dataframe containing covariates, batch effects, and target biomarkers.
save_path (str) – Directory to save the output .pkl files.
covariates (list of str, optional) – List of column names to be used as covariates (default is [“age”]).
batch_effects (list of str, optional) – List of column names to be treated as batch effects. If None, a dummy batch column is used.
drop_nans (bool, optional) – Whether to drop rows containing NaN values (default is False).
prefix (str, optional) – Prefix for the saved .pkl file names (default is “”).
- Saves:
{prefix}x_test.pkl
{prefix}y_test.pkl
{prefix}b_test.pkl
- Return type:
None
- meganorm.utils.nm.shapiro_stat(z_scores, covariates, n_bins=10)[source]
Computes Shapiro-Wilk test statistics for z-scores stratified by covariate bins.
The z-scores are grouped into bins based on the values of the covariate, and the Shapiro-Wilk test for normality is applied within each bin for every feature. The function returns the average Shapiro-Wilk statistic across all bins for each biomarker.
- Parameters:
z_scores (numpy.ndarray) – A 2D array of shape (n_samples, n_features) containing the z-scores for each subject and feature.
covariates (numpy.ndarray) – A 1D or 2D array of shape (n_samples,) or (n_samples, 1) containing the covariate values used for binning.
n_bins (int, optional) – The number of equal-width bins to divide the covariate range into. Default is 10.
- Returns:
A 1D array of length n_features, where each element is the mean Shapiro-Wilk test statistic across bins for the corresponding feature. Bins with fewer than 3 samples yield NaN instead of a test statistic.
- Return type:
numpy.ndarray
Notes
The Shapiro-Wilk test is only performed for bins with at least 3 samples. Bins with fewer samples contribute NaN to the average.
The output values range from 0 to 1, where values closer to 1 suggest better adherence to a normal distribution.
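The binning and per-bin testing described above might look roughly like the following sketch. It assumes NumPy and SciPy are installed and uses `scipy.stats.shapiro`. The function name `binned_shapiro` is hypothetical, and the choice to skip NaN bins with `np.nanmean` when averaging is an assumption, since the text does not say how NaN bins enter the mean.

```python
import numpy as np
from scipy.stats import shapiro


def binned_shapiro(z_scores, covariate, n_bins=10):
    """Mean Shapiro-Wilk W per feature over equal-width covariate bins."""
    z_scores = np.asarray(z_scores, dtype=float)
    covariate = np.asarray(covariate, dtype=float).ravel()
    edges = np.linspace(covariate.min(), covariate.max(), n_bins + 1)
    # Use interior edges only, so the maximum value falls in the last bin.
    bin_ids = np.digitize(covariate, edges[1:-1])

    stats = np.full((n_bins, z_scores.shape[1]), np.nan)
    for b in range(n_bins):
        in_bin = z_scores[bin_ids == b]
        if in_bin.shape[0] < 3:  # the test is only run with at least 3 samples
            continue
        for j in range(in_bin.shape[1]):
            stats[b, j] = shapiro(in_bin[:, j])[0]  # first element is W
    # Assumption: bins left as NaN are ignored when averaging.
    return np.nanmean(stats, axis=0)


rng = np.random.default_rng(0)
z = rng.standard_normal((500, 3))    # synthetic z-scores, 3 features
age = rng.uniform(5, 80, size=500)   # synthetic covariate
w = binned_shapiro(z, age, n_bins=10)
print(w.shape)
```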
- meganorm.utils.nm.wilcoxon_rank_test(proposed_dict, baseline_dict)[source]
Applies the Wilcoxon rank-sum test to compare metric distributions between two model configurations across multiple biomarkers. Applies FDR correction (Benjamini-Hochberg) to the resulting p-values.
- Parameters:
proposed_dict (dict) – Dictionary of metrics for the proposed model configuration. Expected format: {metric: {biomarker: list of values}}.
baseline_dict (dict) – Dictionary of metrics for the baseline model configuration. Same format as proposed_dict.
- Returns:
stat_df (pandas.DataFrame) – DataFrame of Wilcoxon rank-sum test statistics. Rows = metrics, Columns = biomarkers.
pval_df (pandas.DataFrame) – DataFrame of uncorrected p-values.
fdr_corrected_df (pandas.DataFrame) – DataFrame of Benjamini-Hochberg FDR-corrected p-values.
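For readers unfamiliar with the Benjamini-Hochberg step, here is a small pure-Python sketch of the adjustment applied to a flat list of p-values. The function name is hypothetical, and how meganorm groups p-values before correction (per metric, or across all metrics and biomarkers) is not stated here, so this shows the procedure only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
corrected = benjamini_hochberg(raw)
print(corrected)  # approximately [0.02, 0.04, 0.04, 0.02]
```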
meganorm.utils.parallel module
- meganorm.utils.parallel.auto_parallel_feature_extraction(mainParallel_path, features_dir, subjects, job_configs, config_file, username=None, auto_rerun=True, auto_collect=True, max_try=3)[source]
Automatically submits, monitors, and reruns jobs for feature extraction on multiple subjects, and collects the results.
- Parameters:
mainParallel_path (str) – Path to the mainParallel.py script that will be executed in parallel for each subject.
features_dir (str) – Path to the directory where the feature extraction results and temporary files will be saved.
subjects (dict) – A dictionary of subject names (keys) and their corresponding file paths (values).
job_configs (dict) – Dictionary containing job configuration settings (e.g., memory, time, partition, etc.).
config_file (str) – Path to a JSON configuration file containing additional settings for the feature extraction jobs.
username (str, optional) – The SLURM username. If not provided, it will be fetched from the environment. Default is None.
auto_rerun (bool, optional) – Whether to automatically rerun failed jobs. Default is True.
auto_collect (bool, optional) – Whether to automatically collect and merge results after job completion. Default is True.
max_try (int, optional) – The maximum number of retry attempts for failed jobs. Default is 3.
- Returns:
A list of failed jobs after all attempts. If no jobs failed, the list will be empty.
- Return type:
list
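The submit, monitor, and rerun cycle can be pictured with a scheduler-free sketch. Everything here is illustrative: `run_with_retries` and `fake_run_batch` are hypothetical names, the fake batch function stands in for submitting SLURM jobs and waiting for them, and `max_try` is treated as the total number of attempts, which is one possible reading of the parameter.

```python
def run_with_retries(subjects, run_batch, max_try=3):
    """Run a batch, resubmitting only failed subjects, up to max_try attempts."""
    pending = dict(subjects)
    for _ in range(max_try):
        failed = run_batch(pending)  # returns the names of failed subjects
        if not failed:
            return []
        # Keep only the failed subjects for the next attempt.
        pending = {name: pending[name] for name in failed}
    return sorted(pending)


# Hypothetical stand-in for job submission and monitoring:
# "sub-02" fails on the first attempt only, "sub-03" always fails.
calls = []


def fake_run_batch(batch):
    calls.append(sorted(batch))
    failed = [name for name in batch if name == "sub-03"]
    if len(calls) == 1 and "sub-02" in batch:
        failed.append("sub-02")
    return failed


subjects = {
    "sub-01": "/data/sub-01.fif",
    "sub-02": "/data/sub-02.fif",
    "sub-03": "/data/sub-03.fif",
}
still_failed = run_with_retries(subjects, fake_run_batch, max_try=3)
print(still_failed)  # ['sub-03']
```

Resubmitting only the failed subset keeps later attempts cheap, and the returned list matches the documented behaviour of being empty when nothing failed.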
- meganorm.utils.parallel.check_jobs_status(username, start_time, delay=20)[source]
Checks the status of jobs submitted to the SLURM cluster.
- Parameters:
username (str) – The SLURM username used to check the status of the jobs.
start_time (str) – The start time for the batch job submission, formatted as ‘YYYY-MM-DDTHH:MM:SS’. This is used to identify the specific set of jobs submitted in the submit_jobs function.
delay (int, optional) – The delay, in seconds, between each check of job status. Default is 20 seconds.
- Returns:
A list of names of jobs that have failed.
- Return type:
list
- meganorm.utils.parallel.check_user_jobs(username, start_time)[source]
Utility function for counting the status of jobs submitted to the SLURM scheduler.
- Parameters:
username (str) – The SLURM username used to check the status of the jobs.
start_time (str) – The start time for the batch job submission, formatted as ‘YYYY-MM-DDTHH:MM:SS’. This is used to filter the jobs that were submitted after the specified start time.
- Returns:
A tuple containing:
status_counts (dict) – A dictionary with counts of jobs in various states (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED).
failed_jobs (list) – A list of job names that have failed.
- Return type:
tuple
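As a rough illustration of the returned tuple, the sketch below counts states from text in a `JobName|State` layout, such as `sacct --parsable2 --noheader --format=JobName,State` would print. Whether meganorm queries SLURM this way is an assumption, and `summarize_jobs` is a hypothetical helper, not part of the package.

```python
from collections import Counter

STATES = ("PENDING", "RUNNING", "COMPLETED", "FAILED", "CANCELLED")


def summarize_jobs(sacct_text):
    """Count job states and list failed job names from 'name|state' lines."""
    counts = Counter({state: 0 for state in STATES})
    failed = []
    for line in sacct_text.strip().splitlines():
        name, state = line.split("|")[:2]
        # SLURM may report e.g. "CANCELLED by 1234"; keep the first word only.
        words = state.split()
        state = words[0] if words else ""
        if state in counts:
            counts[state] += 1
        if state == "FAILED":
            failed.append(name)
    return dict(counts), failed


# Made-up scheduler output for illustration.
sample = """sub-01|COMPLETED
sub-02|FAILED
sub-03|RUNNING
sub-04|CANCELLED by 1234
sub-05|COMPLETED"""

status_counts, failed_jobs = summarize_jobs(sample)
print(status_counts)
print(failed_jobs)
```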
- meganorm.utils.parallel.collect_results(target_dir, subjects, temp_path, file_name='features', clean=True)[source]
Collects and merges the results of all jobs into a single file.
- Parameters:
target_dir (str) – Path to the target directory where the merged results will be saved.
subjects (dict) – A dictionary with subject names as keys and their corresponding file paths as values.
temp_path (str) – Path to the temporary directory where individual subject result files are stored.
file_name (str, optional) – The name of the file where the merged results will be saved. Default is ‘features’.
clean (bool, optional) – Whether to remove the temporary files after merging the results. Default is True.
- Returns:
This function does not return anything but writes the merged results to a CSV file in the target directory.
- Return type:
None
- meganorm.utils.parallel.progress_bar(current, total, bar_length=20)[source]
Displays or updates a console progress bar.
- Parameters:
current (int) – The current progress (must be between 0 and total).
total (int) – The total steps for complete progress.
bar_length (int, optional) – The character length of the progress bar. Default is 20.
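A console progress bar of this kind is typically drawn by rewriting the same line with a carriage return. The sketch below is one way to do that. The names `render_bar` and `show_progress` and the exact bar characters are assumptions, not the package's output format.

```python
import sys


def render_bar(current, total, bar_length=20):
    """Return a text progress bar such as '[#####---------------] 25%'."""
    fraction = current / total if total else 1.0
    filled = int(bar_length * fraction)
    return "[" + "#" * filled + "-" * (bar_length - filled) + f"] {fraction:.0%}"


def show_progress(current, total, bar_length=20):
    # "\r" returns to the line start so each call overwrites the previous bar.
    sys.stdout.write("\r" + render_bar(current, total, bar_length))
    sys.stdout.flush()
    if current >= total:
        sys.stdout.write("\n")


for step in range(5):
    show_progress(step, 4)
```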
- meganorm.utils.parallel.sbatchfile(mainParallel_path, bash_file_path, log_path=None, module='mne', time='1:00:00', memory='20GB', partition='normal', core=1, node=1, batch_file_name='batch_job', with_config=True)[source]
Generates a batch script file for submission to a job scheduler (e.g., SLURM) for parallel execution.
- Parameters:
mainParallel_path (str) – Path to the mainParallel.py script that will be executed in the batch job.
bash_file_path (str) – Path where the generated batch job file will be saved.
log_path (str, optional) – Path to the log file where output from the job will be saved. Default is None.
module (str, optional) – The module to load in the batch job environment. Default is ‘mne’.
time (str, optional) – Maximum wall time for the job (format: HH:MM:SS). Default is ‘1:00:00’.
memory (str, optional) – Amount of memory allocated for the job (e.g., ‘20GB’). Default is ‘20GB’.
partition (str, optional) – The partition or queue to submit the job to. Default is ‘normal’.
core (int, optional) – Number of CPU cores to allocate for the job. Default is 1.
node (int, optional) – Number of nodes to request for the job. Default is 1.
batch_file_name (str, optional) – Name for the generated batch job file. Default is ‘batch_job’.
with_config (bool, optional) – Whether to include the configuration in the batch file. Default is True.
- Returns:
This function generates a batch script file and saves it to the specified path.
- Return type:
None
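To make the parameters concrete, the following sketch builds the text of a SLURM batch script from similar arguments. It is only an approximation of what such a file could contain. The helper name `build_sbatch_script`, the mapping of `core` to `--cpus-per-task`, and the final `python ... "$@"` line are all assumptions rather than the layout meganorm writes.

```python
def build_sbatch_script(script_path, log_path=None, module="mne",
                        time="1:00:00", memory="20GB", partition="normal",
                        core=1, node=1, job_name="batch_job"):
    """Return the text of a SLURM batch script that runs script_path."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={time}",
        f"#SBATCH --mem={memory}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --cpus-per-task={core}",  # assumed mapping for `core`
        f"#SBATCH --nodes={node}",
    ]
    if log_path is not None:
        lines.append(f"#SBATCH --output={log_path}")
    lines += [
        "",
        f"module load {module}",
        # Forward any arguments given at submission time to the Python script.
        f'python {script_path} "$@"',
    ]
    return "\n".join(lines) + "\n"


script = build_sbatch_script("mainParallel.py", log_path="logs/job.out")
print(script)
```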
- meganorm.utils.parallel.submit_jobs(mainParallel_path, bash_file_path, subjects, temp_path, config_file=None, job_configs=None, progress=False)[source]
Submits jobs for each subject to the SLURM cluster for parallel execution.
- Parameters:
mainParallel_path (str) – Path to the mainParallel.py script that will be executed in the batch job.
bash_file_path (str) – Path where the generated batch job file will be saved.
subjects (dict) – A dictionary of subject names (keys) and their corresponding paths (values). Each subject will have a job submitted to the cluster.
temp_path (str) – Path where temporary files will be stored.
config_file (str, optional) – Path to a JSON configuration file. If provided, this will be passed to the batch job. Default is None.
job_configs (dict, optional) – Dictionary containing job-specific configurations (e.g., memory, time, partition). Defaults to None, in which case default configurations will be used.
progress (bool, optional) – Whether to show a progress bar during job submission. Default is False.
- Returns:
The start time for the batch job submission, formatted as ‘YYYY-MM-DDTHH:MM:SS’.
- Return type:
str