neurobench.datasets
Google Speech Commands
The Google Speech Commands dataset (V2) is commonly used to assess the performance of keyword spotting algorithms. The dataset consists of 105,829 one-second utterances of 35 different words from 2,618 distinct speakers. The data is encoded as linear 16-bit, single-channel, pulse code modulated values at a 16 kHz sampling frequency.
- class neurobench.datasets.speech_commands.SpeechCommands(*args: Any, **kwargs: Any)[source]
Bases:
NeuroBenchDataset, SPEECHCOMMANDS
Speech commands dataset v0.02 with 35 keywords.
Wraps the torchaudio SPEECHCOMMANDS dataset.
- __getitem__(idx)[source]
Getter method for dataset.
- Parameters:
idx (int) – index of sample to return
- Returns:
waveform (torch.Tensor) – waveform of audio sample
label (torch.Tensor) – label index of audio sample
- __init__(path, subset: str | None = None, truncate_or_pad_to_1s=True)[source]
Initializes the SpeechCommands dataset.
- Parameters:
path (str) – path to the root directory of the dataset
subset (str, optional) – one of “training”, “validation”, or “testing”. Defaults to None.
truncate_or_pad_to_1s (bool, optional) – whether to truncate or pad samples to 1s. Defaults to True.
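The effect of truncate_or_pad_to_1s can be illustrated with a minimal sketch. It assumes that short clips are zero-padded at the end and that long clips are cut after one second; the helper name truncate_or_pad is hypothetical, plain Python lists stand in for tensors, and the library's own implementation may differ.

```python
SAMPLE_RATE_HZ = 16_000          # sampling frequency stated for Speech Commands
TARGET_LEN = SAMPLE_RATE_HZ * 1  # number of samples in one second


def truncate_or_pad(waveform, target_len=TARGET_LEN):
    """Return a copy of `waveform` that is exactly `target_len` samples long.

    Assumption: padding uses trailing zeros (hypothetical helper, not the
    NeuroBench implementation).
    """
    if len(waveform) >= target_len:
        return list(waveform[:target_len])
    return list(waveform) + [0.0] * (target_len - len(waveform))


short_clip = [0.1] * 12_000  # 0.75 s of audio
long_clip = [0.1] * 20_000   # 1.25 s of audio
padded = truncate_or_pad(short_clip)
truncated = truncate_or_pad(long_clip)
```

With every sample forced to the same length, samples can be stacked into batches without a custom collate function.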
Prophesee Megapixel Automotive
The Prophesee 1 Megapixel Automotive Detection Dataset was recorded with a high-resolution event camera with a 110-degree field of view mounted on a car windshield. The car was driven in various areas under different daytime weather conditions over several months. The dataset was labeled using the video stream of an additional RGB camera in a semi-automated way, resulting in over 25 million bounding boxes for seven different object classes: pedestrian, two-wheeler, car, truck, bus, traffic sign, and traffic light. The labels are provided at a rate of 60 Hz, and the recording of 14.65 hours is split into 11.19, 2.21, and 2.25 hours for training, validation, and testing, respectively.
- class neurobench.datasets.megapixel_automotive.Gen4DetectionDataLoader(*args: Any, **kwargs: Any)[source]
Bases:
SequentialDataLoader
NeuroBench DataLoader for Gen4 pre-computed dataset.
The default parameters are set for the Gen4 Histograms dataset, which can be downloaded from https://docs.prophesee.ai/stable/datasets.html#precomputed-datasets. To use one of the other pre-computed datasets instead, download it and change the preprocess_function_name and channels parameters accordingly.
Once downloaded, extract the zip folder and set the dataset_path parameter to the path of the extracted folder.
- __init__(dataset_path='data/Gen 4 Histograms', split='testing', batch_size: int = 4, num_tbins: int = 12, preprocess_function_name='histo', delta_t=50000, channels=2, height=360, width=640, max_incr_per_pixel=5, class_selection=['pedestrian', 'two wheeler', 'car'], num_workers=4)[source]
Initializes the Gen4DetectionDataLoader dataloader.
- Parameters:
dataset_path – path to the dataset folder
split – split to use, can be ‘training’, ‘validation’ or ‘testing’
batch_size – batch size
num_tbins – number of time bins in a mini batch
preprocess_function_name – name of the preprocessing function to use, ‘histo’ by default. Can be any of the functions listed under https://docs.prophesee.ai/stable/api/python/ml/preprocessing.html
delta_t – time interval between two consecutive frames
channels – number of channels in the input data, 2 by default for histograms
height – height of the input data
width – width of the input data
max_incr_per_pixel – maximum number of events per pixel
class_selection – list of classes to use
num_workers – number of workers for the dataloader
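To make the roles of num_tbins, delta_t, channels, and max_incr_per_pixel concrete, here is a rough sketch of a clipped event histogram. It is not Prophesee's preprocessing code. It assumes events are (x, y, polarity, t) tuples with t and delta_t in microseconds, that the two channels hold OFF (0) and ON (1) events, and that counts are simply clipped at max_incr_per_pixel; the function name event_histograms is hypothetical and any normalization step is omitted.

```python
def event_histograms(events, num_tbins, delta_t, height, width, max_incr_per_pixel):
    """Count events per (time bin, polarity, row, column), clipping each count.

    Assumed event layout: (x, y, polarity, t) with t in the same unit as delta_t.
    """
    frames = [
        [[[0] * width for _ in range(height)] for _ in range(2)]
        for _ in range(num_tbins)
    ]
    for x, y, polarity, t in events:
        tbin = t // delta_t
        if 0 <= tbin < num_tbins:  # ignore events outside the window
            row = frames[tbin][polarity][y]
            row[x] = min(row[x] + 1, max_incr_per_pixel)
    return frames


# Duration covered by one mini-batch with the default parameters
# (assuming delta_t is expressed in microseconds).
window_us = 12 * 50_000

# Tiny 4x4 sensor for illustration: seven ON events on one pixel in the first
# bin, one OFF event in the second bin, and one event past the window.
events = [(1, 2, 1, 1_000 * i) for i in range(7)]
events += [(3, 0, 0, 60_000), (0, 0, 1, 700_000)]
frames = event_histograms(events, num_tbins=12, delta_t=50_000,
                          height=4, width=4, max_incr_per_pixel=5)
```

The result has one frame per time bin, each with two polarity channels of size height by width, which mirrors the channels=2 default described for histograms.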
Nonhuman Primate Reaching
The Nonhuman Primate Reaching dataset consists of multi-channel recordings obtained from the sensorimotor cortex of two non-human primates (NHP) during self-paced reaching movements towards a grid of targets. The variable x is represented by threshold crossing times (or spike times) and sorted units for each of the recording channels. The target y is represented by 2-dimensional position coordinates of the fingertip of the reaching hand, sampled at a frequency of 250 Hz. The complete dataset contains 37 sessions spanning 10 months for NHP-1 and 10 sessions from NHP-2 spanning one month. For this study, three sessions from each NHP were selected to include the entire recording duration, resulting in a total of 6774 seconds of data.
This file contains code from PyTorch Vision (https://github.com/pytorch/vision) which is licensed under BSD 3-Clause License. These snippets are the Copyright (c) of Soumith Chintala 2016. All other code is the Copyright (c) of the NeuroBench Developers 2023.
- class neurobench.datasets.primate_reaching.PrimateReaching(*args: Any, **kwargs: Any)[source]
Bases:
NeuroBenchDataset
Dataset for the Primate Reaching Task.
The Dataset can be downloaded from the following website: https://zenodo.org/record/583331
For this task, the following files are selected:
1. indy_20170131_02.mat
2. indy_20160630_01.mat
3. indy_20160622_01.mat
4. loco_20170301_05.mat
5. loco_20170215_02.mat
6. loco_20170210_03.mat
The description of the structure of the dataset can be found on the website in the section: Variable names.
Once these .mat files are downloaded, store them in the same directory.
- __init__(file_path, filename, num_steps, train_ratio=0.8, label_series=False, biological_delay=0, spike_sorting=False, stride=0.004, bin_width=0.028, max_segment_length=2000, split_num=1, remove_segments_inactive=False, download=True)[source]
Initialises the Dataset for the Primate Reaching Task.
- Parameters:
file_path (str) – The path to the directory storing the matlab files.
filename (str) – The name of the file that will be loaded.
num_steps (int) – Number of consecutive timesteps that are included per sample. In the real-time case, this should be 1.
train_ratio (float) – ratio for how the dataset will be split into training/(val+test) set. Default is 0.8 (80% of data is training).
label_series (bool) – Whether the labels are series or not. Useful for training with multiple timesteps. Default is False.
biological_delay (int) – How many steps of delay is to be applied to the dataset. Default is 0 i.e. no delay applied.
spike_sorting (bool) – Apply spike sorting for processing raw spike data. Default is False.
stride (float) – How many steps are taken when moving the bin_window. Default is 0.004 (4ms).
bin_width (float) – The size of the bin_window. Default is 0.028 (28ms).
max_segment_length – Define the upper limits of a segment. Default is 2000 data points (8s)
split_num (int) – The number of chunks to break the timeseries into. Default is 1 (no splits).
remove_segments_inactive (bool) – Whether to remove segments longer than max_segment_length, which represent subject inactivity. Default is False.
download (bool) – If True, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it will not be downloaded again.
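The interaction between stride and bin_width can be sketched as a sliding-window spike count. This is an illustrative sketch only: it assumes both parameters are in seconds, that a window of length bin_width ends at each multiple of stride, and that a spike belongs to a window if it falls in the half-open interval (end - bin_width, end]. The function name bin_spikes is hypothetical and the library may bin differently.

```python
def bin_spikes(spike_times, duration, stride=0.004, bin_width=0.028):
    """Count spikes in a window of `bin_width` seconds that advances by `stride`.

    `spike_times` and `duration` are in seconds (assumed units).
    """
    n_steps = int(round(duration / stride))
    counts = []
    for k in range(1, n_steps + 1):
        window_end = k * stride
        window_start = window_end - bin_width
        counts.append(sum(1 for t in spike_times if window_start < t <= window_end))
    return counts


# With the defaults, consecutive windows overlap: each spike is seen by
# bin_width / stride = 7 successive windows.
windows_per_spike = round(0.028 / 0.004)

spike_times = [0.010, 0.0305]  # two spikes on one channel, in seconds
counts = bin_spikes(spike_times, duration=0.040)
```

Because the window is longer than the stride, neighbouring timesteps share most of their spikes, which smooths the input at the cost of redundancy between steps.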
- apply_delay()[source]
Shift the labels by the delay to account for the biological delay between spikes and movement onset.
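One way to picture the label shift is sketched below. It assumes that neural activity at step t is paired with the label at step t + delay, and that the unmatched ends of both sequences are dropped; the helper shift_labels is hypothetical and is not the library's apply_delay implementation.

```python
def shift_labels(samples, labels, delay):
    """Pair samples[t] with labels[t + delay], trimming the unmatched ends.

    Assumption: spikes precede the movement they encode by `delay` steps.
    """
    if delay == 0:
        return list(samples), list(labels)
    return list(samples[:-delay]), list(labels[delay:])


samples = ["s0", "s1", "s2", "s3", "s4"]
labels = ["y0", "y1", "y2", "y3", "y4"]
delayed_samples, delayed_labels = shift_labels(samples, labels, delay=2)
```

After the shift both sequences are shorter by delay entries, so index i of one still lines up with index i of the other.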
- load_data()[source]
Load the data from the matlab file and spike data if spike data has been processed and stored already.
- md5s = {'indy_20160622_01.mat': 'c33d5fff31320d709d23fe445561fb6e', 'indy_20160630_01.mat': '197413a5339630ea926cbd22b8b43338', 'indy_20170131_02.mat': '2790b1c869564afaa7772dbf9e42d784', 'loco_20170210_03.mat': '4cae63b58c4cb9c8abd44929216c703b', 'loco_20170215_02.mat': '739b70762d838f3a1f358733c426bb02', 'loco_20170301_05.mat': '47342da09f9c950050c9213c3df38ea3'}
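The checksums above can be used to confirm that a manually downloaded file is intact. The following is a minimal sketch using only the standard library; the helper names file_md5 and matches_expected are hypothetical, and the demonstration runs on a throwaway temporary file rather than on a real .mat recording.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return file_md5(path) == expected_md5.lower()


# Demonstration on a temporary file (stands in for a downloaded .mat file).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
    tmp_path = tmp.name

expected = hashlib.md5(b"example bytes").hexdigest()
ok = matches_expected(tmp_path, expected)
# The checksum listed for indy_20160622_01.mat does not match this file.
bad = matches_expected(tmp_path, "c33d5fff31320d709d23fe445561fb6e")
os.remove(tmp_path)
```

For a real file, the expected value would be looked up in the md5s dictionary by filename.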
- remove_segments_by_length()[source]
Remove segments whose duration exceeds the limit set by max_segment_length.
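As a rough illustration of length-based filtering, the sketch below assumes segments are (start, end) index pairs and that a segment is dropped when its length in data points is strictly greater than max_segment_length. The helper drop_long_segments is hypothetical and does not reproduce the library's internal data layout.

```python
def drop_long_segments(segments, max_segment_length=2000):
    """Keep only (start, end) segments no longer than `max_segment_length` points."""
    return [
        (start, end)
        for start, end in segments
        if end - start <= max_segment_length
    ]


segments = [(0, 500), (500, 3200), (3200, 4100)]  # lengths 500, 2700, 900
kept = drop_long_segments(segments)

# 2000 data points at 0.004 s per point correspond to the 8 s quoted for
# max_segment_length (assuming one data point per stride).
max_duration_s = 2000 * 0.004
```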
- static split_into_segments(indices, last_idx)[source]
Combine the start and end index into a NumPy array.
- url = 'https://zenodo.org/record/583331/files/'
Mackey-Glass
The Mackey-Glass dataset is synthetic: it is a time series generated by a one-dimensional nonlinear time-delay differential equation, where the evolution of the signal can be altered by a number of different parameters. These parameters are defined in NeuroBench.
- class neurobench.datasets.mackey_glass.MackeyGlass(*args: Any, **kwargs: Any)[source]
Bases:
Dataset
Dataset for the Mackey-Glass task.
- __getitem__(idx)[source]
Getter method for dataset.
- Parameters:
idx (int or tensor) – index(s) of sample(s) to return
- Returns:
sample (tensor) – individual data sample, shape=(timestamps, features)=(1,1)
target (tensor) – corresponding next state of the system, shape=(label,)=(1,)
- __init__(file_path=None, tau=17, lyaptime=197, constant_past=0.7206597, nmg=10, beta=0.2, gamma=0.1, pts_per_lyaptime=75, traintime=10.0, testtime=10.0, start_offset=0.0, seed_id=0, bin_window=1, download=True)[source]
Initializes the Mackey-Glass dataset.
- Parameters:
file_path (str) – path to .npy file containing Mackey-Glass time-series. If this is provided, then tau, lyaptime, constant_past, nmg, beta, gamma are ignored.
tau (float) – parameter of the Mackey-Glass equation
lyaptime (float) – Lyapunov time of the time-series
constant_past (float) – initial condition for the solver
nmg (float) – parameter of the Mackey-Glass equation
beta (float) – parameter of the Mackey-Glass equation
gamma (float) – parameter of the Mackey-Glass equation
pts_per_lyaptime (int) – number of points to sample per one Lyapunov time
traintime (float) – number of Lyapunov times to be used for training a model
testtime (float) – number of Lyapunov times to be used for testing a model
start_offset (int) – added offset in number of points to shift the timeseries forward
seed_id (int) – seed for generating function solution
bin_window (int) – number of points forming lookback window for each prediction
download (bool) – If True, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it will not be downloaded again.
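To show how tau, beta, gamma, nmg, and constant_past enter the Mackey-Glass equation, dx/dt = beta * x(t - tau) / (1 + x(t - tau)^nmg) - gamma * x(t), here is a minimal sketch that integrates it with a fixed-step forward Euler scheme. Euler integration is chosen here for brevity and is an assumption; it is not claimed to be the solver NeuroBench uses, so the values will not match the shipped time series. The function name mackey_glass_euler and the step size dt are hypothetical.

```python
def mackey_glass_euler(n_points, tau=17.0, beta=0.2, gamma=0.1, nmg=10,
                       constant_past=0.7206597, dt=0.1):
    """Integrate the Mackey-Glass delay equation with forward Euler.

    The history for t in [-tau, 0] is held at `constant_past`.
    """
    delay_steps = int(round(tau / dt))
    history = [constant_past] * (delay_steps + 1)
    for _ in range(n_points):
        x_now = history[-1]
        x_delayed = history[-1 - delay_steps]  # value tau time units earlier
        dx = beta * x_delayed / (1.0 + x_delayed ** nmg) - gamma * x_now
        history.append(x_now + dt * dx)
    return history[delay_steps + 1:]


series = mackey_glass_euler(1000)

# Number of training points implied by the defaults, assuming the count is
# traintime (in Lyapunov times) multiplied by pts_per_lyaptime.
train_points = int(10.0 * 75)
# Time between consecutive samples implied by lyaptime and pts_per_lyaptime.
sample_step = 197 / 75
```

The delayed term is what makes the system chaotic for tau=17: the derivative at time t depends on the state tau time units earlier, not only on the current state.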
Multi-Lingual Spoken Word Corpus
MLCommons Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours).
The NeuroBench harness does not use the full MSWC dataset. For more information on the subset used, see the NeuroBench paper.
- class neurobench.datasets.MSWC_dataset.MSWC(root: str | Path, subset: str | None = None, procedure: str | None = None, language: str | None = None, incremental: bool | None = False, download=True)[source]
Bases:
Dataset
Subset version (https://huggingface.co/datasets/NeuroBench/mswc_fscil_subset) of the original MSWC dataset (https://mlcommons.org/en/multilingual-spoken-words/) for a few-shot class-incremental learning (FSCIL) task consisting of 200 voice commands keywords:
- 100 base classes available for pre-training with:
  - 500 train samples
  - 100 validation samples
  - 100 test samples
- 100 evaluation classes to do class-incremental learning on with 200 samples each.
The subset of data used for this task, as well as the supporting files for base class and incremental splits, is hosted on Huggingface at the first link above.
The data is given in 48kHz opus format. Converted 16kHz wav files are available to download at the link above.
- __getitem__(index: int) Tuple[Tensor, int] [source]
Getter method to get waveform samples.
- Parameters:
index (int) – Index of the sample.
- Returns:
sample (tensor) – Individual waveform sample, padded to always match dimension (48000, 1).
target (int) – Corresponding keyword index based on FSCIL_KEYWORDS order (by decreasing number of samples in original dataset).
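The target index described above follows an ordering by decreasing sample count. A small sketch of how such an index could be derived is shown below; the keywords and counts are made up for illustration, and the variable names are hypothetical rather than taken from the library's FSCIL_KEYWORDS list.

```python
# Hypothetical keyword -> sample count table (illustrative values only).
sample_counts = {"uno": 120, "hello": 300, "danke": 210}

# Order keywords by decreasing number of samples, then number them from 0.
keywords = sorted(sample_counts, key=sample_counts.get, reverse=True)
keyword_to_index = {word: i for i, word in enumerate(keywords)}
```

Under this scheme the most frequent keyword always receives index 0, so label indices stay stable as long as the keyword list is fixed.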
- __init__(root: str | Path, subset: str | None = None, procedure: str | None = None, language: str | None = None, incremental: bool | None = False, download=True)[source]
Initialization will create the new base eval splits if needed.
- Parameters:
root (str) – Path of the data root folder that contains, or will contain, the MSWC/ folder with the dataset.
subset (str) – Return “base” or “evaluation” classes.
procedure (str) – For base subset, return “training”, “testing” or “validation” samples.
language (str) – Language to use for evaluation task.
download (bool) – If True, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it will not be downloaded again.
- class neurobench.datasets.MSWC_dataset.MSWC_query(walker)[source]
Bases:
Dataset
Simple Dataset object created for incremental queries.
- __getitem__(index: int)[source]
Getter method to get waveform samples.
- Parameters:
index (int) – Index of the sample.
- Returns:
sample (tensor) – Individual waveform sample, padded to always match dimension (1, 48000).
target (int) – Corresponding keyword index based on FSCIL_KEYWORDS order (by decreasing number of samples in original dataset).
Wireless Sensor Data Mining
The “WISDM Smartphone and Smartwatch Activity and Biometrics Dataset” includes data collected from 51 subjects, each of whom was asked to perform 18 tasks for 3 minutes each. Each subject wore a smartwatch on their dominant hand and carried a smartphone in their pocket. Data collection was controlled by a custom-made app that ran on the smartphone and smartwatch. Sensor data was collected from the accelerometer and gyroscope on both the smartphone and smartwatch, yielding four sensors in total, at a rate of 20 Hz (i.e., every 50 ms).
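The figures above imply a nominal amount of data per recording, which the following back-of-the-envelope sketch works out. These are upper bounds derived only from the stated numbers; the actual files may contain fewer readings if a recording was cut short, and the variable names are illustrative.

```python
SAMPLE_RATE_HZ = 20       # stated sampling rate
TASK_DURATION_S = 3 * 60  # each task lasts 3 minutes
NUM_SUBJECTS = 51
NUM_TASKS = 18
NUM_SENSORS = 4           # accelerometer and gyroscope on phone and watch

sample_period_ms = 1000 / SAMPLE_RATE_HZ
readings_per_task_per_sensor = SAMPLE_RATE_HZ * TASK_DURATION_S
nominal_total_readings = (
    NUM_SUBJECTS * NUM_TASKS * NUM_SENSORS * readings_per_task_per_sensor
)
```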