neurobench.datasets
-------------------

Google Speech Commands
^^^^^^^^^^^^^^^^^^^^^^

The Google Speech Commands dataset (V2) is a commonly used dataset in assessing the performance of keyword spotting algorithms. 
The dataset consists of 105,829 1 second utterances of 35 different words from 2,618 distinct speakers. The data is encoded 
as linear 16-bit, single-channel, pulse code modulated values, at a 16 kHz sampling frequency.

.. automodule:: neurobench.datasets.speech_commands
    :special-members: __init__, __getitem__
    :members:
    :undoc-members:
    :show-inheritance:


Prophesee Megapixel Automotive
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Prophesee 1 Megapixel Automotive Detection Dataset was recorded with a high-resolution event camera with a 110 degree field
of view mounted on a car windshield. The car was driven in various areas under different daytime weather conditions over several 
months. The dataset was labeled using the video stream of an additional RGB camera in a semi-automated way, resulting in over 25 
million bounding boxes for seven different object classes: pedestrian, two-wheeler, car, truck, bus, traffic sign, and traffic light.
The labels are provided at a rate of 60Hz, and the recording of 14.65 hours is split into 11.19, 2.21, and 2.25 hours for training, 
validation, and testing, respectively.

.. automodule:: neurobench.datasets.megapixel_automotive
    :special-members: __init__, __getitem__
    :members:
    :undoc-members:
    :show-inheritance:


Nonhuman Primate Reaching
^^^^^^^^^^^^^^^^^^^^^^^^^

The Nonhuman Primate reaching Dataset consists of multi-channel recordings obtained from the sensorimotor cortex of two non-human primates
(NHP) during self-paced reaching movements towards a grid of targets. The variable x is represented by threshold crossing times
(or spike times) and sorted units for each of the recording channels. The target y is represented by 2-dimensional position coordinates
of the fingertip of the reaching hand, sampled at a frequency of 250 Hz. The complete dataset contains 37 sessions spanning 10 months 
for NHP-1 and 10 sessions from NHP-2 spanning one month. For this study, three sessions from each NHP were selected to include the
entire recording duration, resulting in a total of 6774 seconds of data.

.. automodule:: neurobench.datasets.primate_reaching
    :special-members: __init__, __getitem__
    :members:
    :undoc-members:
    :show-inheritance:


Mackey-Glass
^^^^^^^^^^^^

The Mackey Glass dataset is synthetic and consists of a one-dimensional non-linear time delay differential equation, where 
the evolution of the signal can be altered by a number of different parameters. These parameters are defined in NeuroBench.

.. automodule:: neurobench.datasets.mackey_glass
    :special-members: __init__, __getitem__
    :members:
    :undoc-members:
    :show-inheritance:


Multi-Lingual Spoken Word Corpus
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

MLCommons Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages for academic
research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains
more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours).

The NeuroBench harness does not use the full MSWC dataset. For more information on the subset used, see the NeuroBench paper.

.. automodule:: neurobench.datasets.MSWC_dataset
    :special-members: __init__, __getitem__
    :members:
    :undoc-members:
    :show-inheritance:


Wireless Sensor Data Mining
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The "WISDM Smartphone and Smartwatch Activity and Biometrics Dataset" includes data
collected from 51 subjects, each of whom were asked to perform 18 tasks for 3 minutes
each. Each subject had a smartwatch placed on his/her dominant hand and a smartphone
in their pocket. The data collection was controlled by a custom-made app that ran
on the smartphone and smartwatch. The sensor data that was collected was from the
accelerometer and gyrocope on both the smartphone and smartwatch, yielding four 
total sensors. The sensor data was collected at a rate of 20 Hz (i.e., every 50ms).

.. automodule:: neurobench.datasets.WISDM
    :special-members: __init__, __getitem__
    :members:
    :undoc-members:
    :show-inheritance:


EEG MI
^^^^^^^
Preprocessed EEG Motor Imagery (MI) dataset derived from the Lee2019 dataset
(Lee et al., 2019, "EEG dataset and OpenBMI toolbox for three BCI paradigms:
An investigation into BCI illiteracy"), adapted for the THOR challenge.
Lee2019 consists of recorded EEG from 54 subjects performing left-hand and 
right-hand motor imagery tasks using a 62-channel cap at 1000 Hz.
The version of the dataset used in NeuroBench has been preprocessed to include 
62 channels with a sampling rate of 100 Hz, and is organized into 2.5-second 
trials for each subject. Both sessions of the original dataset are included, 
with 100 trials per session, resulting in a total of 200 trials per subject.


.. automodule:: neurobench.datasets.EEG
    :special-members: __init__, __getitem__
    :members:
    :undoc-members:
    :show-inheritance: