Custom Metrics

This guide explains how to create custom metrics using the NeuroBench framework. Metrics are categorized into two types: Static Metrics and Workload Metrics. Each serves a distinct purpose, and this document provides examples of defining both.

Metric Types

The following metric types are available:

  • Static Metrics: Evaluate fixed properties of the model, such as the number of parameters.

  • Workload Metrics: Evaluate the model’s performance during inference execution.

By default, workload metric results are averaged across batched inference. Metrics such as classification accuracy can be aggregated this way.

Other workload metrics depend on the full set of data and inferences and cannot be averaged across batches, such as an R^2 score. These should be AccumulatedMetrics, a subclass of workload metrics that stores results over multiple batches and computes the final metric value once all data has been processed.
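To see why such metrics cannot be batch-averaged, consider a quick sketch (using NumPy and synthetic data, purely for illustration): the mean of per-batch R^2 scores generally differs from the R^2 computed over all predictions at once, because each batch uses its own mean in the denominator.

import numpy as np

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
y_true = rng.normal(size=100)
y_pred = y_true + rng.normal(scale=0.5, size=100)

global_r2 = r2(y_true, y_pred)  # R^2 over the full dataset
batch_r2 = np.mean([r2(t, p)    # mean of per-batch R^2 over 4 batches
                    for t, p in zip(np.split(y_true, 4),
                                    np.split(y_pred, 4))])

print(global_r2, batch_r2)  # the two values generally differ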

Static Metric Abstract Base Class

class neurobench.metrics.abstract.static_metric.StaticMetric[source]

Bases: ABC

Abstract base class for static metrics.

A static metric computes a value based on the model, and is designed to operate without the need for batch processing or iterative accumulation over multiple samples. This metric is expected to return a single value when called, which is typically computed over the entire model.

abstract __call__(model: NeuroBenchModel) → float[source]

Compute the static metric for the given model.

This abstract method must be implemented by any subclass to define how the metric should be computed based on the model.

Parameters:

model (NeuroBenchModel) – The model whose performance or properties the metric evaluates.

Returns:

The computed value of the metric for the model.

Return type:

float

Notes

  • The method should return a single float value representing the result of the metric computation. It does not require batch processing or tracking multiple samples.

  • This is intended for metrics that compute a single, static evaluation of a model, such as those based on its architecture, weights, or overall performance.
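As a quick illustration of the kind of metric this class is meant for, the sketch below reports a model's parameter footprint in bytes. It is a hypothetical example and assumes the model exposes an iterable of torch parameter tensors via parameters(); adapt the accessor to your NeuroBenchModel wrapper if it differs.

from neurobench.metrics.abstract import StaticMetric

class FootprintBytesMetric(StaticMetric):
    """Hypothetical static metric: total parameter storage in bytes."""

    def __call__(self, model):
        # element_size() gives the bytes per element of each parameter tensor
        return float(sum(p.numel() * p.element_size() for p in model.parameters()))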

Workload Metric Abstract Base Class

class neurobench.metrics.abstract.workload_metric.WorkloadMetric(requires_hooks: bool = False)[source]

Bases: ABC

Abstract base class for workload metrics.

A workload metric is designed to evaluate some aspect of a model’s performance or behavior, typically during the inference phase, based on its predictions and input data. This class defines the basic interface for all workload metrics that require computation over batches of data.

requires_hooks

Flag indicating if the metric requires hooks for its computation.

Type:

bool

abstract __call__(model: NeuroBenchModel, preds: torch.Tensor, data: tuple[torch.Tensor, torch.Tensor]) → float[source]

Compute the workload metric.

This method must be implemented by any subclass to define how the metric should be computed based on the model, predictions, and data.

Parameters:
  • model (NeuroBenchModel) – The model whose performance is being evaluated.

  • preds (Tensor) – A tensor of model predictions.

  • data (tuple[Tensor, Tensor]) – A tuple containing the input data and the true labels.

Returns:

The computed value of the workload metric.

Return type:

float

__init__(requires_hooks: bool = False)[source]

Initialize the WorkloadMetric.

Parameters:

requires_hooks (bool, default=False) – Flag indicating if the metric requires hooks for its computation.

Accumulated Metric Abstract Base Class

class neurobench.metrics.abstract.workload_metric.AccumulatedMetric(requires_hooks: bool = False)[source]

Bases: WorkloadMetric

Abstract base class for accumulated workload metrics.

An accumulated metric tracks values over multiple batches or iterations and computes the final metric value after accumulating data. It extends the WorkloadMetric class and adds functionality for resetting and computing the accumulated metric over time.

__init__(requires_hooks: bool = False)[source]

Initialize the AccumulatedMetric.

Parameters:

requires_hooks (bool, default=False) – Flag indicating if the metric requires hooks for its computation.

abstract compute() → float[source]

Compute the accumulated metric.

This method must be implemented by any subclass to compute the accumulated value of the metric, typically after processing multiple batches.

Returns:

The computed accumulated metric value.

Return type:

float

abstract reset() → None[source]

Reset the accumulated state.

This method must be implemented by any subclass to reset the metric’s accumulated state.
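Together, the three methods define a simple lifecycle: the harness calls the metric once per batch to accumulate state, calls compute() for the final value, and calls reset() before the metric is reused. The toy subclass below (a made-up BatchCounter that simply counts batches, with None standing in for the model, predictions, and data) exercises that lifecycle:

from neurobench.metrics.abstract import AccumulatedMetric

class BatchCounter(AccumulatedMetric):
    """Toy accumulated metric: counts how many batches were processed."""

    def __init__(self):
        super().__init__()
        self.batches = 0

    def __call__(self, model, preds, data):
        self.batches += 1           # accumulate per-batch state
        return float(self.batches)

    def compute(self):
        return float(self.batches)

    def reset(self):
        self.batches = 0

metric = BatchCounter()
for _ in range(3):                  # stand-in for iterating a dataloader
    metric(None, None, (None, None))
print(metric.compute())             # 3.0
metric.reset()                      # clean state before the next run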

Defining Custom Metrics

Here’s how you can define your own metrics for each type:

Static Metrics

Static metrics are used to evaluate fixed properties of a model, such as the total number of parameters.

Example:
from neurobench.metrics.abstract import StaticMetric

class ParameterCountMetric(StaticMetric):
    """
    Metric to count the total number of parameters in a model.
    """

    def __call__(self, model):
        return sum(p.numel() for p in model.parameters())
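For a quick sanity check outside the harness, the metric can be called directly. The snippet below uses a bare torch module as a stand-in and assumes the model passed to the metric exposes parameters() like a torch.nn.Module:

import torch.nn as nn

net = nn.Linear(10, 2)              # stand-in for a wrapped NeuroBench model
print(ParameterCountMetric()(net))  # 10 * 2 weights + 2 biases = 22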

Workload Metrics

Workload metrics that are not accumulated evaluate the model’s performance on each batch of input data, and the results are averaged across batches.

Example:
from neurobench.metrics.abstract import WorkloadMetric

class AccuracyMetric(WorkloadMetric):
    """
    Metric to compute accuracy for a single batch.
    """

    def __call__(self, model, preds, data):
        inputs, labels = data
        correct = (preds.argmax(dim=1) == labels).sum().item()
        return correct / len(labels) * 100
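The per-batch behavior can be checked with synthetic tensors; since this metric never touches the model or inputs, None and zeros are passed as placeholders:

import torch

preds = torch.tensor([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
labels = torch.tensor([0, 1, 1])
inputs = torch.zeros(3, 2)  # placeholder; unused by this metric

metric = AccuracyMetric()
print(metric(None, preds, (inputs, labels)))  # 2 of 3 correct -> ~66.67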

Accumulated Metrics

Accumulated metrics are workload metrics that store performance information over multiple batches and compute a final result. They should be used when the metric must not be averaged across batches.

Example:
from neurobench.metrics.abstract import AccumulatedMetric
import torch.nn.functional as F

class MaximumLossMetric(AccumulatedMetric):
    """
    Metric to compute the maximum per-batch loss over multiple batches.
    """

    def __init__(self):
        super().__init__()
        self.max_loss = 0.0
        self.num_batches = 0

    def __call__(self, model, preds, data):
        _, labels = data
        # Cross-entropy is used here as an example loss function.
        loss = F.cross_entropy(preds, labels).item()
        if loss > self.max_loss:
            self.max_loss = loss
        self.num_batches += 1
        return self.compute()  # return the running value for this batch

    def compute(self):
        return self.max_loss

    def reset(self):
        self.max_loss = 0.0
        self.num_batches = 0
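Driving the metric over two synthetic batches shows the accumulate/compute/reset cycle; the random logits and labels here are made up for illustration:

import torch

metric = MaximumLossMetric()
for logits, labels in [
    (torch.randn(4, 3), torch.randint(0, 3, (4,))),
    (torch.randn(4, 3), torch.randint(0, 3, (4,))),
]:
    metric(None, logits, (None, labels))  # inputs are unused by this metric
print(metric.compute())  # largest per-batch loss seen so far
metric.reset()           # clear state before the next evaluation run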

Plugging Metrics into the Benchmark

The Benchmark class expects metrics to be passed as lists grouped by type. You also need to configure any postprocessors if required by your model.

Here’s an example of integrating both built-in and custom metrics into the Benchmark class:

Example:
from neurobench.benchmarks import Benchmark
from neurobench.postprocessors import ChooseMaxCount

# Define datasets and the model
model = Model()  # Replace with your model instance
test_set_loader = DataLoader()  # Replace with your DataLoader instance

# Define postprocessors (if applicable)
postprocessors = [ChooseMaxCount()]

# Define metrics
static_metrics = [ParameterCountMetric] # Replace with your custom static metric. Do not initialize the classes.
workload_metrics = [AccuracyMetric, MaximumLossMetric]  # Replace with your custom workload and accumulated metrics. Do not initialize the classes.

# Create the Benchmark instance
benchmark = Benchmark(
    model=model,
    dataloader=test_set_loader,
    postprocessors=postprocessors,
    metrics=[static_metrics, workload_metrics]
)

# Run the benchmark
results = benchmark.run(verbose=True)

# Access results
print(results)

Summary of Custom Metrics

The following table summarizes the available metric types:

Metric Types

Metric Type          Base Class          Key Methods                Use Case
Static Metric        StaticMetric        __call__                   Evaluate static properties of the model.
Workload Metric      WorkloadMetric      __call__                   Per-batch results, averaged over batched inference.
Accumulated Metric   AccumulatedMetric   __call__, compute, reset   Accumulates performance info over multiple batches.