Benchmarks

Benchmark objects aim to provide a higher-level interface for recreating the out-of-distribution (OOD) detection benchmarks used in the literature.

API

Each benchmark implements a common interface.

Note

This interface is currently a draft and is likely to change.

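# Benchmark, Detector, OtherDetector, root and model are placeholders.
benchmark = Benchmark(root)
detector1 = Detector(model)
detector1.fit(benchmark.train_set())
detector2 = OtherDetector(model)
detector2.fit(benchmark.train_set())

results1 = benchmark.evaluate(detector1)
results2 = benchmark.evaluate(detector2)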
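
Because one benchmark object can be reused across detectors, the comparison above can be wrapped in a small loop. The following is a minimal sketch, not part of the library: `compare_detectors` is a hypothetical helper that relies only on the `evaluate(detector)` call shown above and assumes the detectors have already been fitted.

```python
from typing import Any, Dict, List


def compare_detectors(benchmark: Any, detectors: Dict[str, Any]) -> Dict[str, List[Dict]]:
    """Evaluate several fitted detectors on the same benchmark.

    Hypothetical helper: it only assumes that ``benchmark.evaluate(detector)``
    returns a list of result dictionaries.
    """
    results: Dict[str, List[Dict]] = {}
    for name, detector in detectors.items():
        # Each detector sees exactly the same test datasets.
        results[name] = benchmark.evaluate(detector)
    return results
```

The returned dictionary maps each detector name to its own result list, which keeps the per-dataset entries of different detectors separate.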

class pytorch_ood.benchmark.Benchmark

Base class for benchmarks.

abstract evaluate(detector: Detector, *args, **kwargs) → List[Dict]: Evaluates the given detector on all datasets and returns a list with the results.

abstract test_sets(known=True, unknown=True) → List[Dataset]

Returns the list of test datasets. If both known and unknown are true, each dataset contains both in-distribution (ID) and OOD data.

Parameters:

known – include ID data
unknown – include OOD data

abstract train_set() → Dataset: Returns the training dataset.
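
To make the shape of this interface concrete, the sketch below mirrors the three abstract methods with a stand-in base class and a toy subclass. Everything in it is hypothetical: `BenchmarkInterface`, `ToyBenchmark`, the plain lists used in place of `Dataset` objects, and the `"Dataset"` and `"Accuracy"` result keys are invented for illustration. A real benchmark would subclass `pytorch_ood.benchmark.Benchmark` instead.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Tuple

Sample = Tuple[float, bool]  # (value, is_ood): toy stand-in for a dataset item


class BenchmarkInterface(ABC):
    """Stand-in mirroring the documented abstract methods (not the library class)."""

    @abstractmethod
    def train_set(self) -> List[Sample]:
        ...

    @abstractmethod
    def test_sets(self, known: bool = True, unknown: bool = True) -> List[List[Sample]]:
        ...

    @abstractmethod
    def evaluate(self, detector: Callable[[float], float], *args, **kwargs) -> List[Dict]:
        ...


class ToyBenchmark(BenchmarkInterface):
    """Toy benchmark with one ID test set and two invented outlier sets."""

    def __init__(self) -> None:
        self._train = [(0.0, False), (0.3, False)]
        self._known = [(0.1, False), (0.2, False)]
        self._outliers = {"ToyOOD-A": [(5.0, True)], "ToyOOD-B": [(7.0, True), (9.0, True)]}
        self.ood_names = list(self._outliers)

    def train_set(self) -> List[Sample]:
        return list(self._train)

    def test_sets(self, known: bool = True, unknown: bool = True) -> List[List[Sample]]:
        sets = []
        for name in self.ood_names:
            data: List[Sample] = []
            if known:
                data.extend(self._known)
            if unknown:
                data.extend(self._outliers[name])
            sets.append(data)
        return sets

    def evaluate(self, detector, *args, **kwargs) -> List[Dict]:
        results = []
        for name, data in zip(self.ood_names, self.test_sets()):
            # A sample counts as correct if its thresholded score matches its OOD label.
            correct = sum((detector(value) > 0.5) == is_ood for value, is_ood in data)
            results.append({"Dataset": name, "Accuracy": correct / len(data)})
        return results


benchmark = ToyBenchmark()
# The "detector" here is just a function returning a high score for large values.
results = benchmark.evaluate(lambda value: float(value > 1.0))
```

The toy `evaluate` produces one result dictionary per outlier dataset, which matches the `List[Dict]` return type documented above. The metric it computes is a placeholder, not what the library reports.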

Image

Examples can be found here

CIFAR 10

ODIN Benchmark

class pytorch_ood.benchmark.CIFAR10_ODIN(root, transform)

Replicates the OOD detection benchmark from the ODIN paper for CIFAR 10.

See paper: ArXiv

Outlier datasets are:

TinyImageNetCrop
TinyImageNetResize
LSUNResize
LSUNCrop
Uniform
Gaussian

Parameters:

root – where to store datasets
transform – transform to apply to images

evaluate(detector: Detector, loader_kwargs: Dict | None = None, device: str = 'cpu') → List[Dict]

Evaluates the given detector on all datasets and returns a list with the results.

Parameters:

detector – the detector to evaluate
loader_kwargs – keyword arguments to give to the data loader
device – the device to move batches to

ood_names: List[str]: Names of the OOD datasets

test_sets(known=True, unknown=True) → List[Dataset]

Returns the list of test datasets. If both known and unknown are true, each dataset contains both ID and OOD data.

Parameters:

known – include ID data
unknown – include OOD data

train_set() → Dataset: Returns the training dataset.
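
The `loader_kwargs` and `device` parameters of `evaluate` can be set together in one place. The following is a minimal sketch under stated assumptions: `evaluate_with_loader` is a hypothetical helper, and it assumes the data loader accepts the usual PyTorch `DataLoader` options `batch_size` and `num_workers`. The helper works with any object exposing the `evaluate` signature documented above.

```python
from typing import Any, Dict, List


def evaluate_with_loader(
    benchmark: Any,
    detector: Any,
    batch_size: int = 128,
    num_workers: int = 0,
    device: str = "cpu",
) -> List[Dict]:
    """Call ``benchmark.evaluate`` with explicit data loader settings.

    Hypothetical helper; ``batch_size`` and ``num_workers`` are assumed to be
    valid keyword arguments for the benchmark's data loader.
    """
    loader_kwargs = {"batch_size": batch_size, "num_workers": num_workers}
    return benchmark.evaluate(detector, loader_kwargs=loader_kwargs, device=device)
```

With a benchmark such as `CIFAR10_ODIN`, a call might look like `evaluate_with_loader(benchmark, detector, batch_size=64, device="cuda:0")`, provided the detector's model lives on the same device.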

OpenOOD Benchmark

class pytorch_ood.benchmark.CIFAR10_OpenOOD(root, transform)

Aims to replicate the benchmark proposed in OpenOOD: Benchmarking Generalized Out-of-Distribution Detection.

See paper: OpenOOD

Outlier datasets are:

CIFAR100
TinyImageNet
MNIST
FashionMNIST
Textures
Places365

Warning

This currently does not reproduce the benchmark accurately, as it does not exclude images that overlap with CIFAR10.

Parameters:

root – where to store datasets
transform – transform to apply to images

evaluate(detector: Detector, loader_kwargs: Dict | None = None, device: str = 'cpu') → List[Dict]

Evaluates the given detector on all datasets and returns a list with the results.

Parameters:

detector – the detector to evaluate
loader_kwargs – keyword arguments to give to the data loader
device – the device to move batches to

ood_names: List[str]: Names of the OOD datasets

test_sets(known=True, unknown=True) → List[Dataset]

Returns the list of test datasets. If both known and unknown are true, each dataset contains both ID and OOD data.

Parameters:

known – include ID data
unknown – include OOD data

train_set() → Dataset: Returns the training dataset.

CIFAR 100

ODIN Benchmark

class pytorch_ood.benchmark.CIFAR100_ODIN(root, transform)

Replicates the OOD detection benchmark from the ODIN paper for CIFAR 100.

See paper: ArXiv

Outlier datasets are:

TinyImageNetCrop
TinyImageNetResize
LSUNResize
LSUNCrop
Uniform
Gaussian

Parameters:

root – where to store datasets
transform – transform to apply to images

evaluate(detector: Detector, loader_kwargs: Dict | None = None, device: str = 'cpu') → List[Dict]

Evaluates the given detector on all datasets and returns a list with the results.

Parameters:

detector – the detector to evaluate
loader_kwargs – keyword arguments to give to the data loader
device – the device to move batches to

ood_names: List[str]: Names of the OOD datasets

test_sets(known=True, unknown=True) → List[Dataset]

Returns the list of test datasets. If both known and unknown are true, each dataset contains both ID and OOD data.

Parameters:

known – include ID data
unknown – include OOD data

train_set() → Dataset: Returns the training dataset.

OpenOOD Benchmark

class pytorch_ood.benchmark.CIFAR100_OpenOOD(root, transform)

Aims to replicate the benchmark proposed in OpenOOD: Benchmarking Generalized Out-of-Distribution Detection.

See paper: OpenOOD

Outlier datasets are:

CIFAR10
TinyImageNet
MNIST
FashionMNIST
Textures
Places365

Warning

This currently does not reproduce the benchmark accurately, as it does not exclude images that overlap with CIFAR100.

Parameters:

root – where to store datasets
transform – transform to apply to images

evaluate(detector: Detector, loader_kwargs: Dict | None = None, device: str = 'cpu') → List[Dict]

Evaluates the given detector on all datasets and returns a list with the results.

Parameters:

detector – the detector to evaluate
loader_kwargs – keyword arguments to give to the data loader
device – the device to move batches to

ood_names: List[str]: Names of the OOD datasets

test_sets(known=True, unknown=True) → List[Dataset]

Returns the list of test datasets. If both known and unknown are true, each dataset contains both ID and OOD data.

Parameters:

known – include ID data
unknown – include OOD data

train_set() → Dataset: Returns the training dataset.

ImageNet

OpenOOD Benchmark

class pytorch_ood.benchmark.ImageNet_OpenOOD(root, image_net_root, transform)

Aims to replicate the ImageNet benchmark proposed in OpenOOD: Benchmarking Generalized Out-of-Distribution Detection.

See paper: OpenOOD

Outlier datasets are:

ImageNet-O
OpenImage-O
Textures
MNIST
SVHN

Warning

This currently does not reproduce the benchmark accurately, as it does not exclude images that overlap with ImageNet and is missing the Species dataset.

Parameters:

root – where to store datasets
image_net_root – root for the ImageNet dataset
transform – transform to apply to images

evaluate(detector: Detector, loader_kwargs: Dict | None = None, device: str = 'cpu') → List[Dict]

Evaluates the given detector on all datasets and returns a list with the results.

Parameters:

detector – the detector to evaluate
loader_kwargs – keyword arguments to give to the data loader
device – the device to move batches to

ood_names: List[str]: Names of the OOD datasets

test_sets(known=True, unknown=True) → List[Dataset]

Returns the list of test datasets. If both known and unknown are true, each dataset contains both ID and OOD data.

Parameters:

known – include ID data
unknown – include OOD data

train_set() → Dataset: Returns the training dataset.
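
All benchmarks on this page return their results from `evaluate` as a list of dictionaries, so one formatting routine can serve every benchmark. The following is a minimal sketch using only the standard library. `format_results` is a hypothetical helper, and because this page does not specify which keys the dictionaries contain, it handles arbitrary keys. Any key names used when calling it (for example `"Dataset"` or `"AUROC"`) are assumptions.

```python
from typing import Any, Dict, List


def format_results(results: List[Dict[str, Any]]) -> str:
    """Render a list of result dictionaries as an aligned plain-text table."""
    if not results:
        return ""

    # Collect column names in first-seen order, since rows may differ in keys.
    columns: List[str] = []
    for row in results:
        for key in row:
            if key not in columns:
                columns.append(key)

    def cell(value: Any) -> str:
        # Floats are shown with four decimals; everything else via str().
        return f"{value:.4f}" if isinstance(value, float) else str(value)

    table = [[str(col) for col in columns]]
    for row in results:
        table.append([cell(row.get(col, "")) for col in columns])

    widths = [max(len(line[i]) for line in table) for i in range(len(columns))]
    lines = [
        "  ".join(text.ljust(width) for text, width in zip(line, widths)).rstrip()
        for line in table
    ]
    return "\n".join(lines)
```

Calling `print(format_results(benchmark.evaluate(detector)))` would then print one row per outlier dataset, with a header built from whatever keys the result dictionaries contain.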