Benchmarks

Benchmark objects aim to provide a higher level interface to recreate the OOD detection benchmarks used in the literature.

API

Each benchmark implements a common interface.

Note

This interface is currently a draft and is likely to change.

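benchmark = Benchmark(root)

# Detector is a placeholder for any detector class
detector1 = Detector(model)
detector2 = Detector(model)
detector1.fit(benchmark.train_set())
detector2.fit(benchmark.train_set())

results1 = benchmark.evaluate(detector1)
results2 = benchmark.evaluate(detector2)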

Several detectors can also be evaluated together:

results = benchmark.evaluate(
    [detector1, detector2],
    cache=True,
    cache_dir="cache/",
    cache_key="wrn-cifar10-v1",
)

When possible, benchmarks reuse cached logits or pooled features for LogitsDetector and FeaturesDetector instances. With cache=True, those cached representations are kept on the benchmark object and can be reused across later evaluate(...) calls. With cache_dir=..., they can also be written to disk.

Warning

File-backed cache reuse is keyed only by the user-supplied cache_key and lightweight metadata. Users are responsible for changing the key when the model, weights, transforms, or benchmark configuration change.
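One way to reduce the risk of stale cache hits is to derive the key from everything that affects the cached tensors. The sketch below is not part of the library: the helper name make_cache_key and the configuration fields are hypothetical, and it assumes the configuration can be serialized to JSON.

```python
import hashlib
import json


def make_cache_key(prefix: str, **config) -> str:
    """Hypothetical helper: build a key that changes whenever the config changes."""
    # sort_keys makes the serialization independent of keyword order
    payload = json.dumps(config, sort_keys=True, default=str)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}-{digest}"


# Two configurations that differ only in the weights yield different keys.
key_a = make_cache_key(
    "wrn-cifar10", weights="checkpoint-a", transform="resize32", benchmark="CIFAR10_ODIN"
)
key_b = make_cache_key(
    "wrn-cifar10", weights="checkpoint-b", transform="resize32", benchmark="CIFAR10_ODIN"
)
```

A key built this way could then be passed as cache_key=... to evaluate(...). It only protects against changes that are actually listed in the configuration, so the caller still decides what belongs there.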

class pytorch_ood.benchmark.Benchmark

Base class for Benchmarks

evaluate(detector: Detector, loader_kwargs: Dict | None = None, device: str = 'cpu', cache: bool = False, cache_dir: str | None = None, cache_key: str | None = None) → List[Dict]

evaluate(detector: Sequence[Detector], loader_kwargs: Dict | None = None, device: str = 'cpu', cache: bool = False, cache_dir: str | None = None, cache_key: str | None = None) → List[Dict]

Evaluate one detector or a list of detectors on all benchmark datasets.

When several logits detectors or pooled-feature detectors are evaluated together, this method can reuse cached intermediate representations instead of recomputing model outputs for every detector. If cache=True, those representations are also kept on the benchmark instance and reused across later evaluate(...) calls. If cache_dir is given, cached tensors are additionally persisted to disk.

Disk-backed cache reuse is keyed only by user-provided cache_key and lightweight metadata, so cache correctness is the caller’s responsibility.

Parameters:

detector – detector instance or a sequence of detectors
loader_kwargs – keyword arguments forwarded to the data loader
device – device to move inputs and detectors to
cache – keep cached representations on the benchmark instance
cache_dir – optional directory for file-backed caches
cache_key – user-supplied cache key used for disk cache reuse

Returns:

Benchmark results as a list of dictionaries. For multiple detectors, each result includes a Detector field with the detector class name.
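Because evaluate(...) returns a list of dictionaries, results from several detectors can be grouped with plain Python. The sketch below uses hand-written placeholder rows rather than real benchmark output: the Dataset and AUROC key names and all numeric values are assumptions for illustration, and only the Detector field is described above.

```python
from collections import defaultdict
from statistics import mean

# Placeholder rows standing in for the list returned by evaluate(...).
# The values are made up; "Dataset" and "AUROC" are assumed key names.
results = [
    {"Detector": "DetectorA", "Dataset": "ood-1", "AUROC": 0.5},
    {"Detector": "DetectorA", "Dataset": "ood-2", "AUROC": 1.0},
    {"Detector": "DetectorB", "Dataset": "ood-1", "AUROC": 0.25},
    {"Detector": "DetectorB", "Dataset": "ood-2", "AUROC": 0.75},
]


def mean_metric_per_detector(rows, metric):
    """Average one metric over all datasets, separately for each detector."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Detector"]].append(row[metric])
    return {name: mean(values) for name, values in grouped.items()}


summary = mean_metric_per_detector(results, "AUROC")
```

Grouping on the Detector field keeps the aggregation independent of the order in which detectors were passed to evaluate(...).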

abstract test_sets(known=True, unknown=True) → List[Dataset]

List of the different test datasets. If known and unknown are true, each dataset contains ID and OOD data.

Parameters:

known – include ID
unknown – include OOD

abstract train_set() → Dataset

Training dataset
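The contract of the two abstract methods can be illustrated with a toy stand-in. The sketch below deliberately avoids PyTorch: BenchmarkSketch, ToyBenchmark, the list-based datasets, and the label convention (non-negative for ID, -1 for OOD) are assumptions for illustration, not the library's classes or its exact behavior.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple

Sample = Tuple[str, int]  # (input placeholder, label); label -1 marks OOD here


class BenchmarkSketch(ABC):
    """Stand-in for the interface described above, not the library class."""

    @abstractmethod
    def train_set(self) -> List[Sample]:
        ...

    @abstractmethod
    def test_sets(self, known: bool = True, unknown: bool = True) -> List[List[Sample]]:
        ...


class ToyBenchmark(BenchmarkSketch):
    def __init__(self):
        self._train = [("train-0", 0), ("train-1", 1)]
        self._id_test = [("id-0", 0), ("id-1", 1)]
        # One entry per OOD source, so test_sets yields one dataset per source.
        self._ood = {
            "noise": [("noise-0", -1)],
            "other": [("other-0", -1), ("other-1", -1)],
        }

    def train_set(self):
        return list(self._train)

    def test_sets(self, known=True, unknown=True):
        if not known and not unknown:
            raise ValueError("at least one of known/unknown must be True")
        if not unknown:
            # ID data only: a single dataset
            return [list(self._id_test)]
        id_part = self._id_test if known else []
        # Each returned dataset combines the (optional) ID part with one OOD source.
        return [list(id_part) + list(ood) for ood in self._ood.values()]


bench = ToyBenchmark()
mixed = bench.test_sets()                 # ID + OOD, one dataset per OOD source
ood_only = bench.test_sets(known=False)   # OOD samples only
id_only = bench.test_sets(unknown=False)  # ID samples only
```

Returning one mixed dataset per OOD source is what allows results to be reported per outlier dataset.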

Image

Examples can be found here

CIFAR 10

ODIN Benchmark

class pytorch_ood.benchmark.CIFAR10_ODIN(root, transform)

Replicates the OOD detection benchmark from the ODIN paper for CIFAR 10.

See paper: ArXiv

Outlier datasets:

TinyImageNetCrop

TinyImageNetResize

LSUNResize

LSUNCrop

Uniform

Gaussian

Parameters:

root – where to store datasets
transform – transform to apply to images

ood_names: List[str]

OOD dataset names

test_sets(known=True, unknown=True) → List[Dataset]

List of the different test datasets. If known and unknown are true, each dataset contains ID and OOD data.

Parameters:

known – include ID
unknown – include OOD

train_set() → Dataset

Training dataset

OpenOOD Benchmark

class pytorch_ood.benchmark.CIFAR10_OpenOOD(root, transform)

Replicates the CIFAR-10 benchmark proposed in OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection.

See paper: OpenOOD v1.5

Near-OOD datasets:

CIFAR-100

TinyImageNet

Far-OOD datasets:

MNIST

SVHN

Textures

Places365

Parameters:

root – where to store datasets
transform – transform to apply to images

ood_names: List[str]

OOD dataset names

test_sets(known=True, unknown=True) → List[Dataset]

List of the different test datasets. If known and unknown are true, each dataset contains ID and OOD data.

Parameters:

known – include ID
unknown – include OOD

train_set() → Dataset

Training dataset

CIFAR 100

ODIN Benchmark

class pytorch_ood.benchmark.CIFAR100_ODIN(root, transform)

Replicates the OOD detection benchmark from the ODIN paper for CIFAR 100.

See paper: ArXiv

Outlier datasets:

TinyImageNetCrop

TinyImageNetResize

LSUNResize

LSUNCrop

Uniform

Gaussian

Parameters:

root – where to store datasets
transform – transform to apply to images

ood_names: List[str]

OOD dataset names

test_sets(known=True, unknown=True) → List[Dataset]

List of the different test datasets. If known and unknown are true, each dataset contains ID and OOD data.

Parameters:

known – include ID
unknown – include OOD

train_set() → Dataset

Training dataset

OpenOOD Benchmark

class pytorch_ood.benchmark.CIFAR100_OpenOOD(root, transform)

Aims to replicate the benchmark proposed in OpenOOD: Benchmarking Generalized Out-of-Distribution Detection.

See paper: OpenOOD

Outlier datasets:

CIFAR10

TinyImageNet

MNIST

FashionMNIST

Textures

Places365

Warning

This currently does not reproduce the benchmark accurately, because it does not exclude images that overlap with CIFAR100.

Parameters:

root – where to store datasets
transform – transform to apply to images

ood_names: List[str]

OOD dataset names

test_sets(known=True, unknown=True) → List[Dataset]

List of the different test datasets. If known and unknown are true, each dataset contains ID and OOD data.

Parameters:

known – include ID
unknown – include OOD

train_set() → Dataset

Training dataset

ImageNet

OpenOOD Benchmark

class pytorch_ood.benchmark.ImageNet_OpenOOD(root, image_net_root, transform)

Replicates the ImageNet benchmark proposed in OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection.

See paper: OpenOOD v1.5

Near-OOD datasets:

SSB-Hard

NINCO

Far-OOD datasets:

iNaturalist

Textures

OpenImage-O

Parameters:

root – where to store datasets
image_net_root – root for the ImageNet dataset
transform – transform to apply to images

ood_names: List[str]

OOD dataset names

test_sets(known=True, unknown=True) → List[Dataset]

List of the different test datasets. If known and unknown are true, each dataset contains ID and OOD data.

Parameters:

known – include ID
unknown – include OOD

train_set() → Dataset

Training dataset

Benchmark caching can reuse intermediate logits or pooled features when evaluating several compatible detectors on the same benchmark.

results = benchmark.evaluate(
    [detector1, detector2],
    cache=True,
    cache_dir="cache/",
    cache_key="wrn-cifar10-v1",
)

In the current implementation, the detector families that support caching are LogitsDetector and pooled FeaturesDetector instances. Detectors whose semantics depend on raw inputs, gradients, feature maps, or structured multi-layer representations automatically fall back to their standard predict(...) path. Mahalanobis with eps > 0 is explicitly treated as such a fallback case.
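The fallback rule described above can be pictured as a small dispatch function. This is a sketch of the idea only, not the library's code: the Stub classes, the pooled and eps attributes, and the function name can_use_cache are all hypothetical, and it assumes that a Mahalanobis-style detector with eps equal to 0 counts as a pooled-feature detector.

```python
class StubLogitsDetector:
    """Stand-in for a detector that only needs logits."""


class StubFeaturesDetector:
    """Stand-in for a feature-based detector; pooled marks pooled features."""

    def __init__(self, pooled: bool = True):
        self.pooled = pooled


class StubMahalanobis(StubFeaturesDetector):
    """Stand-in with an eps attribute, mirroring the eps > 0 case named above."""

    def __init__(self, eps: float = 0.0):
        super().__init__(pooled=True)
        self.eps = eps


class StubInputDetector:
    """Stand-in for a detector that needs raw inputs or gradients."""


def can_use_cache(detector) -> bool:
    """Return True if cached logits or pooled features would suffice."""
    # eps > 0 is listed as an explicit fallback case
    if getattr(detector, "eps", 0.0) > 0:
        return False
    if isinstance(detector, StubLogitsDetector):
        return True
    if isinstance(detector, StubFeaturesDetector):
        return detector.pooled
    # Anything else goes through its regular predict(...) path
    return False


detectors = [
    StubLogitsDetector(),
    StubFeaturesDetector(pooled=True),
    StubFeaturesDetector(pooled=False),
    StubMahalanobis(eps=0.0),
    StubMahalanobis(eps=0.1),
    StubInputDetector(),
]
plan = ["cache" if can_use_cache(d) else "predict" for d in detectors]
```

In a mixed list, only the detectors marked "cache" would share one set of model outputs, while the rest would run their own forward passes.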

Warning

Disk-backed cache reuse is controlled by the user-provided cache_key. Cache validity is therefore the caller’s responsibility. Change the key whenever the model, weights, transforms, or benchmark setup change.