Benchmarks

Benchmark objects provide a higher-level interface for recreating the out-of-distribution (OOD) detection benchmarks used in the literature.

API

Each benchmark implements a common interface.

Note

This interface is currently a draft and is likely to change.

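# Benchmark and Detector stand for concrete subclasses
benchmark = Benchmark(root)

# Fit each detector on the benchmark's training set before evaluating it
detector1 = Detector(model)
detector1.fit(benchmark.train_set())
detector2 = Detector(model)
detector2.fit(benchmark.train_set())

results1 = benchmark.evaluate(detector1)
results2 = benchmark.evaluate(detector2)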

Several detectors can also be evaluated together:

results = benchmark.evaluate(
    [detector1, detector2],
    cache=True,
    cache_dir="cache/",
    cache_key="wrn-cifar10-v1",
)

When possible, benchmarks reuse cached logits or pooled features for LogitsDetector and FeaturesDetector instances. With cache=True, those cached representations are kept on the benchmark object and can be reused across later evaluate(...) calls. With cache_dir=..., they can also be written to disk.

Warning

File-backed cache reuse is keyed only by the user-supplied cache_key and lightweight metadata. Users are responsible for changing the key when the model, weights, transforms, or benchmark configuration change.
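One way to make key changes less error-prone is to derive the key from the settings that affect the cached tensors. The helper below is a hypothetical convenience, not part of pytorch_ood, and the setting names are assumptions chosen for illustration.

```python
import hashlib
import json


def make_cache_key(prefix, **settings):
    """Build a key that changes whenever any listed setting changes.

    Hypothetical helper, not part of pytorch_ood.
    """
    # sort_keys makes the digest independent of keyword order.
    payload = json.dumps(settings, sort_keys=True, default=str)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}-{digest}"


# Two runs that differ only in the (assumed) checkpoint name get different keys.
key_v1 = make_cache_key("wrn-cifar10", weights="checkpoint-a.pt", image_size=32)
key_v2 = make_cache_key("wrn-cifar10", weights="checkpoint-b.pt", image_size=32)
```

The resulting string could then be passed as cache_key. It only covers the settings that are listed, so anything omitted (for example a changed transform) still has to be tracked by hand.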

class pytorch_ood.benchmark.Benchmark

Base class for Benchmarks

evaluate(detector: Detector, loader_kwargs: Dict | None = None, device: str = 'cpu', cache: bool = False, cache_dir: str | None = None, cache_key: str | None = None) → List[Dict]
evaluate(detector: Sequence[Detector], loader_kwargs: Dict | None = None, device: str = 'cpu', cache: bool = False, cache_dir: str | None = None, cache_key: str | None = None) → List[Dict]

Evaluate one detector or a list of detectors on all benchmark datasets.

When several logits detectors or pooled-feature detectors are evaluated together, this method can reuse cached intermediate representations instead of recomputing model outputs for every detector. If cache=True, those representations are also kept on the benchmark instance and reused across later evaluate(...) calls. If cache_dir is given, cached tensors are additionally persisted to disk.

Disk-backed cache reuse is keyed only by user-provided cache_key and lightweight metadata, so cache correctness is the caller’s responsibility.

Parameters:
  • detector – detector instance or a sequence of detectors

  • loader_kwargs – keyword arguments forwarded to the data loader

  • device – device to move inputs and detectors to

  • cache – keep cached representations on the benchmark instance

  • cache_dir – optional directory for file-backed caches

  • cache_key – user-supplied cache key used for disk cache reuse

Returns:

Benchmark results as a list of dictionaries. When several detectors are evaluated, each result includes a Detector field with the detector class name.
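Because the return value is a list of dictionaries, the results of a multi-detector run can be regrouped with plain Python. The sketch below relies only on the Detector field named above; the AUROC key and all values are made-up placeholders, not real benchmark output, and group_by_detector is a hypothetical helper.

```python
from collections import defaultdict


def group_by_detector(results):
    """Group result rows by their 'Detector' field (hypothetical helper)."""
    grouped = defaultdict(list)
    for row in results:
        grouped[row["Detector"]].append(row)
    return dict(grouped)


# Placeholder rows for illustration; keys other than "Detector" are assumptions.
example_results = [
    {"Detector": "DetectorA", "AUROC": 0.5},
    {"Detector": "DetectorB", "AUROC": 0.5},
    {"Detector": "DetectorA", "AUROC": 0.5},
]

grouped = group_by_detector(example_results)
```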

abstract test_sets(known=True, unknown=True) → List[Dataset]

List of the different test datasets. If both known and unknown are true, each dataset contains in-distribution (ID) and out-of-distribution (OOD) data.

Parameters:
  • known – include ID data

  • unknown – include OOD data

abstract train_set() → Dataset

Training dataset
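To make the interface concrete, the sketch below mirrors the two abstract methods with a standard-library stand-in. ToyBenchmarkBase and ToyBenchmark are hypothetical; lists of labelled tuples take the place of Dataset objects, negative labels mark OOD samples purely as a convention of this example, and the handling of known=False or unknown=False is a choice of the sketch rather than documented library behaviour.

```python
from abc import ABC, abstractmethod


class ToyBenchmarkBase(ABC):
    """Stand-in mirroring the two abstract methods; not the pytorch_ood class."""

    @abstractmethod
    def train_set(self):
        ...

    @abstractmethod
    def test_sets(self, known=True, unknown=True):
        ...


class ToyBenchmark(ToyBenchmarkBase):
    def __init__(self):
        self._train = [("train-0", 0), ("train-1", 1)]
        self._id_test = [("id-0", 0), ("id-1", 1)]
        # One entry per outlier dataset; label -1 marks OOD in this example.
        self._ood_tests = {
            "noise": [("noise-0", -1)],
            "other": [("other-0", -1), ("other-1", -1)],
        }

    def train_set(self):
        return self._train

    def test_sets(self, known=True, unknown=True):
        if not unknown:
            # Without OOD data there is only the ID test set (or nothing).
            return [list(self._id_test)] if known else []
        id_part = self._id_test if known else []
        # One test set per outlier dataset, optionally mixed with the ID data.
        return [id_part + ood for ood in self._ood_tests.values()]


bench = ToyBenchmark()
mixed = bench.test_sets()                # ID + OOD, one list per outlier dataset
ood_only = bench.test_sets(known=False)  # OOD samples only
```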

Image

Examples can be found here

CIFAR 10

ODIN Benchmark

class pytorch_ood.benchmark.CIFAR10_ODIN(root, transform)

Replicates the OOD detection benchmark from the ODIN paper for CIFAR 10.

See Paper:

ArXiv

Outlier datasets are

  • TinyImageNetCrop

  • TinyImageNetResize

  • LSUNResize

  • LSUNCrop

  • Uniform

  • Gaussian

Parameters:
  • root – where to store datasets

  • transform – transform to apply to images

ood_names: List[str]

OOD Dataset names

test_sets(known=True, unknown=True) → List[Dataset]

List of the different test datasets. If known and unknown are true, each dataset contains ID and OOD data.

Parameters:
  • known – include ID

  • unknown – include OOD

train_set() → Dataset

Training dataset

OpenOOD Benchmark

class pytorch_ood.benchmark.CIFAR10_OpenOOD(root, transform)

Replicates the CIFAR-10 benchmark proposed in OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection.

See Paper:

OpenOOD v1.5

Near-OOD datasets:

  • CIFAR-100

  • TinyImageNet

Far-OOD datasets:

  • MNIST

  • SVHN

  • Textures

  • Places365

Parameters:
  • root – where to store datasets

  • transform – transform to apply to images

ood_names: List[str]

OOD Dataset names

test_sets(known=True, unknown=True) → List[Dataset]

List of the different test datasets. If known and unknown are true, each dataset contains ID and OOD data.

Parameters:
  • known – include ID

  • unknown – include OOD

train_set() → Dataset

Training dataset

CIFAR 100

ODIN Benchmark

class pytorch_ood.benchmark.CIFAR100_ODIN(root, transform)

Replicates the OOD detection benchmark from the ODIN paper for CIFAR 100.

See Paper:

ArXiv

Outlier datasets are

  • TinyImageNetCrop

  • TinyImageNetResize

  • LSUNResize

  • LSUNCrop

  • Uniform

  • Gaussian

Parameters:
  • root – where to store datasets

  • transform – transform to apply to images

ood_names: List[str]

OOD Dataset names

test_sets(known=True, unknown=True) → List[Dataset]

List of the different test datasets. If known and unknown are true, each dataset contains ID and OOD data.

Parameters:
  • known – include ID

  • unknown – include OOD

train_set() → Dataset

Training dataset

OpenOOD Benchmark

class pytorch_ood.benchmark.CIFAR100_OpenOOD(root, transform)

Aims to replicate the benchmark proposed in OpenOOD: Benchmarking Generalized Out-of-Distribution Detection.

See Paper:

OpenOOD

Outlier datasets are

  • CIFAR10

  • TinyImageNet

  • MNIST

  • FashionMNIST

  • Textures

  • Places365

Warning

This currently does not reproduce the benchmark accurately, because it does not exclude images that overlap with CIFAR100.

Parameters:
  • root – where to store datasets

  • transform – transform to apply to images

ood_names: List[str]

OOD Dataset names

test_sets(known=True, unknown=True) → List[Dataset]

List of the different test datasets. If known and unknown are true, each dataset contains ID and OOD data.

Parameters:
  • known – include ID

  • unknown – include OOD

train_set() → Dataset

Training dataset

ImageNet

OpenOOD Benchmark

class pytorch_ood.benchmark.ImageNet_OpenOOD(root, image_net_root, transform)

Replicates the ImageNet benchmark proposed in OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection.

See Paper:

OpenOOD v1.5

Near-OOD datasets:

  • SSB-Hard

  • NINCO

Far-OOD datasets:

  • iNaturalist

  • Textures

  • OpenImage-O

Parameters:
  • root – where to store datasets

  • image_net_root – root for the ImageNet dataset

  • transform – transform to apply to images

ood_names: List[str]

OOD Dataset names

test_sets(known=True, unknown=True) → List[Dataset]

List of the different test datasets. If known and unknown are true, each dataset contains ID and OOD data.

Parameters:
  • known – include ID

  • unknown – include OOD

train_set() → Dataset

Training dataset

Benchmark caching can reuse intermediate logits or pooled features when evaluating several compatible detectors on the same benchmark.

results = benchmark.evaluate(
    [detector1, detector2],
    cache=True,
    cache_dir="cache/",
    cache_key="wrn-cifar10-v1",
)

In the current implementation, the detector families that can use the cache are LogitsDetector and pooled FeaturesDetector instances. Detectors that depend on raw inputs, gradients, feature maps, or structured multi-layer representations automatically fall back to their standard predict(...) path. Mahalanobis with eps > 0 is explicitly treated as such a fallback case, since a nonzero eps enables gradient-based input preprocessing.
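The fallback rule amounts to a routing decision per detector. The following sketch illustrates it with hypothetical stand-in classes (ToyLogitsDetector, ToyFeatureDetector, ToyInputDetector) and hypothetical helpers (uses_cache, split_detectors); the actual checks in pytorch_ood may differ.

```python
class ToyLogitsDetector:
    """Stand-in for a detector that only needs logits."""

    def __init__(self, name):
        self.name = name


class ToyFeatureDetector:
    """Stand-in for a pooled-feature detector with optional input perturbation."""

    def __init__(self, name, eps=0.0):
        self.name = name
        self.eps = eps


class ToyInputDetector:
    """Stand-in for a detector that needs raw inputs or gradients."""

    def __init__(self, name):
        self.name = name


def uses_cache(detector):
    """Decide whether a detector can score from cached representations."""
    if isinstance(detector, ToyLogitsDetector):
        return True
    if isinstance(detector, ToyFeatureDetector):
        # Perturbing the inputs (eps > 0) requires a fresh forward pass.
        return detector.eps == 0
    return False


def split_detectors(detectors):
    """Partition detectors into a cached group and a fallback group."""
    cached = [d for d in detectors if uses_cache(d)]
    fallback = [d for d in detectors if not uses_cache(d)]
    return cached, fallback


detectors = [
    ToyLogitsDetector("max-logit"),
    ToyFeatureDetector("features-plain"),
    ToyFeatureDetector("features-perturbed", eps=0.002),
    ToyInputDetector("gradient-based"),
]
cached, fallback = split_detectors(detectors)
```

In this sketch the fallback group would simply call each detector's normal prediction path, so mixing both kinds of detectors in one evaluate(...) call stays valid.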

Warning

Disk-backed cache reuse is controlled by the user-provided cache_key. Cache validity is therefore the caller’s responsibility. Change the key whenever the model, weights, transforms, or benchmark setup change.