Benchmarks
Benchmark objects aim to provide a higher-level interface for recreating the OOD detection benchmarks used in the literature.
API
Each benchmark implements a common interface.
Note
This API is currently a draft and is likely to change.
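benchmark = Benchmark(root)
# Detector stands in for any concrete detector class
detector1 = Detector(model)
detector1.fit(benchmark.train_set())
detector2 = Detector(model)
detector2.fit(benchmark.train_set())
# evaluate each detector on all benchmark datasets
results1 = benchmark.evaluate(detector1)
results2 = benchmark.evaluate(detector2)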
Several detectors can also be evaluated together:
results = benchmark.evaluate(
[detector1, detector2],
cache=True,
cache_dir="cache/",
cache_key="wrn-cifar10-v1",
)
When possible, benchmarks reuse cached logits or pooled features for
LogitsDetector and FeaturesDetector instances. With cache=True,
those cached representations are kept on the benchmark object and can be
reused across later evaluate(...) calls. With cache_dir=..., they
can also be written to disk.
Warning
File-backed cache reuse is keyed only by the user-supplied cache_key
and lightweight metadata. Users are responsible for changing the key when
the model, weights, transforms, or benchmark configuration change.
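One way to reduce the risk of stale disk caches is to derive the key from the configuration itself, so that it changes whenever a relevant setting changes. The following is a minimal sketch using only the standard library. The helper make_cache_key and the configuration values are hypothetical and not part of pytorch_ood.

```python
import hashlib
import json


def make_cache_key(prefix: str, **config) -> str:
    """Build a cache key that changes whenever any config value changes.

    Hypothetical helper for illustration; not part of pytorch_ood.
    """
    # sort_keys makes the serialization independent of argument order
    payload = json.dumps(config, sort_keys=True, default=str)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}-{digest}"


# Placeholder settings; replace with whatever identifies your setup.
key_v1 = make_cache_key(
    "wrn-cifar10",
    weights="checkpoint-a.pt",
    transform="resize32-normalize",
    benchmark="CIFAR10_ODIN",
)

# Changing the weights yields a different key, so old caches are not reused.
key_v2 = make_cache_key(
    "wrn-cifar10",
    weights="checkpoint-b.pt",
    transform="resize32-normalize",
    benchmark="CIFAR10_ODIN",
)
```

The resulting string could then be passed as cache_key. The key is only as reliable as the settings that are fed into it: anything left out of the configuration can still change without changing the key.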
- class pytorch_ood.benchmark.Benchmark
Base class for Benchmarks
- evaluate(detector: Detector, loader_kwargs: Dict | None = None, device: str = 'cpu', cache: bool = False, cache_dir: str | None = None, cache_key: str | None = None) -> List[Dict]
- evaluate(detector: Sequence[Detector], loader_kwargs: Dict | None = None, device: str = 'cpu', cache: bool = False, cache_dir: str | None = None, cache_key: str | None = None) -> List[Dict]
Evaluate one detector or a list of detectors on all benchmark datasets.
When several logits detectors or pooled-feature detectors are evaluated together, this method can reuse cached intermediate representations instead of recomputing model outputs for every detector. If cache=True, those representations are also kept on the benchmark instance and reused across later evaluate(...) calls. If cache_dir is given, cached tensors are additionally persisted to disk.
Disk-backed cache reuse is keyed only by the user-provided cache_key and lightweight metadata, so cache correctness is the caller’s responsibility.
- Parameters:
detector – detector instance or a sequence of detectors
loader_kwargs – keyword arguments forwarded to the data loader
device – device to move inputs and detectors to
cache – keep cached representations on the benchmark instance
cache_dir – optional directory for file-backed caches
cache_key – user-supplied cache key used for disk cache reuse
- Returns:
benchmark results. For multiple detectors, each result includes a Detector field with the detector class name.
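Because results for several detectors come back as one flat list, it can help to regroup them by the Detector field. The sketch below assumes each result is a plain dict. The "Dataset" and "AUROC" keys and all values are illustrative placeholders, not real benchmark output, and group_by_detector is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_detector(results: List[Dict]) -> Dict[str, List[Dict]]:
    """Group flat result rows by their "Detector" field."""
    grouped = defaultdict(list)
    for row in results:
        # Rows from a single-detector run may lack the field.
        grouped[row.get("Detector", "unknown")].append(row)
    return dict(grouped)


# Placeholder rows; keys other than "Detector" are assumptions.
results = [
    {"Detector": "DetectorA", "Dataset": "dataset-1", "AUROC": 0.5},
    {"Detector": "DetectorA", "Dataset": "dataset-2", "AUROC": 0.5},
    {"Detector": "DetectorB", "Dataset": "dataset-1", "AUROC": 0.5},
    {"Detector": "DetectorB", "Dataset": "dataset-2", "AUROC": 0.5},
]

grouped = group_by_detector(results)
for name, rows in grouped.items():
    print(name, [row["Dataset"] for row in rows])
```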
Image
Examples can be found here
CIFAR 10
ODIN Benchmark
- class pytorch_ood.benchmark.CIFAR10_ODIN(root, transform)
Replicates the OOD detection benchmark from the ODIN paper for CIFAR 10.
- See Paper:
Outlier datasets are
TinyImageNetCrop
TinyImageNetResize
LSUNResize
LSUNCrop
Uniform
Gaussian
- Parameters:
root – where to store datasets
transform – transform to apply to images
- ood_names: List[str]
OOD Dataset names
OpenOOD Benchmark
- class pytorch_ood.benchmark.CIFAR10_OpenOOD(root, transform)
Replicates the CIFAR-10 benchmark proposed in OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection.
- See Paper:
Near-OOD datasets:
CIFAR-100
TinyImageNet
Far-OOD datasets:
MNIST
SVHN
Textures
Places365
- Parameters:
root – where to store datasets
transform – transform to apply to images
- ood_names: List[str]
OOD Dataset names
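The near/far split above is commonly summarized by averaging a metric within each group. The following sketch shows one way to do that on a list of result dicts. It assumes each result carries "Dataset" and "AUROC" keys and that dataset names match the lists above; the strings actually reported in ood_names or in the results may differ. All values are placeholders, and mean_by_group is a hypothetical helper.

```python
from statistics import mean
from typing import Dict, List

# Group membership copied from the dataset lists above.
NEAR_OOD = {"CIFAR-100", "TinyImageNet"}
FAR_OOD = {"MNIST", "SVHN", "Textures", "Places365"}


def mean_by_group(results: List[Dict], metric: str = "AUROC") -> Dict[str, float]:
    """Average one metric over the near-OOD and far-OOD datasets."""
    groups: Dict[str, List[float]] = {"near": [], "far": []}
    for row in results:
        if row["Dataset"] in NEAR_OOD:
            groups["near"].append(row[metric])
        elif row["Dataset"] in FAR_OOD:
            groups["far"].append(row[metric])
        # Rows for unknown datasets are ignored.
    return {name: mean(values) for name, values in groups.items() if values}


# Placeholder values, chosen only to make the arithmetic easy to follow.
results = [
    {"Dataset": "CIFAR-100", "AUROC": 0.25},
    {"Dataset": "TinyImageNet", "AUROC": 0.75},
    {"Dataset": "MNIST", "AUROC": 0.5},
    {"Dataset": "SVHN", "AUROC": 1.0},
    {"Dataset": "Textures", "AUROC": 0.5},
    {"Dataset": "Places365", "AUROC": 1.0},
]

summary = mean_by_group(results)
```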
CIFAR 100
ODIN Benchmark
- class pytorch_ood.benchmark.CIFAR100_ODIN(root, transform)
Replicates the OOD detection benchmark from the ODIN paper for CIFAR 100.
- See Paper:
Outlier datasets are
TinyImageNetCrop
TinyImageNetResize
LSUNResize
LSUNCrop
Uniform
Gaussian
- Parameters:
root – where to store datasets
transform – transform to apply to images
- ood_names: List[str]
OOD Dataset names
OpenOOD Benchmark
- class pytorch_ood.benchmark.CIFAR100_OpenOOD(root, transform)
Aims to replicate the benchmark proposed in OpenOOD: Benchmarking Generalized Out-of-Distribution Detection.
- See Paper:
Outlier datasets are
CIFAR10
TinyImageNet
MNIST
FashionMNIST
Textures
Places365
Warning
This does not currently reproduce the benchmark exactly, because it does not exclude images that overlap with CIFAR100.
- Parameters:
root – where to store datasets
transform – transform to apply to images
- ood_names: List[str]
OOD Dataset names
ImageNet
OpenOOD Benchmark
- class pytorch_ood.benchmark.ImageNet_OpenOOD(root, image_net_root, transform)
Replicates the ImageNet benchmark proposed in OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection.
- See Paper:
Near-OOD datasets:
SSB-Hard
NINCO
Far-OOD datasets:
iNaturalist
Textures
OpenImage-O
- Parameters:
root – where to store datasets
image_net_root – root for the ImageNet dataset
transform – transform to apply to images
- ood_names: List[str]
OOD Dataset names
Benchmark caching can reuse intermediate logits or pooled features when evaluating several compatible detectors on the same benchmark.
Compatible cached detector families in the current implementation are
LogitsDetector and pooled FeaturesDetector instances. Detectors whose
semantics depend on raw inputs, gradients, feature maps, or structured
multi-layer representations automatically fall back to their standard
predict(...) path. Mahalanobis with eps > 0 is treated as such a
fallback case explicitly.
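The idea behind reusing cached logits can be illustrated with a toy sketch in plain Python. It is not the library's implementation: CountingModel, the two scoring functions, and evaluate_with_cache are hypothetical names. The fallback path for detectors that need raw inputs is not modeled.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Logits = List[List[float]]


class CountingModel:
    """Toy stand-in for a network that counts its forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, inputs: Sequence[Sequence[float]]) -> Logits:
        self.calls += 1
        # For simplicity, the inputs are treated as if they were logits.
        return [list(x) for x in inputs]


def neg_max_logit(logits: Logits) -> List[float]:
    """Score each sample by its negated maximum logit."""
    return [-max(row) for row in logits]


def neg_logsumexp(logits: Logits) -> List[float]:
    """Score each sample by its negated log-sum-exp over the logits."""
    return [-math.log(sum(math.exp(z) for z in row)) for row in logits]


def evaluate_with_cache(
    model: Callable[[Sequence[Sequence[float]]], Logits],
    datasets: Dict[str, Sequence[Sequence[float]]],
    scorers: Dict[str, Callable[[Logits], List[float]]],
    cache: Dict[str, Logits],
) -> Dict[Tuple[str, str], List[float]]:
    """Compute logits once per dataset and share them across all scorers."""
    scores = {}
    for name, inputs in datasets.items():
        if name not in cache:
            cache[name] = model(inputs)  # the only forward pass for this dataset
        for scorer_name, scorer in scorers.items():
            scores[(scorer_name, name)] = scorer(cache[name])
    return scores


model = CountingModel()
datasets = {"in": [[2.0, 0.0]], "out": [[0.1, 0.0]]}
scorers = {"max_logit": neg_max_logit, "logsumexp": neg_logsumexp}
cache: Dict[str, Logits] = {}

scores = evaluate_with_cache(model, datasets, scorers, cache)
# A second call with the same cache triggers no further forward passes.
scores_again = evaluate_with_cache(model, datasets, scorers, cache)
```

In this sketch the model runs once per dataset regardless of how many scorers are evaluated, which is the saving that caching aims for when several compatible detectors share the same logits.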
Warning
Disk-backed cache reuse is controlled by the user-provided cache_key.
Cache validity is therefore the caller’s responsibility. Change the key
whenever the model, weights, transforms, or benchmark setup change.