Evaluation
We use a small suite of metrics to check the accuracy, calibration, and precision of approximate posteriors across SBI and classical samplers.
How evaluation works
Submissions are evaluated against reference posteriors computed by the benchmark maintainers. In practice, you provide posterior samples for each test observation, and we compare them to the reference using the metrics below.
The evaluation is designed to be method-agnostic: it should work for amortised SBI, hybrid SBI + MCMC, and traditional samplers alike.
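As a rough illustration of this sample-based comparison, the sketch below scores each test observation separately and then averages the scores. Everything named here is hypothetical: the `evaluate` helper, the dictionary layout keyed by observation id, the averaging step, and the placeholder `mean_gap` metric are assumptions made for the example, not the benchmark's actual interface or submission format. Because the metric only ever sees samples, the same loop works regardless of which method produced them.

```python
from statistics import mean


def evaluate(submission, reference, metric):
    """Score each observation's samples against the reference, then average.

    `submission` and `reference` map an observation id to a list of
    posterior samples (1D floats in this sketch).
    """
    missing = set(reference) - set(submission)
    if missing:
        raise KeyError(f"submission lacks observations: {sorted(missing)}")
    per_obs = {obs: metric(submission[obs], reference[obs]) for obs in reference}
    return per_obs, mean(per_obs.values())


def mean_gap(approx, ref):
    """Placeholder metric: absolute difference between the sample means."""
    return abs(mean(approx) - mean(ref))


# Toy inputs with two test observations (made-up numbers).
submission = {"obs_0": [0.5, 1.0, 1.5], "obs_1": [2.0, 2.0, 2.0]}
reference = {"obs_0": [1.0, 1.0, 1.0], "obs_1": [2.5, 2.5, 2.5]}

per_obs, overall = evaluate(submission, reference, mean_gap)
print(per_obs, overall)
```

A real metric would replace `mean_gap`; the difference of means is used only because it keeps the example short.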
For details of the submission workflow, see Submit.
Metrics
No single number can certify a posterior. Different failure modes require different lenses: a method can produce samples that look like the reference on average while being systematically overconfident, or it can be calibrated but diffuse. We therefore organise evaluation around three complementary questions, each targeting a distinct failure mode.
Does the method get roughly the right answer? Measures whether approximate posterior samples are statistically indistinguishable from a reference.
Is the posterior well-calibrated, that is, neither overconfident nor underconfident? Credible intervals should contain the truth at their nominal frequency: a 90% interval should cover the true parameter in about 90% of cases.
How close is the approximate posterior to the true one? Distributional distances give a finer-grained comparison beyond pass/fail accuracy.