Evaluation

We use a small suite of metrics to check the accuracy, calibration, and precision of approximate posteriors, across both SBI and classical samplers.

How evaluation works

Submissions are evaluated against reference posteriors computed by the benchmark maintainers. In general, this means you provide posterior samples for each test observation, and we compare them to the reference using the metrics below.

The evaluation is designed to be method-agnostic: it should work for amortised SBI, hybrid SBI + MCMC, and traditional samplers alike.

If you want the submission workflow details, see Submit.

Metrics

No single number can certify a posterior. Different failure modes require different lenses: a method can produce samples that look like the reference on average while being systematically overconfident, or it can be calibrated but diffuse. We therefore organise evaluation around three complementary questions, each targeting a distinct failure mode.

Accuracy

Does the method get roughly the right answer? Measures whether approximate posterior samples are statistically indistinguishable from a reference.

C2ST

Robustness / Calibration

Is the posterior well calibrated, neither overconfident nor underconfident? Credible intervals should contain the truth at the right frequency.

Expected coverage, P–P plots

Precision

How close is the approximate posterior to the true one? Distributional distances give a finer-grained comparison beyond pass/fail accuracy.

KL divergence, JSD, IS-ESS

Accuracy: classifier-based tests

C2ST: Classifier Two-Sample Test
Can a classifier tell your samples from the reference?

Train a binary classifier to separate samples from the approximate posterior $\hat{q}$ from samples from the reference $p$. If the two distributions match, the classifier can do no better than random guessing, and its held-out accuracy approaches 0.5. Accuracies well above 0.5 indicate a detectable mismatch.

Lopez-Paz & Oquab (2017)
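
The C2ST recipe above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own implementation: it assumes a k-nearest-neighbour classifier, a single 50/50 train/test split, and toy 2-D Gaussian samples. The names `knn_c2st` and `gaussian_samples` are hypothetical, and only the Python standard library is used so that the example stays self-contained.

```python
import math
import random


def knn_c2st(approx, reference, k=5, seed=0):
    """Held-out accuracy of a k-NN classifier separating two sample sets."""
    rng = random.Random(seed)
    # Label approximate samples 0 and reference samples 1, then shuffle.
    data = [(x, 0) for x in approx] + [(x, 1) for x in reference]
    rng.shuffle(data)
    split = len(data) // 2
    train, test = data[:split], data[split:]

    correct = 0
    for x, label in test:
        # The k training points closest to x in Euclidean distance.
        nearest = sorted(train, key=lambda item: math.dist(x, item[0]))[:k]
        votes = sum(lbl for _, lbl in nearest)
        predicted = 1 if 2 * votes > k else 0  # majority vote (k is odd)
        correct += predicted == label
    return correct / len(test)


def gaussian_samples(n, mean, rng):
    """Toy data: n points from a unit-variance isotropic Gaussian in 2-D."""
    return [(rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)) for _ in range(n)]


rng = random.Random(1)
reference = gaussian_samples(300, 0.0, rng)  # stands in for reference samples
matched = gaussian_samples(300, 0.0, rng)    # same distribution as reference
shifted = gaussian_samples(300, 3.0, rng)    # clearly different distribution

acc_matched = knn_c2st(matched, reference)
acc_shifted = knn_c2st(shifted, reference)
print(f"matched: {acc_matched:.2f}, shifted: {acc_shifted:.2f}")
```

With matched samples the accuracy should sit near 0.5, while the shifted samples should be separated almost perfectly. In practice one would typically use a stronger classifier and cross-validation instead of a single split.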
Robustness / Calibration: coverage checks

Expected Coverage
Do credible intervals contain the truth at the right rate?
[Figure: expected coverage (vertical axis) against credibility level (horizontal axis)]

At nominal level $\alpha$, the fraction of test events where the true parameter falls inside the credible region should equal $\alpha$. A well-calibrated posterior traces the diagonal; the shaded gap reveals systematic over- or under-confidence.

$$\mathrm{Coverage}(\alpha)=\frac{1}{N}\sum_{i}\mathbb{1}\left[\theta_i^*\in C_\alpha(x_i)\right]$$
Lemos et al. (2023), TARP
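
The coverage fraction can be estimated directly from posterior samples. The sketch below is an illustration under stated assumptions, not the TARP procedure itself: it uses a toy conjugate model (prior $\theta\sim\mathcal{N}(0,1)$, likelihood $x\mid\theta\sim\mathcal{N}(\theta,1)$, exact posterior $\mathcal{N}(x/2,\,1/2)$) and central 1-D intervals as the credible region $C_\alpha(x_i)$. All function names are hypothetical.

```python
import random


def central_interval(samples, alpha):
    """Empirical central interval holding a fraction alpha of the samples."""
    s = sorted(samples)
    lo = s[int(round((1 - alpha) / 2 * (len(s) - 1)))]
    hi = s[int(round((1 + alpha) / 2 * (len(s) - 1)))]
    return lo, hi


def expected_coverage(posterior_sampler, alpha, n_events=500, n_samples=200, seed=0):
    """Fraction of test events whose true parameter lies in the alpha interval."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_events):
        theta_true = rng.gauss(0.0, 1.0)   # theta* drawn from the prior N(0, 1)
        x = rng.gauss(theta_true, 1.0)     # observation x | theta ~ N(theta, 1)
        samples = posterior_sampler(x, n_samples, rng)
        lo, hi = central_interval(samples, alpha)
        hits += lo <= theta_true <= hi
    return hits / n_events


def exact_posterior(x, n, rng):
    # Exact posterior of the toy model: N(x / 2, 1 / 2).
    return [rng.gauss(x / 2, 0.5 ** 0.5) for _ in range(n)]


def overconfident_posterior(x, n, rng):
    # Same mean, but half the correct standard deviation.
    return [rng.gauss(x / 2, 0.5 * 0.5 ** 0.5) for _ in range(n)]


alpha = 0.68
cov_exact = expected_coverage(exact_posterior, alpha)
cov_over = expected_coverage(overconfident_posterior, alpha)
print(f"alpha={alpha}: exact={cov_exact:.2f}, overconfident={cov_over:.2f}")
```

The exact posterior should give coverage close to the nominal level, while the overconfident one falls well below it. Repeating the calculation over a grid of levels gives the curve that is compared against the diagonal.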
P–P Plots
Are posterior rank statistics uniformly distributed?
[Figure: empirical CDF of the ranks, CDF(p), against p]

For each parameter, rank the true value among the posterior samples. Under a calibrated posterior these ranks follow a uniform distribution, so their empirical CDF (solid) should trace the diagonal. Deviations reveal per-parameter miscalibration.

Talts et al. (2018), Simulation-Based Calibration
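
The rank computation behind a P–P plot can be sketched as follows. This is a hedged illustration on an assumed toy model (prior $\mathcal{N}(0,1)$, likelihood $\mathcal{N}(\theta,1)$, exact posterior $\mathcal{N}(x/2,\,1/2)$), with hypothetical helper names. It returns the points of the curve and the largest gap from the diagonal instead of drawing the plot.

```python
import random


def normalised_ranks(posterior_sampler, n_events=1000, n_samples=100, seed=0):
    """Rank of the true parameter among posterior samples, scaled to [0, 1]."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_events):
        theta_true = rng.gauss(0.0, 1.0)   # theta* drawn from the prior N(0, 1)
        x = rng.gauss(theta_true, 1.0)     # observation x | theta ~ N(theta, 1)
        samples = posterior_sampler(x, n_samples, rng)
        rank = sum(s < theta_true for s in samples)  # integer in 0..n_samples
        ranks.append(rank / n_samples)
    return ranks


def pp_curve(ranks, grid_size=21):
    """Empirical CDF of the ranks evaluated on an even grid of p values."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    ecdf = [sum(r <= p for r in ranks) / len(ranks) for p in grid]
    return grid, ecdf


def max_deviation(grid, ecdf):
    """Largest vertical gap between the P-P curve and the diagonal."""
    return max(abs(c - p) for p, c in zip(grid, ecdf))


def exact_posterior(x, n, rng):
    # Exact posterior of the toy model: N(x / 2, 1 / 2).
    return [rng.gauss(x / 2, 0.5 ** 0.5) for _ in range(n)]


def overconfident_posterior(x, n, rng):
    # Same mean, but half the correct standard deviation.
    return [rng.gauss(x / 2, 0.5 * 0.5 ** 0.5) for _ in range(n)]


dev_exact = max_deviation(*pp_curve(normalised_ranks(exact_posterior)))
dev_over = max_deviation(*pp_curve(normalised_ranks(overconfident_posterior)))
print(f"max deviation: exact={dev_exact:.3f}, overconfident={dev_over:.3f}")
```

For the overconfident sampler the true value lands in the tails too often, so the curve bows away from the diagonal and the maximum deviation grows.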
Precision: distributional distances

KL Divergence
Asymmetric information cost of using the approximation.

Measures the average extra information (bits for base-2 logarithms, nats for natural logarithms) needed to encode samples from $p$ when using a code optimised for $\hat{q}$ instead. It is asymmetric, since in general $\mathrm{KL}(p\|\hat{q})\neq\mathrm{KL}(\hat{q}\|p)$, and it becomes infinite when $\hat{q}$ assigns zero mass to regions where $p>0$.

$$\mathrm{KL}(p\|\hat{q})=\mathbb{E}_{p}\!\left[\log\frac{p(\theta)}{\hat{q}(\theta)}\right]$$
Kullback & Leibler (1951)
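
When both log-densities can be evaluated, the expectation above can be approximated by a plain Monte Carlo average over samples from $p$. The sketch below assumes 1-D Gaussians for both $p$ and $\hat{q}$ so that the estimate can be checked against the closed-form Gaussian KL. The function names are hypothetical, and results are in nats.

```python
import math
import random


def gaussian_logpdf(theta, mean, std):
    """Log-density of a 1-D Gaussian."""
    z = (theta - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2 * math.pi)


def mc_kl(sample_p, logp, logq, n=20000, seed=0):
    """Monte Carlo estimate of KL(p || q) = E_p[log p - log q], in nats."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        theta = sample_p(rng)
        total += logp(theta) - logq(theta)
    return total / n


def gaussian_kl(m1, s1, m2, s2):
    """Closed-form KL(N(m1, s1^2) || N(m2, s2^2)), used only as a check."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5


# Toy choice (an assumption): p = N(0, 1) and q_hat = N(0.5, 1.5^2).
logp = lambda t: gaussian_logpdf(t, 0.0, 1.0)
logq = lambda t: gaussian_logpdf(t, 0.5, 1.5)

kl_pq = mc_kl(lambda rng: rng.gauss(0.0, 1.0), logp, logq)
kl_qp = mc_kl(lambda rng: rng.gauss(0.5, 1.5), logq, logp)
print(f"KL(p||q) ~ {kl_pq:.3f} (exact {gaussian_kl(0.0, 1.0, 0.5, 1.5):.3f})")
print(f"KL(q||p) ~ {kl_qp:.3f} (exact {gaussian_kl(0.5, 1.5, 0.0, 1.0):.3f})")
```

The two directions give different values, which illustrates the asymmetry. When only samples are available, the densities would first have to be estimated, and that step is outside the scope of this sketch.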
JSD: Jensen–Shannon
Symmetric, bounded divergence via a mixture midpoint.

The Jensen–Shannon divergence symmetrises KL by averaging the divergences of $p$ and $\hat{q}$ from their mixture $M=(p+\hat{q})/2$. Unlike KL, it is bounded in $[0,\log 2]$, making it easier to compare across different benchmarks and parameter spaces.

$$\begin{aligned} \mathrm{JSD}(p\|\hat{q}) &= \tfrac{1}{2}\mathrm{KL}(p\|M)+\tfrac{1}{2}\mathrm{KL}(\hat{q}\|M)\\[3pt] M &= \tfrac{p+\hat{q}}{2} \end{aligned}$$
Lin (1991)
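
For distributions represented as histograms over shared bins, the definition translates almost line by line into code. This is a minimal sketch that assumes discrete probability vectors. How posterior samples are binned or otherwise turned into densities is not specified here, and the example vectors are made up for illustration.

```python
import math


def kl(p, q):
    """Discrete KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence via the mixture M = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy probability vectors over three shared bins (illustrative values only).
p = [0.1, 0.4, 0.5]
q_hat = [0.3, 0.3, 0.4]

jsd_pq = jsd(p, q_hat)
jsd_qp = jsd(q_hat, p)
jsd_same = jsd(p, p)
jsd_disjoint = jsd([1.0, 0.0], [0.0, 1.0])  # no overlap: the upper bound log 2
print(jsd_pq, jsd_qp, jsd_same, jsd_disjoint)
```

Because the mixture is positive wherever either input is, every logarithm is finite, which is why the JSD stays bounded even when the two supports do not overlap.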
IS-ESS
Effective sample size under importance reweighting.

Reweight samples $\theta_i\sim\hat{q}$ by $w_i\propto p(\theta_i)/\hat{q}(\theta_i)$. When $\hat{q}\approx p$, all weights are roughly equal and $\mathrm{ESS}\approx N$. Degenerate weights, where a few samples carry most of the total weight, indicate poor overlap and distributional mismatch.

$$\mathrm{ESS}=\frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}\in[1,N]$$
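
The ESS formula is usually evaluated from log-weights for numerical stability. The sketch below assumes that log-densities of both $p$ and $\hat{q}$ are available at the samples, and uses toy 1-D Gaussians. `is_ess` and `gaussian_logpdf` are hypothetical names. Because the ratio is scale-invariant, unnormalised densities work as well.

```python
import math
import random


def is_ess(log_p, log_q):
    """Importance-sampling ESS from log-densities evaluated at samples of q."""
    log_w = [lp - lq for lp, lq in zip(log_p, log_q)]
    shift = max(log_w)  # subtract the maximum so exp() cannot overflow
    w = [math.exp(lw - shift) for lw in log_w]
    return sum(w) ** 2 / sum(wi * wi for wi in w)


def gaussian_logpdf(theta, mean, std):
    """Log-density of a 1-D Gaussian."""
    z = (theta - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2 * math.pi)


rng = random.Random(0)
n = 2000

# Case 1 (toy): q_hat equals p = N(0, 1), so every weight is identical.
good = [rng.gauss(0.0, 1.0) for _ in range(n)]
ess_good = is_ess([gaussian_logpdf(t, 0.0, 1.0) for t in good],
                  [gaussian_logpdf(t, 0.0, 1.0) for t in good])

# Case 2 (toy): q_hat = N(3, 1) barely overlaps with p = N(0, 1).
poor = [rng.gauss(3.0, 1.0) for _ in range(n)]
ess_poor = is_ess([gaussian_logpdf(t, 0.0, 1.0) for t in poor],
                  [gaussian_logpdf(t, 3.0, 1.0) for t in poor])

print(f"ESS good: {ess_good:.1f} / {n}, ESS poor: {ess_poor:.1f} / {n}")
```

In the matched case the ESS equals the number of samples, while in the poorly overlapping case a handful of samples dominate and the ESS collapses towards its lower bound of 1.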