Evaluation

We use a small suite of metrics to check the accuracy, calibration, and precision of approximate posteriors, for both simulation-based inference (SBI) and classical samplers.

How evaluation works

Submissions are evaluated against reference posteriors computed by the benchmark maintainers. In general, this means you provide posterior samples for each test observation, and we compare them to the reference using the metrics below.

The evaluation is designed to be method-agnostic: it should work for amortised SBI, hybrid SBI + MCMC, and traditional samplers alike.

If you want the submission workflow details, see Submit.

Metrics

No single number can certify a posterior. Different failure modes require different lenses: a method can produce samples that look like the reference on average while being systematically overconfident, or it can be calibrated but diffuse. We therefore organise evaluation around three complementary questions, each targeting a distinct failure mode.

✅ Accuracy

Does the method get roughly the right answer? Measures whether approximate posterior samples are statistically indistinguishable from a reference.

C2ST

📐 Robustness / Calibration

Is the posterior well-calibrated — neither overconfident nor underconfident? Credible intervals should contain the truth at the right frequency.

Expected coverage, P–P plots

🎯 Precision

How close is the approximate posterior to the true one? Distributional distances give a finer-grained comparison beyond pass/fail accuracy.

KL divergence, JSD, IS-ESS

✅ Accuracy: classifier-based tests

Accuracy

C2ST — Classifier Two-Sample Test

Can a classifier tell your samples from the reference?

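Train a binary classifier to separate samples from the approximate posterior $\hat{q}$ from samples from the reference $p$. If the two distributions match, the classifier can do no better than random guessing, and its held-out accuracy converges to 0.5. Accuracy close to 1 means the two sample sets are easy to tell apart.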

Lopez-Paz & Oquab (2017)
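As a rough illustration of the idea, the sketch below runs a classifier two-sample test with a small k-nearest-neighbour classifier written in plain Python. The function name `c2st_knn`, the choice of classifier, the 50/50 train/test split, and the toy Gaussian samples are all assumptions made for this example; the benchmark's actual classifier and settings are not specified here.

```python
import random

def c2st_knn(samples_q, samples_p, k=5, seed=0):
    """Held-out accuracy of a k-NN classifier separating q-samples (label 0)
    from p-samples (label 1). Hypothetical helper, for illustration only."""
    rng = random.Random(seed)
    data = [(x, 0) for x in samples_q] + [(x, 1) for x in samples_p]
    rng.shuffle(data)
    split = len(data) // 2
    train, test = data[:split], data[split:]
    correct = 0
    for x, label in test:
        # Rank training points by squared Euclidean distance to the test point.
        nearest = sorted(
            train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x))
        )[:k]
        votes = sum(lbl for _, lbl in nearest)
        pred = 1 if 2 * votes > k else 0  # majority vote (k is odd, so no ties)
        correct += pred == label
    return correct / len(test)

# Toy data (assumption): 2-D standard normal reference.
rng = random.Random(1)
reference = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
matching = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
shifted = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(300)]

acc_matching = c2st_knn(matching, reference)  # expected to sit near 0.5
acc_shifted = c2st_knn(shifted, reference)    # expected to sit near 1.0
print(acc_matching, acc_shifted)
```

Scoring on held-out data matters: accuracy measured on the training points would overstate how distinguishable the two sample sets are.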

📐 Robustness / Calibration: coverage checks

Calibration

Expected Coverage

Do credible intervals contain the truth at the right rate?

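At nominal level $\alpha$, the fraction of test events whose true parameter $\theta_i^*$ falls inside the credible region $C_\alpha(x_i)$ should equal $\alpha$. Plotted against $\alpha$, the empirical coverage of a well-calibrated posterior traces the diagonal; the gap between the curve and the diagonal reveals systematic overconfidence (curve below the diagonal, regions too narrow) or underconfidence (curve above it).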

$$\mathrm{Coverage}(\alpha)=\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[\theta_i^*\in C_\alpha(x_i)\right]$$

Lemos et al. (2023) — TARP
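The coverage formula above can be sketched for a one-dimensional parameter as follows. This is a minimal illustration, not the benchmark's implementation: it uses central intervals from sample quantiles as the credible region $C_\alpha$, whereas TARP constructs regions differently, and the toy conjugate-Gaussian model and the names `central_interval`, `expected_coverage`, and `simulate` are assumptions introduced here.

```python
import random

def central_interval(samples, alpha):
    """Approximate central credible interval at level alpha from 1-D samples."""
    s = sorted(samples)
    n = len(s)
    lo = s[int((1 - alpha) / 2 * (n - 1))]
    hi = s[int((1 + alpha) / 2 * (n - 1))]
    return lo, hi

def expected_coverage(truths, posterior_samples, alpha):
    """Fraction of test events whose true parameter lies inside the interval."""
    hits = 0
    for theta_star, samples in zip(truths, posterior_samples):
        lo, hi = central_interval(samples, alpha)
        hits += lo <= theta_star <= hi
    return hits / len(truths)

def simulate(n_events, n_samples, width_scale, seed):
    """Toy model (assumption): theta ~ N(0, 1), x | theta ~ N(theta, 1),
    so the exact posterior is N(x / 2, 1 / 2). width_scale < 1 produces an
    overconfident approximation with the correct mean."""
    rng = random.Random(seed)
    truths, posteriors = [], []
    for _ in range(n_events):
        theta = rng.gauss(0, 1)
        x = rng.gauss(theta, 1)
        sd = width_scale * 0.5 ** 0.5
        truths.append(theta)
        posteriors.append([rng.gauss(x / 2, sd) for _ in range(n_samples)])
    return truths, posteriors

truths, posteriors = simulate(500, 200, width_scale=1.0, seed=0)
cov_calibrated = expected_coverage(truths, posteriors, alpha=0.9)

truths, posteriors = simulate(500, 200, width_scale=0.5, seed=0)
cov_overconfident = expected_coverage(truths, posteriors, alpha=0.9)

print(cov_calibrated, cov_overconfident)
```

Repeating the calculation over a grid of $\alpha$ values gives the full coverage curve; the overconfident variant should fall below the diagonal at every level.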

Calibration

P–P Plots

Are posterior rank statistics uniformly distributed?

For each parameter, rank the true value among the posterior samples. Under a calibrated posterior these ranks are uniformly distributed, so their empirical CDF (solid line) should trace the diagonal. Deviations reveal per-parameter miscalibration.

Talts et al. (2018) — Simulation-Based Calibration
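A minimal sketch of the rank computation behind a P–P plot, again for one parameter. The toy model, the helper names, and the use of the maximum vertical distance from the diagonal as a one-number summary are assumptions for illustration; the benchmark may summarise or plot the ranks differently.

```python
import random

def normalized_ranks(truths, posterior_samples):
    """Fraction of posterior samples lying below the true value, per test event."""
    return [
        sum(s < t for s in samples) / len(samples)
        for t, samples in zip(truths, posterior_samples)
    ]

def max_deviation_from_diagonal(ranks):
    """Largest vertical gap between the empirical CDF of the ranks and the
    diagonal (a Kolmogorov-Smirnov style distance to the uniform distribution)."""
    r = sorted(ranks)
    n = len(r)
    return max(
        max(abs((i + 1) / n - u), abs(i / n - u)) for i, u in enumerate(r)
    )

def simulate(n_events, n_samples, width_scale, seed):
    """Toy model (assumption): theta ~ N(0, 1), x | theta ~ N(theta, 1),
    exact posterior N(x / 2, 1 / 2); width_scale < 1 is overconfident."""
    rng = random.Random(seed)
    truths, posteriors = [], []
    for _ in range(n_events):
        theta = rng.gauss(0, 1)
        x = rng.gauss(theta, 1)
        sd = width_scale * 0.5 ** 0.5
        truths.append(theta)
        posteriors.append([rng.gauss(x / 2, sd) for _ in range(n_samples)])
    return truths, posteriors

truths, posteriors = simulate(500, 200, width_scale=1.0, seed=0)
dev_calibrated = max_deviation_from_diagonal(normalized_ranks(truths, posteriors))

truths, posteriors = simulate(500, 200, width_scale=0.3, seed=0)
dev_overconfident = max_deviation_from_diagonal(normalized_ranks(truths, posteriors))

print(dev_calibrated, dev_overconfident)
```

For the overconfident posterior the true value lands in the tails too often, so the ranks pile up near 0 and 1 and the empirical CDF takes an S-shape around the diagonal.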

🎯 Precision: distributional distances

Precision

KL Divergence

Asymmetric information cost of using the approximation.

Measures the average extra information (in nats with the natural logarithm, in bits with $\log_2$) needed to encode samples from $p$ when using a code optimised for $\hat{q}$ instead. It is asymmetric, with $\mathrm{KL}(p\|\hat{q})\neq\mathrm{KL}(\hat{q}\|p)$ in general, and infinite when $\hat{q}$ assigns zero mass to regions where $p>0$.

$$\mathrm{KL}(p\|\hat{q})=\mathbb{E}_{p}\!\left[\log\frac{p(\theta)}{\hat{q}(\theta)}\right]$$

Kullback & Leibler (1951)
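To make the expectation above concrete, the sketch below estimates $\mathrm{KL}(p\|\hat{q})$ by Monte Carlo for two one-dimensional Gaussians and compares it with the closed form. It assumes both log-densities can be evaluated, which may not hold for every method; the Gaussian parameters and function names are illustrative choices, and values are in nats.

```python
import math
import random

def gauss_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def kl_monte_carlo(samples_from_p, log_p, log_q):
    """Average of log p - log q over samples drawn from p."""
    return sum(log_p(x) - log_q(x) for x in samples_from_p) / len(samples_from_p)

def kl_gauss_exact(mu_a, s_a, mu_b, s_b):
    """Closed-form KL(N(mu_a, s_a^2) || N(mu_b, s_b^2)) in nats."""
    return math.log(s_b / s_a) + (s_a ** 2 + (mu_a - mu_b) ** 2) / (2 * s_b ** 2) - 0.5

# Illustrative choice (assumption): p = N(0, 1), q_hat = N(0.5, 1.5^2).
mu_p, s_p, mu_q, s_q = 0.0, 1.0, 0.5, 1.5
rng = random.Random(0)
draws = [rng.gauss(mu_p, s_p) for _ in range(20000)]

kl_mc = kl_monte_carlo(
    draws,
    lambda x: gauss_logpdf(x, mu_p, s_p),
    lambda x: gauss_logpdf(x, mu_q, s_q),
)
kl_forward = kl_gauss_exact(mu_p, s_p, mu_q, s_q)  # KL(p || q_hat)
kl_reverse = kl_gauss_exact(mu_q, s_q, mu_p, s_p)  # KL(q_hat || p)
print(kl_mc, kl_forward, kl_reverse)
```

The two directions give different values for the same pair of distributions, which is the asymmetry described above.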

Precision

JSD — Jensen–Shannon

Symmetric, bounded divergence via a mixture midpoint.

The Jensen–Shannon divergence symmetrises KL by averaging the divergences of $p$ and $\hat{q}$ from their mixture $M=(p+\hat{q})/2$. Unlike KL it is bounded in $[0,\log 2]$ (with the natural logarithm), making it easier to compare across different benchmarks and parameter spaces.

$$\begin{aligned} \mathrm{JSD}(p\|\hat{q}) &= \tfrac{1}{2}\mathrm{KL}(p\|M)+\tfrac{1}{2}\mathrm{KL}(\hat{q}\|M)\\[3pt] M &= \tfrac{p+\hat{q}}{2} \end{aligned}$$

Lin (1991)
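The definition above can be checked on small discrete distributions, where the sums are exact. This is a sketch under the assumption that the distributions are given as probability vectors over the same bins (for example, normalised histograms of samples); the function names are hypothetical and the benchmark's estimator for continuous posteriors is not specified here.

```python
import math

def kl_discrete(p, q):
    """KL(p || q) in nats for probability vectors; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture M = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # m_i > 0 wherever p_i > 0 or q_i > 0, so both KL terms are finite.
    return 0.5 * kl_discrete(p, m) + 0.5 * kl_discrete(q, m)

identical = jsd([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])  # lower bound: 0
disjoint = jsd([1.0, 0.0], [0.0, 1.0])             # upper bound: log 2
a, b = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
forward, backward = jsd(a, b), jsd(b, a)           # equal by symmetry
print(identical, disjoint, forward, backward)
```

Distributions with disjoint support hit the upper bound $\log 2$ exactly, whereas KL would be infinite in that case.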

Precision

IS-ESS

Effective sample size under importance reweighting.

Reweight samples $\theta_i\sim\hat{q}$ by $w_i\propto p(\theta_i)/\hat{q}(\theta_i)$. When $\hat{q}\approx p$, all weights are roughly equal and ESS ≈ $N$. Degenerate weights, where a few samples carry most of the total weight, push ESS towards 1 and indicate poor overlap and distributional mismatch.

$$\mathrm{ESS}=\frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}\in[1,N]$$
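A short sketch of the ESS formula, computed from log-densities so that very small weights do not cause numerical trouble. The function name `is_ess` and the example inputs are assumptions for illustration; the formula is invariant to rescaling the weights, so unnormalised log-densities are sufficient.

```python
import math

def is_ess(log_p, log_q):
    """Effective sample size of importance weights w_i = p(theta_i) / q(theta_i),
    given per-sample log-densities. Hypothetical helper, for illustration only."""
    log_w = [lp - lq for lp, lq in zip(log_p, log_q)]
    shift = max(log_w)  # subtracting the max keeps exp() from overflowing
    w = [math.exp(lw - shift) for lw in log_w]
    return sum(w) ** 2 / sum(x * x for x in w)

# Equal weights: every sample counts fully, so ESS equals N.
ess_equal = is_ess([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])

# One sample dominates: ESS collapses to about 1.
ess_degenerate = is_ess([0.0, -1000.0, -1000.0], [0.0, 0.0, 0.0])

print(ess_equal, ess_degenerate)
```

Dividing ESS by $N$ gives a sampling efficiency in $(0, 1]$ that is easier to compare across runs with different sample counts.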