Gravitational-Wave SBI Benchmark

Standardised benchmarks for simulation-based inference (SBI) on gravitational-wave parameter estimation.

Who is this benchmark for?

Built for researchers who develop inference methods and for researchers who apply them in gravitational-wave science, with a clear ladder from easy settings to more production-like ones.

For ML & SBI researchers

A standardised suite with multiple levels — so methods are comparable, and progress is measurable.

• A reproducible, levelled benchmark to compare methods fairly and understand failure modes.
• Evaluated with purpose-built metrics for accuracy, calibration, and robustness, scored against reference posteriors.
• GW inference is a rich SBI playground: degeneracies, multimodal structure, and realistic simulator constraints.
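As a rough illustration of what scoring against a reference posterior can look like, the sketch below estimates the Jensen-Shannon divergence (JSD) between the 1D marginals of two sample sets using shared histogram bins. The function names, the bin count, and the histogram estimator are all assumptions made for illustration; the benchmark's actual metric implementation may differ.

```python
import math
import random


def histogram(samples, lo, hi, bins):
    """Normalised histogram of 1D samples on a shared grid over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in samples:
        idx = min(int((x - lo) / width), bins - 1)  # clamp x == hi into last bin
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def jsd(p, q):
    """Jensen-Shannon divergence (in nats) between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the sum.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def marginal_jsd(submitted, reference, bins=50):
    """Hypothetical helper: JSD between two 1D sample sets (bins is an assumption)."""
    lo = min(min(submitted), min(reference))
    hi = max(max(submitted), max(reference))
    if hi == lo:
        return 0.0  # all samples identical, so the two marginals coincide
    return jsd(histogram(submitted, lo, hi, bins),
               histogram(reference, lo, hi, bins))


# Toy data: a standard normal "reference posterior" and two mock submissions.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]
good = [rng.gauss(0.0, 1.0) for _ in range(5000)]
biased = [rng.gauss(1.0, 1.0) for _ in range(5000)]

print("good  :", marginal_jsd(good, reference))
print("biased:", marginal_jsd(biased, reference))
```

The JSD is bounded between 0 and ln 2 in nats, so a submission that matches the reference scores near 0 while a biased one scores visibly higher.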

For gravitational-wave scientists

Benchmarks close to your research that ramp from “easy to run” toward LVK/LISA-style parameter estimation.

• Tasks start simple and move toward the constraints of LVK/LISA-style parameter estimation.
• Higher levels aim to be close to production workflows, so you can stress-test implementations for production readiness.
• Transparent evaluation makes it easier to decide what to trust and what needs work.

What you get

Everything you need to run, compare, and trust inference methods — on a problem that actually matters.

Automated blind evaluation

Submit posterior samples and receive scores automatically — blind, reproducible, and immediately comparable to every other submission.

A suite of optimised metrics

Accuracy, calibration, and robustness — a carefully chosen set that captures what actually matters, not just a single score.
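To make the calibration idea concrete, here is a minimal sketch of an expected-coverage check for a single parameter. For each simulated observation it records the quantile of the true value within the posterior samples, then compares how often central credible intervals contain the truth against their nominal level. The toy setup (a Gaussian prior with no data, so the calibrated "posterior" is just the prior), the function names, and the grid of levels are assumptions for illustration, not the benchmark's scoring code.

```python
import random


def truth_quantile(posterior_samples, truth):
    """Fraction of posterior samples that fall below the true parameter value."""
    below = sum(1 for s in posterior_samples if s < truth)
    return below / len(posterior_samples)


def coverage_error(quantiles, levels):
    """Mean absolute gap between empirical and nominal central-interval coverage."""
    gaps = []
    for alpha in levels:
        # The central interval of mass alpha contains the truth iff |2q - 1| <= alpha.
        covered = sum(1 for q in quantiles if abs(2 * q - 1) <= alpha)
        gaps.append(abs(covered / len(quantiles) - alpha))
    return sum(gaps) / len(gaps)


def simulate(posterior_sigma, n_obs=300, n_samples=200, seed=0):
    """Toy experiment: truths drawn from N(0, 1), 'posterior' is N(0, posterior_sigma)."""
    rng = random.Random(seed)
    quantiles = []
    for _ in range(n_obs):
        truth = rng.gauss(0.0, 1.0)
        samples = [rng.gauss(0.0, posterior_sigma) for _ in range(n_samples)]
        quantiles.append(truth_quantile(samples, truth))
    return quantiles


levels = [i / 20 for i in range(1, 20)]  # nominal levels 0.05, 0.10, ..., 0.95
calibrated = coverage_error(simulate(1.0), levels)
overconfident = coverage_error(simulate(0.3), levels)

print("calibrated   :", calibrated)
print("overconfident:", overconfident)
```

A well-calibrated method keeps the gap close to zero at every level, while a posterior that is too narrow covers the truth far less often than it claims.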

A clear level ladder

Start simple for quick iteration and debugging; climb toward harder, production-style parameter-estimation (PE) tasks as your method matures.

Real-world science

Gravitational-wave parameter estimation is a genuine, hard inverse problem — not a synthetic toy, but a benchmark that means something in practice.

Comparable · Calibrated · Levelled · Real-world

Leading Results

Top submissions on Level 0: aligned-spin binary black holes (BBH)

#	Team	Method	C2ST↓	Expected coverage↓	KL divergence↓	JSD↓	IS-ESS↑
1	[Demo] Posterior Pioneers	Neural flow matching	0.480	0.060	0.110	0.035	9.40e+2
2	[Demo] Nested Nebula Crew	MultiNest-style nested sampling	0.550	0.110	0.280	0.092	4.10e+2
3	[Demo] Slice & Dice Inference	Mean-field VI + normalizing flow	0.620	0.180	0.450	0.140	2.20e+2
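For readers unfamiliar with the IS-ESS column, the sketch below computes the standard Kish effective sample size, (Σw)² / Σw², from log importance weights. It assumes that IS-ESS here refers to this quantity, with each weight being the ratio of the target density to the density of the submitted posterior at a sample; the page does not spell out the definition, and the function name is hypothetical.

```python
import math


def importance_ess(log_weights):
    """Kish effective sample size (sum w)^2 / sum w^2 from log importance weights."""
    shift = max(log_weights)  # subtract the max so exp() cannot overflow
    weights = [math.exp(lw - shift) for lw in log_weights]
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Equal weights: every sample counts fully.
uniform = importance_ess([0.0] * 1000)

# One dominant weight: the sample set behaves like a single draw.
degenerate = importance_ess([0.0] + [-50.0] * 999)

print("uniform   :", uniform)
print("degenerate:", degenerate)
```

Under this definition the value ranges from 1 (one sample carries all the weight) up to the number of samples (all weights equal), which is why higher is better.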

Explore the Benchmark

Leaderboards

Explore results across all benchmark tasks and metrics.

Benchmarks

Learn about the inference tasks and data generation.

Evaluation

Metrics, scoring procedures, and calibration tests.

Submit

Submit your method and join the benchmark.