Gravitational-Wave SBI Benchmark

Standardised benchmarks for simulation-based inference on gravitational-wave parameter estimation.

Who is this benchmark for?

Built for researchers who develop inference methods and for those who apply them in gravitational-wave science, with a clear ladder from easy settings to more production-like ones.

For ML & SBI researchers

A standardised suite with multiple levels — so methods are comparable, and progress is measurable.

  • A reproducible, levelled benchmark to compare methods fairly and understand failure modes.
  • Submissions are evaluated against reference posteriors with purpose-built, optimised metrics covering accuracy, calibration, and robustness.
  • GW inference is a rich SBI playground: degeneracies, multimodal structure, and realistic simulator constraints.
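To make "accuracy against reference posteriors" concrete, here is a minimal sketch of one possible accuracy measure: the Jensen-Shannon divergence between a submitted and a reference set of one-dimensional samples, estimated on shared histogram bins. The function names `histogram` and `jensen_shannon_divergence`, the binning scheme, and the toy Gaussian samples are assumptions made for illustration; the benchmark's own implementation is not described here and may differ.

```python
import math
import random


def histogram(samples, lo, hi, n_bins):
    """Return normalised bin probabilities for samples on [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        # Clamp the index so out-of-range values and x == hi land in an edge bin.
        idx = min(max(int((x - lo) / width), 0), n_bins - 1)
        counts[idx] += 1
    total = len(samples)
    return [c / total for c in counts]


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 wherever a_i > 0
        # because b is always the mixture of the two inputs.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins for a reference posterior and a slightly biased submission.
    reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    submission = [rng.gauss(0.3, 1.0) for _ in range(5000)]
    p = histogram(reference, -5.0, 5.0, 40)
    q = histogram(submission, -5.0, 5.0, 40)
    print("JSD (bits):", jensen_shannon_divergence(p, q))
```

With base-2 logarithms the divergence lies between 0 (identical histograms) and 1 (no overlap), which makes scores easy to compare across parameters. A histogram estimate is only practical in low dimensions, so treat this as a per-parameter diagnostic rather than a full posterior comparison.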

For gravitational-wave scientists

Benchmarks close to your research that ramp from “easy to run” toward LVK/LISA-style parameter estimation.

  • Levels start simple and progressively add the constraints typical of LVK/LISA-style parameter estimation.
  • Higher levels aim to be close to production workflows, so you can stress-test implementations for production readiness.
  • Transparent evaluation makes it easier to decide what to trust, and what needs work.

What you get

Everything you need to run, compare, and trust inference methods — on a problem that actually matters.

Automated blind evaluation

Submit posterior samples and receive scores automatically — blind, reproducible, and immediately comparable to every other submission.

A suite of optimised metrics

Accuracy, calibration, and robustness — a carefully chosen set that captures what actually matters, not just a single score.
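As an illustration of the calibration part of that set, the sketch below estimates empirical coverage: across many simulated signals, how often the true parameter value falls inside the central credible interval of the corresponding posterior samples. For a well-calibrated method this fraction should be close to the nominal level. The helper names `central_interval` and `empirical_coverage` and the toy Gaussian model are assumptions for illustration only, not the benchmark's actual scoring code.

```python
import random


def central_interval(samples, level):
    """Return the equal-tailed interval holding roughly `level` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lo_idx = round(tail * (n - 1))
    hi_idx = n - 1 - lo_idx  # mirror the lower index to keep the tails equal
    return ordered[lo_idx], ordered[hi_idx]


def empirical_coverage(posteriors, truths, level):
    """Fraction of cases whose true value lies inside the central interval."""
    hits = 0
    for samples, truth in zip(posteriors, truths):
        lo, hi = central_interval(samples, level)
        if lo <= truth <= hi:
            hits += 1
    return hits / len(truths)


if __name__ == "__main__":
    rng = random.Random(0)
    truths, posteriors = [], []
    for _ in range(200):
        theta = rng.gauss(0.0, 1.0)      # truth drawn from the prior N(0, 1)
        x = theta + rng.gauss(0.0, 1.0)  # noisy observation with unit variance
        # The exact conjugate posterior for this toy model is N(x / 2, 1 / 2).
        posteriors.append([rng.gauss(x / 2, 0.5 ** 0.5) for _ in range(500)])
        truths.append(theta)
    for level in (0.5, 0.9):
        print(level, empirical_coverage(posteriors, truths, level))
```

Because the toy posterior is exact, the printed coverage should sit near each nominal level up to sampling noise; an overconfident method would fall below it and an underconfident one above.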

A clear level ladder

Start simple for quick iteration and debugging; climb toward harder, production-style parameter estimation (PE) tasks as your method matures.

Real-world science

Gravitational-wave parameter estimation is a genuine, hard inverse problem, not a synthetic toy, so benchmark results mean something in practice.

Comparable · Calibrated · Levelled · Real-world

Leading Results

Top submissions on Level 0 — aligned-spin BBH

# | Team                          | Method                           | C2ST  | Expected coverage | KL divergence | JSD   | IS-ESS
1 | [Demo] Posterior Pioneers     | Neural flow matching             | 0.480 | 0.060             | 0.110         | 0.035 | 9.40e+2
2 | [Demo] Nested Nebula Crew     | MultiNest-style nested sampling  | 0.550 | 0.110             | 0.280         | 0.092 | 4.10e+2
3 | [Demo] Slice & Dice Inference | Mean-field VI + normalizing flow | 0.620 | 0.180             | 0.450         | 0.140 | 2.20e+2
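
For readers unfamiliar with the IS-ESS column, the sketch below shows one common way to compute an importance-sampling effective sample size: the Kish estimator, (sum of weights) squared divided by the sum of squared weights. It is an assumption that the leaderboard uses this estimator, since its exact definition is not stated here. The function name `importance_sampling_ess` is hypothetical, and the log-weights are taken to be log(target density) minus log(proposal density) for each submitted sample.

```python
import math


def importance_sampling_ess(log_weights):
    """Kish effective sample size, (sum w)^2 / sum(w^2), from log-weights."""
    if not log_weights:
        raise ValueError("log_weights must not be empty")
    # Subtract the maximum before exponentiating for numerical stability;
    # the ESS is unchanged by a common rescaling of all weights.
    shift = max(log_weights)
    weights = [math.exp(lw - shift) for lw in log_weights]
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


if __name__ == "__main__":
    # Equal weights: every sample counts fully.
    print(importance_sampling_ess([0.0] * 1000))
    # One dominant weight: the set behaves like a single sample.
    print(importance_sampling_ess([0.0] + [-50.0] * 999))
```

The estimator ranges from 1 (one sample carries all the weight) up to the number of samples (all weights equal), so a higher value indicates samples that match the target distribution more closely.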
