Who is this benchmark for?
Built for researchers who develop inference methods and researchers who use them in gravitational-wave science — with a clear ladder from easy to more production-like settings.
For ML & SBI researchers
A standardised suite with multiple levels — so methods are comparable, and progress is measurable.
- A reproducible, levelled benchmark to compare methods fairly and understand failure modes.
- Evaluated against reference posteriors with purpose-built metrics covering accuracy, calibration, and robustness.
- GW inference is a rich SBI playground: degeneracies, multimodal structure, and realistic simulator constraints.
For gravitational-wave scientists
Benchmarks close to your research that ramp from “easy to run” toward LVK/LISA-style parameter estimation.
- Benchmarks that start simple but move toward LVK/LISA-style parameter estimation constraints.
- Higher levels aim to be close to production workflows, so you can stress-test implementations for production readiness.
- Transparent evaluation makes it easier to decide what to trust, and what needs work.
What you get
Everything you need to run, compare, and trust inference methods — on a problem that actually matters.
Automated blind evaluation
Submit posterior samples and receive scores automatically — blind, reproducible, and immediately comparable to every other submission.
A suite of optimised metrics
Accuracy, calibration, and robustness — a carefully chosen set that captures what actually matters, not just a single score.
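As an illustration of how a divergence metric such as the JSD column on the leaderboard could be estimated from samples, here is a minimal sketch. It is not the benchmark's scoring code: the function names (`histogram`, `jsd`, `marginal_jsd`), the one-dimensional histogram estimator, and the default bin count are all assumptions made for this example.

```python
import math


def histogram(samples, lo, hi, bins):
    """Normalised bin probabilities for samples on the closed range [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in samples:
        # Clamp so that x == hi falls into the last bin.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(samples) for c in counts]


def jsd(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 wherever a_i > 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def marginal_jsd(submitted, reference, bins=20):
    """Hypothetical per-parameter score: JSD between two 1D sample sets."""
    lo = min(min(submitted), min(reference))
    hi = max(max(submitted), max(reference))
    if hi == lo:
        return 0.0  # every sample is identical, so the marginals agree
    return jsd(histogram(submitted, lo, hi, bins),
               histogram(reference, lo, hi, bins))
```

With base-2 logarithms the score lies between 0 (identical histograms) and 1 (no shared bins), which is why lower is better. A histogram estimate depends on the bin count and sample size, so scores are only comparable when both are held fixed.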
A clear level ladder
Start simple for quick iteration and debugging; climb toward harder, production-style parameter estimation tasks as your method matures.
Real-world science
Gravitational-wave parameter estimation is a genuine, hard inverse problem — not a synthetic toy, but a benchmark that means something in practice.
| # | Team | Method | C2ST↓ | Expected coverage↓ | KL divergence↓ | JSD↓ | IS-ESS↑ |
|---|---|---|---|---|---|---|---|
| 1 | [Demo] Posterior Pioneers | Neural flow matching | 0.480 | 0.060 | 0.110 | 0.035 | 9.40e+2 |
| 2 | [Demo] Nested Nebula Crew | MultiNest-style nested sampling | 0.550 | 0.110 | 0.280 | 0.092 | 4.10e+2 |
| 3 | [Demo] Slice & Dice Inference | Mean-field VI + normalizing flow | 0.620 | 0.180 | 0.450 | 0.140 | 2.20e+2 |
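
The IS-ESS column is not defined on this page. Assuming it refers to the effective sample size of importance weights, a common estimator is the Kish formula, (Σw)² / Σw². The sketch below is an illustration under that assumption; the function name `importance_ess` and the choice of log-weights as input are hypothetical, not the benchmark's API.

```python
import math


def importance_ess(log_weights):
    """Kish effective sample size from unnormalised log importance weights."""
    if not log_weights:
        raise ValueError("need at least one weight")
    # Subtract the maximum before exponentiating to avoid overflow;
    # the ESS is unchanged by a constant shift of the log-weights.
    shift = max(log_weights)
    weights = [math.exp(lw - shift) for lw in log_weights]
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq
```

Under this definition the value ranges from 1 (a single sample carries all the weight) to the number of samples (all weights equal), so higher is better.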