Benchmark tasks

A levelled benchmark suite for simulation-based inference on gravitational-wave parameter estimation — starting simple, building toward production-style realism.

Task sheets, data releases, and evaluation servers are still being finalised. Nothing here should be read as a frozen specification until the first public package and leaderboard launch.

Three experimental tracks

Three distinct experimental regimes — each with its own detector geometry, data format, and inference challenges. We start with LVK; LISA and PTA follow as the benchmark matures.

LVK — ground-based

Active

Short strain bursts from Earth-based interferometers. The first benchmark levels live here: fast iteration, a shared simulator, and clear reference posteriors.

LISA — space-based

Planning

Heliocentric spacecraft constellation targeting massive black-hole binaries. Longer signals, higher SNR, a different sky-localisation geometry — Level 0 scope is under active design.

PTA — pulsar timing

Outlook

An ensemble of millisecond pulsars acting as a galaxy-scale detector. A fundamentally different data format — timing residuals and correlation matrices — with early benchmark ideas being explored.

What the benchmark is about

When two black holes merge, they produce a gravitational wave — a ripple in spacetime that stretches and compresses interferometer arms by a fraction of a proton diameter. Detectors record this as a noisy strain time series. The inference task is to recover the posterior distribution over source parameters (masses, spins, sky position, distance, orientation) given that data.

Because evaluating the exact likelihood requires expensive waveform simulations, the community invests heavily in simulation-based inference: train a neural network or other learned component once on simulations, then run inference on any new event in milliseconds. These benchmarks are designed to let researchers compare those methods on equal footing — standardised data formats, shared simulators, and blind evaluation against reference posteriors.

The level ladder starts with a clean five-parameter problem and progressively adds the complexity of real analyses. The same evaluation ideas extend to LISA (space-based) and PTA (pulsar timing) science, each of which brings a distinct set of inference challenges.

Physics concepts & glossary

Explanations of all key terms. Physicists can safely skip this section.

Physics concepts

Gravitational waves

When two black holes spiral together, they emit ripples in spacetime. LIGO records these as a tiny strain h(t) that sweeps from low to high frequency as the inspiral speeds up — the "chirp" — before the final merger and ringdown.
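The upward frequency sweep can be made concrete with the leading-order (Newtonian) relation between gravitational-wave frequency and time to coalescence. The snippet below is a minimal sketch of that textbook relation only. It is not the benchmark's waveform model, and the function name is illustrative.

```python
import math

# G * M_sun / c^3 in seconds (one solar mass expressed as a time).
SOLAR_MASS_SECONDS = 4.925491e-6

def newtonian_gw_frequency(chirp_mass_msun, time_to_merger_s):
    """Leading-order GW frequency (Hz) at a given time before coalescence."""
    mc_seconds = chirp_mass_msun * SOLAR_MASS_SECONDS
    return (
        (1.0 / math.pi)
        * (5.0 / (256.0 * time_to_merger_s)) ** (3.0 / 8.0)
        * mc_seconds ** (-5.0 / 8.0)
    )

# The frequency rises as the merger approaches: this is the chirp.
for tau in (1.0, 0.1, 0.01):
    f_gw = newtonian_gw_frequency(30.0, tau)
    print(f"{tau:5.2f} s before merger: {f_gw:7.1f} Hz")
```

The approximation breaks down close to merger, where full inspiral-merger-ringdown waveform models are needed.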

Binary black hole system

The two black holes have masses m₁ ≥ m₂. The chirp mass M_c = (m₁m₂)^(3/5) / (m₁ + m₂)^(1/5) is the combination most precisely measured from the signal's frequency evolution. The mass ratio q = m₂/m₁ ≤ 1 describes how symmetric the pair is.
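Chirp mass and mass ratio map one-to-one onto the component masses through the standard definition M_c = (m₁m₂)^(3/5) / (m₁ + m₂)^(1/5). The helper functions below are a small sketch of that conversion. Their names are illustrative and not part of any benchmark package.

```python
def chirp_mass(m1, m2):
    """Chirp mass from component masses (same units in and out)."""
    return (m1 * m2) ** 0.6 / (m1 + m2) ** 0.2

def component_masses(chirp_mass_value, mass_ratio):
    """Return (m1, m2) with m1 >= m2, given chirp mass and q = m2/m1 <= 1."""
    if not 0.0 < mass_ratio <= 1.0:
        raise ValueError("mass_ratio must lie in (0, 1]")
    # Substituting m2 = q * m1 into the definition gives
    # M_c = m1 * q^(3/5) / (1 + q)^(1/5), which is solved here for m1.
    m1 = chirp_mass_value * (1.0 + mass_ratio) ** 0.2 / mass_ratio ** 0.6
    return m1, mass_ratio * m1

m1, m2 = component_masses(30.0, 0.5)
print(f"m1 = {m1:.2f}, m2 = {m2:.2f}, round trip M_c = {chirp_mass(m1, m2):.2f}")
```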

Aligned spins

Each black hole carries spin. In the aligned-spin approximation used in Levels 0–1, each spin points either parallel or anti-parallel to the orbital angular momentum: χ > 0 is aligned with the orbit, χ < 0 is anti-aligned. Spin components in the orbital plane, which make the orbit precess, are excluded here.

Intrinsic vs. extrinsic

Intrinsic parameters describe the binary itself: masses and spins. This benchmark also groups luminosity distance with them, although distance is conventionally counted as extrinsic. Extrinsic parameters describe the viewing geometry from Earth: sky position, arrival time, phase, polarisation, and inclination. Level 0 fixes all of these.

Power spectral density

Each detector has a frequency-dependent noise floor S(f). Sensitivity peaks in the middle band — low frequencies are dominated by seismic noise; high frequencies by photon shot noise. A known PSD removes one axis of uncertainty from the benchmark.
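One way to see why the PSD matters is the noise-weighted (optimal) signal-to-noise ratio, in which each frequency bin is down-weighted by the noise power S(f). The sketch below uses a made-up toy PSD with a steep low-frequency wall and rising high-frequency noise. The curve, its constants, and the signal amplitude are assumptions for illustration, not any detector's measured sensitivity or the benchmark's PSD.

```python
import math

def toy_psd(f, s0=1e-46, f_low=30.0, f_high=300.0):
    """Toy one-sided PSD: low-frequency wall, flat middle, rising high end."""
    return s0 * ((f_low / f) ** 8 + 1.0 + (f / f_high) ** 2)

def optimal_snr(h_tilde, psd, delta_f):
    """sqrt(4 * sum(|h(f)|^2 / S(f)) * df) for a one-sided PSD."""
    integrand = sum(abs(h) ** 2 / s for h, s in zip(h_tilde, psd))
    return math.sqrt(4.0 * delta_f * integrand)

delta_f = 0.25
freqs = [20.0 + delta_f * k for k in range(4000)]      # 20 Hz up to about 1020 Hz
psd = [toy_psd(f) for f in freqs]
# Inspiral-like amplitude |h(f)| proportional to f^(-7/6); the scale is arbitrary.
h_tilde = [1e-22 * f ** (-7.0 / 6.0) for f in freqs]

snr = optimal_snr(h_tilde, psd, delta_f)
print(f"toy optimal SNR: {snr:.1f}")
```

Because the PSD enters only as a weight, fixing it in advance means a method never has to learn or marginalise over the noise spectrum.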

Simulation-based inference

Because exact likelihood evaluation requires expensive waveform simulations, SBI methods train a neural network once on many simulated signals to approximate the posterior p(θ|d) directly. Classical nested samplers provide ground-truth reference posteriors for comparison.

Detectors, experiments & terms

LIGO — the US interferometers

LIGO (Laser Interferometer Gravitational-Wave Observatory) operates two identical 4 km L-shaped detectors: H1 in Hanford, Washington and L1 in Livingston, Louisiana. Each arm uses laser light to measure length changes a thousand times smaller than a proton.

LVK network — H1, L1, V1

LVK stands for LIGO–Virgo–KAGRA. Virgo (V1) is a 3 km detector near Pisa, Italy, operated by the European Gravitational Observatory. KAGRA (K1) is a 3 km detector in the Kamioka mine, Japan. Together, three or more detectors enable sky localisation by triangulation.

BBH & MBHB — source classes

A binary black hole (BBH) is the primary LVK source: two stellar-mass black holes of tens of solar masses. A massive black-hole binary (MBHB) involves black holes of roughly ten thousand to ten million solar masses. These are LISA's primary target and produce much longer, louder signals.

LISA — the space antenna

LISA (Laser Interferometer Space Antenna) is a planned ESA mission launching in the 2030s. Three spacecraft orbit the Sun in a triangular constellation with arms 2.5 million km long. It is sensitive to gravitational waves at frequencies up to a million times lower than those LIGO observes.

TDI — Time Delay Interferometry

In LISA, the arm lengths are unequal and change with time, so laser frequency noise does not cancel when the beams are combined, as it would in an equal-arm interferometer. Time Delay Interferometry (TDI) is a post-processing technique that combines time-shifted copies of the recorded data to synthesise virtual equal-arm interferometers in which that noise cancels. TDI 1.5 is the standard used in early LISA benchmark designs.
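The idea behind TDI can be illustrated with a toy model: two arms with different, static round-trip delays, and laser noise as the only noise source. The sketch below builds the first-generation unequal-arm Michelson combination, which is simpler than the TDI 1.5 named above, using integer-sample circular delays. All numbers and function names are illustrative assumptions.

```python
import random

def delayed(series, shift):
    """Circularly delay a sampled series by an integer number of samples."""
    n = len(series)
    return [series[(i - shift) % n] for i in range(n)]

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

def rms(series):
    return (sum(v * v for v in series) / len(series)) ** 0.5

rng = random.Random(0)
laser_noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
round_trip_1, round_trip_2 = 33, 35   # unequal round-trip delays in samples (toy values)

# Each arm compares the returning beam with the local laser.
y1 = subtract(delayed(laser_noise, round_trip_1), laser_noise)
y2 = subtract(delayed(laser_noise, round_trip_2), laser_noise)

# Naive Michelson difference: laser noise survives because the arms differ.
naive = subtract(y1, y2)

# TDI-style combination: also subtract each arm's signal delayed by the
# other arm's round trip, so both virtual beams travel the same total path.
tdi_x = subtract(naive, subtract(delayed(y1, round_trip_2), delayed(y2, round_trip_1)))

print(f"residual laser noise, naive: {rms(naive):.3e}, TDI: {rms(tdi_x):.3e}")
```

In this idealised setting the laser noise cancels to floating-point precision. Real LISA arms change length over time and delays are not integer multiples of the sampling interval, which is why higher TDI generations and interpolation are needed in practice.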

Bilby & nested sampling

Bilby is an open-source Python package for Bayesian gravitational-wave parameter estimation. It wraps several samplers, primarily nested sampling algorithms such as dynesty, which explore the posterior by gradually refining a set of live points. These classical samplers provide the reference posteriors against which SBI methods are scored.
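The scoring metric is not specified on this page. As one plausible illustration, a candidate's posterior samples can be compared with reference samples using the Jensen-Shannon divergence between one-dimensional marginal histograms. Everything in the sketch below (the metric choice, the binning, and the synthetic Gaussian stand-ins for real posteriors) is an assumption.

```python
import math
import random

def binned_probabilities(samples, low, high, n_bins):
    """Histogram samples on a shared uniform grid, normalised to sum to 1."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for value in samples:
        if low <= value < high:
            counts[min(int((value - low) / width), n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def js_divergence_bits(p, q):
    """Jensen-Shannon divergence in bits: 0 if identical, 1 if disjoint."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)
    mixture = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)

rng = random.Random(1)
reference = [rng.gauss(30.0, 1.0) for _ in range(5000)]  # stand-in for nested-sampling draws
candidate = [rng.gauss(30.0, 1.0) for _ in range(5000)]  # stand-in for a well-calibrated model
biased = [rng.gauss(32.0, 1.0) for _ in range(5000)]     # stand-in for a biased model

grid = (24.0, 38.0, 40)  # low, high, number of bins (arbitrary choices)
p_ref = binned_probabilities(reference, *grid)
js_good = js_divergence_bits(p_ref, binned_probabilities(candidate, *grid))
js_bad = js_divergence_bits(p_ref, binned_probabilities(biased, *grid))
print(f"JS vs reference: matching = {js_good:.4f} bits, biased = {js_bad:.4f} bits")
```

A per-parameter score of this kind is cheap to compute but blind to correlations between parameters, so a real evaluation would likely combine it with multivariate or calibration tests.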

LVK — ground-based track

Active

Each LVK observation is a short strain time series from three Earth-based interferometers (LIGO Hanford, LIGO Livingston, Virgo). The benchmark task: given that data, return samples from the posterior over the source parameters. We start here because the data format is well understood, reference posteriors exist, and the inference problem is challenging but tractable for SBI methods.

Level 0 — fixed extrinsic

bbh-pe-l0

Five-dimensional posterior estimation over intrinsic parameters only.

The input to your model is a strain vector from three detectors (H1, L1, V1). Your task is to return samples from the posterior over five parameters: chirp mass (M_c), mass ratio (q), luminosity distance (d_L), and the two aligned spin components (χ₁, χ₂). All extrinsic parameters (sky location, geocentric arrival time, coalescence phase, polarisation, and inclination) are held at fixed, known values by the benchmark.

The noise is Gaussian with a fixed, known power spectral density, and the simulator is provided. Every participant uses the same forward model. This makes Level 0 a clean entry point: you can iterate on architectures and calibration without navigating the full extrinsic geometry. Both blind and unblinded evaluation splits are available.

Level 0 specification
Aspect	Specification
Track slug	bbh-pe-l0
Source	Aligned-spin binary black hole (BBH)
Detectors	H1 · L1 · V1 (three-detector LVK network)
Inference target	M_c, q, d_L, χ₁, χ₂ (5 dimensions)
Extrinsic	Fixed: sky position, time, phase, ψ, θ_JN held constant
Noise	Gaussian · fixed known PSD
Prior	Standard Bilby BBH prior (shipped with benchmark)
Simulator	Provided — shared forward model for all participants
Evaluation	Blind and unblinded splits available
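The Level 0 contract (three strain series in, five-dimensional posterior samples out) can be sketched as a function signature. The data layout, parameter names, and sampler below are hypothetical, since the real task sheets and data releases are not yet final. The placeholder ignores its input and draws from arbitrary toy ranges, not from the shipped Bilby prior.

```python
import random

# Hypothetical names: the benchmark's actual data format and API are not yet published.
LEVEL0_PARAMETERS = ("chirp_mass", "mass_ratio", "luminosity_distance", "chi_1", "chi_2")
DETECTORS = ("H1", "L1", "V1")

# Arbitrary illustrative ranges, NOT the prior shipped with the benchmark.
TOY_RANGES = {
    "chirp_mass": (20.0, 40.0),
    "mass_ratio": (0.2, 1.0),
    "luminosity_distance": (100.0, 2000.0),
    "chi_1": (-0.9, 0.9),
    "chi_2": (-0.9, 0.9),
}

def placeholder_sampler(strain, n_samples, seed=0):
    """Stand-in for a trained model: ignores the data, draws from TOY_RANGES."""
    missing = [det for det in DETECTORS if det not in strain]
    if missing:
        raise ValueError(f"missing strain for detectors: {missing}")
    rng = random.Random(seed)
    return [
        [rng.uniform(*TOY_RANGES[name]) for name in LEVEL0_PARAMETERS]
        for _ in range(n_samples)
    ]

# One strain series per detector (zeros here, purely to exercise the interface).
strain = {det: [0.0] * 1024 for det in DETECTORS}
samples = placeholder_sampler(strain, n_samples=1000)
print(len(samples), "samples x", len(samples[0]), "parameters")
```

A real submission would replace the placeholder with a model conditioned on the strain; only the shape of the output, one row per sample and one column per parameter, is the point here.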

Level 1 — full parameter space

bbh-pe-l1

Eleven-dimensional inference — same setup, extrinsic parameters now free.

Same data format and noise contract as Level 0. The difference: six extrinsic parameters are no longer fixed. Sky position (RA, dec), geocentric arrival time, coalescence phase, polarisation angle (ψ), and inclination (θ_JN) are now part of the posterior. Combined with the five intrinsic parameters from Level 0, the full inference target is eleven-dimensional.

The posterior structure becomes more complex. Arrival-time differences between a pair of detectors confine the source to a ring on the sky; adding a third detector typically reduces this to arc segments or two mirror-image patches, so sky posteriors are often elongated or multimodal. Inclination and distance are also strongly correlated. Classical nested samplers handle Level 1 but need significantly more compute per event, whereas amortised SBI methods can still evaluate any new event in a single forward pass. Exposing that cost contrast is exactly what Level 1 is designed to do.
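The geometric origin of sky degeneracies can be sketched with a single detector pair. The arrival-time difference depends only on the angle between the source direction and the baseline joining the detectors, so every direction on a cone around that baseline gives the same delay. The baseline length below is a rounded, approximate Hanford-Livingston separation, and the function names are illustrative.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0   # m/s
BASELINE_M = 3.0e6               # about 3000 km, approximate H1-L1 separation

def time_delay(angle_to_baseline_rad, baseline_m=BASELINE_M):
    """Arrival-time difference (s) between two detectors for a plane wave."""
    return baseline_m * math.cos(angle_to_baseline_rad) / SPEED_OF_LIGHT

def direction_on_ring(angle_to_baseline_rad, azimuth_rad):
    """Unit vector at a fixed angle from a baseline along z, at a given azimuth."""
    s = math.sin(angle_to_baseline_rad)
    return (s * math.cos(azimuth_rad), s * math.sin(azimuth_rad),
            math.cos(angle_to_baseline_rad))

def delay_from_direction(direction, baseline_m=BASELINE_M):
    """Same delay from a unit direction vector, with the baseline along z."""
    return baseline_m * direction[2] / SPEED_OF_LIGHT

theta = math.radians(60.0)
delays = [
    delay_from_direction(direction_on_ring(theta, math.radians(az)))
    for az in range(0, 360, 45)
]
print(f"delays around the ring span {max(delays) - min(delays):.1e} s")
print(f"maximum possible delay: {time_delay(0.0) * 1e3:.1f} ms")
```

Timing alone therefore cannot distinguish points on the ring. Additional detectors, together with amplitude and phase information, are what break the degeneracy.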

Level 1 specification
Aspect	Specification
Track slug	bbh-pe-l1
Source	Same aligned-spin BBH class as Level 0
Detectors	Same H1 · L1 · V1 network
Inference target	M_c, q, d_L, χ₁, χ₂ + RA, dec, t_geo, φ, ψ, θ_JN (11 dimensions)
Extrinsic	All six extrinsic parameters are free
Noise	Same fixed-PSD Gaussian model as Level 0
Prior	Same Bilby BBH prior family as Level 0
Simulator	Same shared forward model as Level 0
vs. Level 0	Larger parameter space · richer posterior geometry · more expensive reference posteriors

…

Coming next

Higher levels will introduce more pipeline freedom — varying noise realisations, more flexible simulators, and eventually production-style settings where methods must handle detector artifacts and non-Gaussian noise. The level ladder idea stays the same: each step adds one axis of realism.

LISA — space-based track

Planning

LISA is a triangular constellation of spacecraft in heliocentric orbit, sensitive to gravitational waves at frequencies far below what ground-based detectors can reach. Its primary target is massive black-hole binary (MBHB) mergers, with source masses of roughly ten thousand to ten million solar masses. The detector response, signal duration (weeks to months), and sky-localisation degeneracies are all qualitatively different from the LVK case, making LISA a distinct inference problem.

LISA Level 0 — scope under design

mbhb-pe-lisa-l0

MBHB inference with the reference LISA response: fixed noise, equal arm lengths, TDI 1.5.

Scope under design

The philosophy matches LVK Level 0: tackle the real problem in a controlled setting. Fix the noise to the reference LISA PSD, assume equal arm lengths and circular spacecraft orbits, use TDI 1.5 rather than full TDI 2.0, and exclude precession. This strips away tooling complexity while preserving what makes LISA inference genuinely different: the full LISA detector response function, signal durations of weeks to months, and characteristic multimodal sky posteriors that arise in every MBHB observation regardless of SNR.

Massive black-hole binary signals differ from LVK BBH signals in mass scale, loudness, and duration: sources range from roughly ten thousand to ten million solar masses, SNR is much higher, and signals are much longer. The sky-localisation posterior is always multimodal by the geometry of a single orbiting constellation, and some degeneracies have no analogue in the ground-based case. LISA Level 0 will be a distinct benchmark, not a rescaled version of the LVK tracks.

…

Coming next

Further LISA levels and additional source classes (e.g. galactic binaries) will follow once Level 0 scope is finalised and tooling is in place.

PTA — pulsar timing

Outlook

Pulsar timing arrays monitor an ensemble of millisecond pulsars spread across the Milky Way. A stochastic gravitational-wave background imprints timing residuals that are correlated between pulsar pairs in a characteristic pattern that depends on their angular separation, the Hellings–Downs correlation. The inference targets are the statistical properties of that background rather than individual merger events.
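The Hellings–Downs correlation has a closed form as a function of the angle between two pulsars, for an isotropic, unpolarised background. The sketch below evaluates it in the common normalisation where the correlation between two distinct pulsars tends to 0.5 at zero separation. The function name is illustrative.

```python
import math

def hellings_downs(angle_rad):
    """Expected timing-residual correlation for two distinct pulsars."""
    x = (1.0 - math.cos(angle_rad)) / 2.0
    if x == 0.0:
        return 0.5   # limit as the angular separation goes to zero
    return 1.5 * x * math.log(x) - 0.25 * x + 0.5

for degrees in (0, 30, 60, 90, 120, 150, 180):
    gamma = hellings_downs(math.radians(degrees))
    print(f"{degrees:3d} deg: {gamma:+.3f}")
```

The curve starts positive, dips negative at intermediate separations, and recovers to a positive value for antipodal pulsars. That quadrupolar signature is what distinguishes a gravitational-wave background from other correlated noise sources.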

The data format is fundamentally different from interferometric data: correlation matrices and sky maps rather than strain time series. Early PTA benchmark ideas focus on self-contained classification tasks — for example, distinguishing isotropic from anisotropic gravitational-wave backgrounds — that do not require the full machinery of a production PTA analysis pipeline. Concrete benchmark designs are still being explored.