CORE-bench v2 (Collection): Benchmark for AI agents on scientific reproducibility, with mainline (39) and OOD (19) splits derived from Code Ocean capsules. • 2 items • Updated 1 day ago