METR RE-Bench (RE-Bench)

A benchmark for measuring the performance of frontier model agents on ML research engineering tasks, compared against human expert capabilities.

Complexity: high | Evaluation and Monitoring

🎯 30-Second Overview

Pattern: METR's benchmark comparing frontier AI agents against 71 human expert attempts across 7 ML research engineering environments

Why: Evaluates the R&D automation capabilities highlighted as a key risk in frontier AI safety policies

Key Insight: Agents achieve roughly 4x higher scores than humans at a 2-hour budget, but humans outscore the top agents by roughly 2x at 32 hours; how performance scales with time matters

⚡ Quick Implementation

1. Setup: Deploy the 7 ML research engineering environments with GPU access
2. Configure: Set time budgets (2h, 8h, 32h) and scoring functions
3. Evaluate: Run agents on scaling laws, GPU kernels, and ML optimization tasks
4. Compare: Benchmark against the 71 human expert attempts (61 distinct experts)
5. Analyze: Assess R&D automation capabilities vs. human performance
Example (see the hedged sketch below): re_bench = REBench(environments=7, experts=71, models=["claude-3.5-sonnet", "o1-preview"], budget="8h")
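A minimal sketch of how such a run could be wired together, assuming a hypothetical REBench wrapper: the class, its run() method, and the placeholder environment names below are illustrative stand-ins, not METR's published harness API.

```python
"""Minimal sketch of an RE-Bench-style evaluation run.

Assumptions: REBench, its run() method, the placeholder environment names, and
the agent callable are illustrative stand-ins, not METR's published harness API.
"""
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class REBench:
    # 7 ML research engineering environments (placeholder names)
    environments: List[str] = field(default_factory=lambda: [f"env_{i}" for i in range(1, 8)])
    # Wall-clock budgets to compare, in hours
    time_budgets_hours: Tuple[int, ...] = (2, 8, 32)
    models: Tuple[str, ...] = ("claude-3.5-sonnet", "o1-preview")

    def run(self, agent: Callable[[str, str, int], float]) -> Dict[Tuple[str, str, int], float]:
        """Run every (model, environment, budget) combination once.

        agent(model, environment, budget_hours) must return a normalized score,
        where 0 corresponds to the starting solution and 1 to the reference solution.
        """
        results: Dict[Tuple[str, str, int], float] = {}
        for model in self.models:
            for env in self.environments:
                for budget in self.time_budgets_hours:
                    results[(model, env, budget)] = agent(model, env, budget)
        return results


if __name__ == "__main__":
    def dummy_agent(model: str, env: str, budget: int) -> float:
        # Dummy agent returning a fixed normalized score; swap in a real harness here.
        return 0.25

    scores = REBench().run(dummy_agent)
    print(f"{len(scores)} runs completed")  # 2 models x 7 envs x 3 budgets = 42
```

The scoring convention assumed here (0 for the unmodified starting solution, 1 for the human reference solution) keeps agent and human results directly comparable across environments.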

๐Ÿ“‹ Do's & Don'ts

✅ Test across all 7 environments for comprehensive R&D capability assessment
✅ Compare performance at multiple time budgets (2h, 8h, 32h); see the sketch after this list
✅ Focus on open-ended research engineering tasks, not classical ML
✅ Provide GPU access for kernel optimization and scaling experiments
✅ Use human expert baselines from the 61 distinct researchers
❌ Expect agents to outperform humans at extended time budgets (32h)
❌ Focus only on short-term performance; humans show better scaling with time
❌ Use publicly available solutions; the benchmark is designed to avoid tasks with known public solutions
❌ Skip safety considerations for R&D automation capabilities
❌ Ignore the 10x speed advantage agents have in generating and testing solutions

🚦 When to Use

Use When

  • Evaluating frontier AI R&D automation capabilities
  • Measuring research engineering vs classical ML skills
  • Assessing AI safety risks from autonomous R&D
  • Comparing agent performance against human experts
  • Research on AI-driven scientific discovery

Avoid When

  • Standard ML benchmarking with public solutions
  • Classical machine learning task evaluation
  • Short-term capability assessment only
  • Environments without GPU/compute resources
  • Non-research engineering skill evaluation

📊 Key Metrics

Expert Success Rate: 82% of human experts achieved a non-zero score
Strong Solution Match: 24% of experts matched or exceeded the reference solutions
Short-Term Agent Advantage: agents score 4x higher than humans at the 2-hour budget
Long-Term Human Advantage: humans score 2x higher than the top agents at the 32-hour budget
Solution Generation Speed: agents generate and test solutions 10x faster than humans
Cost Efficiency: much lower cost per solution attempt for AI agents
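A sketch of how headline metrics like these could be derived from per-attempt scores, assuming RE-Bench-style normalization (starting solution = 0, reference solution = 1); the attempt values below are invented for illustration, not the real data.

```python
# Sketch: deriving headline metrics from per-attempt normalized scores.
# The expert_attempts values are invented placeholders, not the real RE-Bench data.

expert_attempts = [0.0, 0.3, 1.2, 0.8, 0.0, 1.05, 0.6, 0.9]  # one score per attempt

nonzero_rate = sum(s > 0 for s in expert_attempts) / len(expert_attempts)
reference_match_rate = sum(s >= 1.0 for s in expert_attempts) / len(expert_attempts)

print(f"Expert success rate (non-zero score): {nonzero_rate:.0%}")
print(f"Matched or exceeded reference:        {reference_match_rate:.0%}")
```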

💡 Top Use Cases

AI Safety Research: Evaluating R&D automation risks as highlighted in frontier AI safety policies
Research Capability Assessment: Measuring agent performance on scaling laws, GPU kernel optimization
Human-AI Comparison: Direct benchmarking against 71 expert attempts across multiple time budgets
Autonomous R&D Evaluation: Testing frontier models (Claude 3.5 Sonnet, o1-preview) on research tasks
Policy and Governance: Supporting White House NSM and EU AI Act evaluations for R&D capabilities
