METR RE-Bench (RE-Bench)
A benchmark for measuring the performance of frontier model agents on ML research engineering tasks, compared against human experts.
30-Second Overview
Pattern: METR's benchmark comparing frontier AI agents against 71 human experts across 7 ML research engineering environments
Why: Evaluates AI R&D automation capabilities, which frontier AI safety policies highlight as a key risk
Key Insight: Agents score roughly 4x higher than human experts under a 2-hour time budget, but humans outscore agents by roughly 2x at 32 hours; how performance scales with the time budget matters.
Quick Implementation
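RE-Bench's environments and scoring code live in METR's public repository; the snippet below is only a minimal, hypothetical sketch of the evaluation loop the benchmark implies. It assumes RE-Bench-style score normalization (the provided starting solution maps to 0 and METR's reference solution to 1) and a best-of-k comparison between agent and human runs at the same time budget. All names, signatures, and numbers are illustrative, not METR's actual API or results.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    """One run in an environment: who made it, the time budget, and the raw score."""
    runner: str          # "agent" or "human"
    budget_hours: float  # wall-clock budget allotted for the attempt
    raw_score: float     # environment-specific raw metric (e.g., kernel runtime, loss)

def normalize(raw: float, starting: float, reference: float) -> float:
    """Assumed RE-Bench-style linear normalization:
    0 = provided starting solution, 1 = reference solution, >1 = beats the reference."""
    return (raw - starting) / (reference - starting)

def best_of_k(attempts: list[Attempt], runner: str, budget_hours: float,
              starting: float, reference: float, k: int) -> float:
    """Best normalized score over the first k attempts by `runner` at a given budget."""
    scores = [normalize(a.raw_score, starting, reference)
              for a in attempts
              if a.runner == runner and a.budget_hours == budget_hours][:k]
    return max(scores) if scores else float("nan")

# Illustrative numbers only (not actual RE-Bench data); raw metric here is lower-is-better.
starting, reference = 10.0, 4.0
attempts = [
    Attempt("agent", 2.0, 6.5), Attempt("agent", 2.0, 7.0),
    Attempt("human", 2.0, 9.0),
    Attempt("agent", 32.0, 6.0), Attempt("human", 32.0, 2.0),
]
for budget in (2.0, 32.0):
    agent = best_of_k(attempts, "agent", budget, starting, reference, k=2)
    human = best_of_k(attempts, "human", budget, starting, reference, k=2)
    print(f"{budget:>4.0f}h budget  agent best@2={agent:.2f}  human best@2={human:.2f}")
```

Because the normalization divides by (reference − starting), the same formula works whether the environment's raw metric is higher-is-better or lower-is-better.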
Do's & Don'ts
When to Use
Use When
- Evaluating frontier AI R&D automation capabilities
- Measuring research engineering vs. classical ML skills
- Assessing AI safety risks from autonomous R&D
- Comparing agent performance against human experts
- Research on AI-driven scientific discovery
Avoid When
- Standard ML benchmarking with public solutions
- Classical machine learning task evaluation
- Short-term capability assessment only
- Environments without GPU/compute resources
- Non-research engineering skill evaluation
Key Metrics
Top Use Cases
References & Further Reading
Deepen your understanding with these curated resources
Conference & Academic Recognition
Contribute to this collection
Know a great resource? Submit a pull request to add it.