METR RE-Bench (RE-Bench)

A benchmark for measuring the performance of frontier model agents on ML research engineering tasks, compared against human expert capabilities.

Complexity: high | Evaluation and Monitoring

🎯 30-Second Overview

Pattern: METR's benchmark comparing frontier AI agents against 71 human expert attempts across 7 ML research engineering environments

Why: Evaluates the R&D automation capabilities highlighted as a key risk in frontier AI safety policies

Key Insight: Agents achieve roughly 4x higher scores than humans at a 2-hour budget, but humans outscore the top agents by roughly 2x at 32 hours; how performance scales with time matters

⚡ Quick Implementation

1. Setup: Deploy the 7 ML research engineering environments with GPU access
2. Configure: Set time budgets (2h, 8h, 32h) and scoring functions
3. Evaluate: Run agents on scaling laws, GPU kernels, and ML optimization tasks
4. Compare: Benchmark against the 71 human expert attempts (61 distinct experts)
5. Analyze: Assess R&D automation capabilities vs. human performance
Example (see the hedged sketch below): re_bench = REBench(environments=7, experts=71, models=["claude-3.5-sonnet", "o1-preview"], budget="8h")
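A minimal sketch of how such a run could be wired together, assuming a hypothetical REBench wrapper: the class, its run() method, and the placeholder environment names below are illustrative stand-ins, not METR's published harness API.

```python
"""Minimal sketch of an RE-Bench-style evaluation run.

Assumptions: REBench, its run() method, the placeholder environment names, and
the agent callable are illustrative stand-ins, not METR's published harness API.
"""
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class REBench:
    # 7 ML research engineering environments (placeholder names)
    environments: List[str] = field(default_factory=lambda: [f"env_{i}" for i in range(1, 8)])
    # Wall-clock budgets to compare, in hours
    time_budgets_hours: Tuple[int, ...] = (2, 8, 32)
    models: Tuple[str, ...] = ("claude-3.5-sonnet", "o1-preview")

    def run(self, agent: Callable[[str, str, int], float]) -> Dict[Tuple[str, str, int], float]:
        """Run every (model, environment, budget) combination once.

        agent(model, environment, budget_hours) must return a normalized score,
        where 0 corresponds to the starting solution and 1 to the reference solution.
        """
        results: Dict[Tuple[str, str, int], float] = {}
        for model in self.models:
            for env in self.environments:
                for budget in self.time_budgets_hours:
                    results[(model, env, budget)] = agent(model, env, budget)
        return results


if __name__ == "__main__":
    def dummy_agent(model: str, env: str, budget: int) -> float:
        # Dummy agent returning a fixed normalized score; swap in a real harness here.
        return 0.25

    scores = REBench().run(dummy_agent)
    print(f"{len(scores)} runs completed")  # 2 models x 7 envs x 3 budgets = 42
```

The scoring convention assumed here (0 for the unmodified starting solution, 1 for the human reference solution) keeps agent and human results directly comparable across environments.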

๐Ÿ“‹ Do's & Don'ts

✅ Test across all 7 environments for comprehensive R&D capability assessment
✅ Compare performance at multiple time budgets (2h, 8h, 32h); see the sketch after this list
✅ Focus on open-ended research engineering tasks, not classical ML
✅ Provide GPU access for kernel optimization and scaling experiments
✅ Use human expert baselines from the 61 distinct researchers
❌ Expect agents to outperform humans at extended time budgets (32h)
❌ Focus only on short-term performance; humans show better scaling with time
❌ Use publicly available solutions; the benchmark is designed to avoid tasks with known public solutions
❌ Skip safety considerations for R&D automation capabilities
❌ Ignore the 10x speed advantage agents have in generating and testing solutions

🚦 When to Use

Use When

  • Evaluating frontier AI R&D automation capabilities
  • Measuring research engineering vs classical ML skills
  • Assessing AI safety risks from autonomous R&D
  • Comparing agent performance against human experts
  • Research on AI-driven scientific discovery

Avoid When

  • Standard ML benchmarking with public solutions
  • Classical machine learning task evaluation
  • Short-term capability assessment only
  • Environments without GPU/compute resources
  • Non-research engineering skill evaluation

📊 Key Metrics

Expert Success Rate: 82% of human experts achieved a non-zero score
Strong Solution Match: 24% of experts matched or exceeded the reference solutions
Short-Term Agent Advantage: agents score 4x higher than humans at the 2-hour budget
Long-Term Human Advantage: humans score 2x higher than the top agents at the 32-hour budget
Solution Generation Speed: agents generate and test solutions 10x faster than humans
Cost Efficiency: much lower cost per solution attempt for AI agents
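A sketch of how headline metrics like these could be derived from per-attempt scores, assuming RE-Bench-style normalization (starting solution = 0, reference solution = 1); the attempt values below are invented for illustration, not the real data.

```python
# Sketch: deriving headline metrics from per-attempt normalized scores.
# The expert_attempts values are invented placeholders, not the real RE-Bench data.

expert_attempts = [0.0, 0.3, 1.2, 0.8, 0.0, 1.05, 0.6, 0.9]  # one score per attempt

nonzero_rate = sum(s > 0 for s in expert_attempts) / len(expert_attempts)
reference_match_rate = sum(s >= 1.0 for s in expert_attempts) / len(expert_attempts)

print(f"Expert success rate (non-zero score): {nonzero_rate:.0%}")
print(f"Matched or exceeded reference:        {reference_match_rate:.0%}")
```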

💡 Top Use Cases

AI Safety Research: Evaluating R&D automation risks as highlighted in frontier AI safety policies
Research Capability Assessment: Measuring agent performance on scaling laws, GPU kernel optimization
Human-AI Comparison: Direct benchmarking against 71 expert attempts across multiple time budgets
Autonomous R&D Evaluation: Testing frontier models (Claude 3.5 Sonnet, o1-preview) on research tasks
Policy and Governance: Supporting White House NSM and EU AI Act evaluations for R&D capabilities
