AgentBench

The first comprehensive benchmark to evaluate LLMs as agents across 8 diverse environments, assessing reasoning and decision-making in multi-turn open-ended settings.

Complexity: High
Category: Evaluation and Monitoring

🎯 30-Second Overview

Pattern: First comprehensive benchmark evaluating LLMs as agents across 8 diverse environments

Why: Systematic assessment of reasoning, decision-making, and multi-turn interaction capabilities in realistic settings

Key Insight: Reveals significant performance gaps between commercial and open-source models in complex agent tasks

⚡ Quick Implementation

1. Install: git clone https://github.com/THUDM/AgentBench
2. Setup: Configure API keys and environment dependencies
3. Select: Choose evaluation environments (1-8)
4. Run: Execute the benchmark against your LLM agent
5. Analyze: Review performance across environments (see the sketch below)
Example: python run.py --model gpt-4 --environments all --output results.json
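
The analysis step can be scripted once the run finishes. Below is a minimal sketch that assumes results.json maps each environment name to a success rate; AgentBench's actual output schema may differ, so adjust the parsing accordingly.

```python
import json
from pathlib import Path

# Assumed (hypothetical) schema: {"os": 0.42, "db": 0.31, ...}
# Adjust to the structure your AgentBench run actually writes.
results = json.loads(Path("results.json").read_text())

# Per-environment scores, worst first, to surface weak domains.
for env, score in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{env:>12}: {score:.2%}")

# Overall success rate as an unweighted average across environments.
overall = sum(results.values()) / len(results)
print(f"{'overall':>12}: {overall:.2%}")
```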

📋 Do's & Don'ts

✅ Test across all 8 environments for comprehensive evaluation
✅ Allow sufficient time for multi-turn interactions (4k-13k generations)
✅ Compare results against established baselines (GPT-4, Claude)
✅ Focus on long-term reasoning and decision-making capabilities
✅ Analyze failure modes in complex multi-step tasks
❌ Rely on single-environment results for overall capability assessment
❌ Skip proper environment setup and dependency configuration
❌ Ignore instruction-following quality in favor of task completion
❌ Compare models without controlling for prompt engineering (see the adapter sketch after this list)
❌ Assume good performance in one domain transfers to others

🚦 When to Use

Use When

  • Comprehensive LLM agent capability assessment
  • Research on agent reasoning and decision-making
  • Comparing multiple models across diverse tasks
  • Identifying specific weaknesses in agent performance
  • Academic research and model development

Avoid When

  • Quick single-domain performance checks
  • Resource-constrained environments (requires 4k-13k generations)
  • Real-time evaluation needs
  • Domain-specific benchmarking only
  • Models without multi-turn conversation support

📊 Key Metrics

Overall Success Rate: Aggregate performance across all 8 environments
Per-Environment Score: Domain-specific capability assessment
Multi-turn Coherence: Consistency across conversation turns
Instruction Following: Adherence to task specifications
Long-term Reasoning: Performance on extended reasoning tasks
API vs OSS Gap: Commercial vs open-source model comparison (illustrated below)
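
To make the API-vs-OSS gap concrete, the per-environment scores of two models can simply be differenced. The numbers below are placeholders, not benchmark results.

```python
# Placeholder per-environment success rates (not real AgentBench scores).
api_model = {"os": 0.42, "db": 0.32, "web": 0.29, "game": 0.21}
oss_model = {"os": 0.18, "db": 0.11, "web": 0.12, "game": 0.05}

# Gap per environment, plus the mean gap as a single summary number.
gaps = {env: api_model[env] - oss_model[env] for env in api_model}
mean_gap = sum(gaps.values()) / len(gaps)

for env, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{env:>6}: {gap:+.2%} (API minus open-source)")
print(f"mean gap: {mean_gap:+.2%}")
```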

💡 Top Use Cases

LLM Agent Research: Comprehensive evaluation of reasoning and decision-making across diverse domains
Model Comparison: Systematic benchmarking of API-based vs open-source models (up to 70B parameters)
Capability Assessment: Identifying specific strengths/weaknesses in SQL, gaming, web, and OS environments
Academic Research: Supporting publications on agent capabilities and multi-turn interaction quality
Agent Development: Guiding improvements in long-term reasoning and instruction following

