๐Ÿข

TheAgentCompany Benchmark(TAC)

Benchmarks LLM agents on consequential real-world tasks that would typically be completed by multiple job roles in a software engineering company.

Complexity: high
Category: Evaluation and Monitoring

🎯 30-Second Overview

Pattern: Simulated software company environment with 175 real-world professional tasks

Why: Evaluates enterprise-ready AI agents in realistic workplace scenarios with multi-tool collaboration

Key Insight: Current SOTA agents achieve ~30% completion rate, highlighting gaps in real-world deployment readiness

⚡ Quick Implementation

1. Clone: git clone https://github.com/TheAgentCompany/TheAgentCompany
2. Setup: Docker environment with GitLab, OwnCloud, Plane, and RocketChat
3. Configure: Set up the agent with browser, terminal, and code editor access
4. Select: Choose tasks from the 175 professional scenarios
5. Evaluate: Run the checkpoint-based evaluation with partial credit (a hedged driver-loop sketch follows below)
Example: python run_evaluation.py --agent gpt-4 --tasks software_engineering --mode full
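The exact entry point and flags may differ across repository versions, so check the project README before running. As a rough sketch of what a checkpoint-based benchmark run involves, the snippet below shows a driver loop that selects tasks, runs an agent over them, and records per-checkpoint outcomes. Every name here (select_tasks, run_agent_on_task, the task dict layout, agent.attempt) is an assumption for illustration, not TheAgentCompany's actual API.

```python
# Hypothetical driver loop for a checkpoint-based benchmark run.
# Names and structures are illustrative assumptions, not TheAgentCompany's API.

def select_tasks(all_tasks, category=None):
    """Filter the task pool down to one job-role category, e.g. 'software_engineering'."""
    if category is None:
        return all_tasks
    return [t for t in all_tasks if t["category"] == category]

def run_agent_on_task(agent, task):
    """Run the agent on one task, then grade each checkpoint.

    In the real benchmark the agent works inside the Dockerized
    GitLab/OwnCloud/Plane/RocketChat workspace via browser, terminal,
    and code editor, and graders inspect the resulting state afterwards.
    Here we only record placeholder pass/fail outcomes per checkpoint.
    """
    outcomes = [agent.attempt(cp) for cp in task["checkpoints"]]
    return {"task_id": task["id"], "category": task["category"], "passed": outcomes}

def run_benchmark(agent, all_tasks, category=None):
    """Run the agent over every selected task and collect the results."""
    results = []
    for task in select_tasks(all_tasks, category):
        results.append(run_agent_on_task(agent, task))
    return results
```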

📋 Do's & Don'ts

✅ Test across multiple job roles (dev, QA, PM, admin, data science)
✅ Use checkpoint-based evaluation for partial credit assessment (see the scoring sketch after this list)
✅ Enable agent communication with simulated colleagues
✅ Leverage the self-hosted environment for reproducible results
✅ Focus on long-horizon, multi-step professional tasks
❌ Expect high completion rates (current SOTA is ~30%)
❌ Skip proper Docker environment setup and configuration
❌ Ignore partial credit scoring: many tasks show partial progress
❌ Test only on simple automation tasks
❌ Overlook agent-to-agent communication requirements

🚦 When to Use

Use When

  • Evaluating enterprise-ready agent capabilities
  • Testing work automation and digital worker scenarios
  • Assessing real-world professional task performance
  • Research on AI impact on labor markets
  • Multi-step, long-horizon task evaluation

Avoid When

  • Quick capability demos or simple task evaluation
  • Resource-constrained environments (requires the full Docker stack)
  • Single-domain or narrow task assessment
  • Real-time performance testing
  • Academic-only or synthetic task evaluation

📊 Key Metrics

Full Completion Rate: percentage of tasks completed successfully (typically ~30% for current SOTA)
Partial Credit Score: weighted score that includes partial progress across checkpoints (see the aggregation sketch below)
Task Category Performance: success rate by job role (dev, QA, PM, etc.)
Time to Completion: average duration of successful task completions
Communication Effectiveness: quality of agent-to-colleague interactions
Tool Usage Proficiency: effective use of the browser, terminal, and code editor
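To make the first three metrics concrete, the sketch below aggregates per-task results into a full completion rate, a mean partial-credit score, and per-category success rates. The result dict layout is an assumption carried over from the earlier sketches, not the benchmark's actual output format.

```python
# Hypothetical aggregation of per-task results into benchmark-level metrics.
from collections import defaultdict

def aggregate(results):
    """results: list of dicts like {"category": str, "full": bool, "partial": float}."""
    n = len(results)
    full_rate = sum(r["full"] for r in results) / n
    mean_partial = sum(r["partial"] for r in results) / n
    by_category = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r["full"])
    category_rates = {cat: sum(v) / len(v) for cat, v in by_category.items()}
    return {"full_completion_rate": full_rate,
            "partial_credit_score": mean_partial,
            "task_category_performance": category_rates}

# Example with two tasks, one fully completed and one half done:
print(aggregate([
    {"category": "software_engineering", "full": True,  "partial": 1.0},
    {"category": "pm",                   "full": False, "partial": 0.5},
]))
```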

💡 Top Use Cases

Enterprise Readiness Assessment: Evaluating AI agents for real workplace deployment scenarios
Work Automation Research: Understanding which professional tasks can be automated with current AI
Digital Worker Development: Building and testing AI systems that collaborate with human colleagues
Economic Impact Analysis: Measuring AI capability progression for labor market policy research
Multi-Tool Agent Testing: Assessing agent proficiency across web browsers, terminals, and code editors
