⚖️ LLM as Judge (LJ)

Specific Producer-Critic implementation where an LLM acts as the critic to evaluate outputs

Complexity: Low · Category: Reflection

🎯 30-Second Overview

Pattern: Use LLMs to evaluate and score other LLM outputs at scale

Why: Automates quality assessment, enables best-of-N selection, scales evaluation

Key Insight: Specific implementation of Producer-Critic where the critic is an LLM with evaluation prompts

⚡ Quick Implementation

1. Define Criteria: Specify evaluation metrics & rubrics
2. Design Prompt: Create a structured judge prompt
3. Feed Outputs: Pass generated content to the judge
4. Get Scores: Receive ratings/rankings/decisions
5. Act on Results: Filter, rank, or select outputs

Example: outputs[n] → LLM_Judge(criteria) → scores[n] → select_best(scores) (sketched below)
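A minimal sketch of these five steps in Python, assuming a `call_llm(prompt) -> str` helper that wraps whichever chat-completion API you use; the helper name and the 1-10 scale are assumptions, not part of the pattern.

```python
import json
from typing import Callable

def select_best(outputs: list[str], criteria: str,
                call_llm: Callable[[str], str]) -> str:
    """outputs[n] -> LLM judge -> scores[n] -> best output."""
    scores = []
    for output in outputs:
        # Step 2: structured judge prompt built from the criteria (step 1).
        prompt = (
            "You are an impartial judge. Rate the response below against these criteria:\n"
            f"{criteria}\n\n"
            f"Response:\n{output}\n\n"
            'Reply with JSON only: {"score": <integer 1-10>, "reason": "<one sentence>"}'
        )
        # Steps 3-4: feed the output to the judge and parse its score.
        scores.append(json.loads(call_llm(prompt))["score"])
    # Step 5: act on the results by selecting the top-scoring output.
    return outputs[scores.index(max(scores))]
```

In practice you would also retry on malformed JSON and log the judge's reason alongside the score.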

📋 Do's & Don'ts

✅ Use clear, objective evaluation criteria
✅ Implement a panel of judges for controversial decisions
✅ Provide rubrics and examples in judge prompts (see the prompt sketch after this list)
✅ Use structured output formats (JSON) for scores
✅ Validate judge consistency against known test cases
❌ Use vague criteria like "good" or "better"
❌ Rely on a single judge for critical decisions
❌ Skip calibrating the judge against human evaluations
❌ Use the same model for generation and judging (self-preference bias)
❌ Ignore judge uncertainty or confidence scores

🚦 When to Use

Use When

  • Evaluating multiple outputs
  • Quality assurance at scale
  • A/B testing LLM responses
  • Automated content moderation

Avoid When

  • Subjective creative tasks
  • Single output generation
  • Real-time critical decisions
  • Legal/medical assessments

📊 Key Metrics

Judge Agreement: Correlation with human raters (see the sketch after this list)
Consistency: Same judgment on the same input
Discrimination: Ability to rank quality differences
False Positive Rate: Good content marked as bad
Coverage: % of edge cases handled correctly
Speed: Evaluations per second
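A sketch of the first two metrics, assuming you have human ratings for a calibration set and can re-run the judge on the same items; SciPy's `spearmanr` is one reasonable choice for measuring agreement.

```python
from scipy.stats import spearmanr  # assumption: SciPy is available

def judge_agreement(judge_scores: list[float], human_scores: list[float]) -> float:
    """Rank correlation between judge and human ratings on the same calibration items."""
    rho, _ = spearmanr(judge_scores, human_scores)
    return rho

def judge_consistency(score_runs: list[list[int]]) -> float:
    """Fraction of items that receive the same score in every repeated judging run.
    score_runs[i] holds run i's scores, aligned by item index."""
    per_item = list(zip(*score_runs))
    return sum(len(set(scores)) == 1 for scores in per_item) / len(per_item)
```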

💡 Top Use Cases

Output Selection: Generate 5 responses → Judge ranks them → Return the best to the user
Quality Gates: Check whether generated code/content meets standards before deployment
A/B Testing: Compare model versions by judging outputs on the same prompts
Safety Filtering: Evaluate outputs for harmful/biased content before serving (a panel-vote sketch follows this list)
RAG Evaluation: Judge the relevance of retrieved documents to the query
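For safety filtering and other controversial calls, a panel of judges backed by different models limits single-judge bias. A minimal majority-vote sketch, reusing the `call_llm`-style callables assumed earlier; the prompt wording is an assumption.

```python
from typing import Callable

# Prompt wording is an assumption; tighten it for your own safety policy.
SAFETY_PROMPT = (
    "You are a content safety reviewer. Answer with exactly one word, safe or unsafe, "
    "for the following output:\n\n{output}"
)

def is_safe(output: str, judges: list[Callable[[str], str]]) -> bool:
    """Majority vote over a panel of judge callables, ideally backed by different models."""
    votes = [
        call(SAFETY_PROMPT.format(output=output)).strip().lower() == "safe"
        for call in judges
    ]
    return sum(votes) > len(votes) / 2
```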

