⚖️ LLM as Judge (LJ)

Specific Producer-Critic implementation where an LLM acts as the critic to evaluate outputs

Complexity: Low · Category: Reflection

🎯 30-Second Overview

Pattern: Use LLMs to evaluate and score other LLM outputs at scale

Why: Automates quality assessment, enables best-of-N selection, scales evaluation

Key Insight: Specific implementation of Producer-Critic where the critic is an LLM with evaluation prompts

⚡ Quick Implementation

1. Define Criteria: Specify evaluation metrics & rubrics
2. Design Prompt: Create a structured judge prompt
3. Feed Outputs: Pass generated content to the judge
4. Get Scores: Receive ratings/rankings/decisions
5. Act on Results: Filter, rank, or select outputs

Example: outputs[n] → LLM_Judge(criteria) → scores[n] → select_best(scores) (sketched below)
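A minimal sketch of these five steps in Python, assuming a `call_llm(prompt) -> str` helper that wraps whichever chat-completion API you use; the helper name and the 1-10 scale are assumptions, not part of the pattern.

```python
import json
from typing import Callable

def select_best(outputs: list[str], criteria: str,
                call_llm: Callable[[str], str]) -> str:
    """outputs[n] -> LLM judge -> scores[n] -> best output."""
    scores = []
    for output in outputs:
        # Step 2: structured judge prompt built from the criteria (step 1).
        prompt = (
            "You are an impartial judge. Rate the response below against these criteria:\n"
            f"{criteria}\n\n"
            f"Response:\n{output}\n\n"
            'Reply with JSON only: {"score": <integer 1-10>, "reason": "<one sentence>"}'
        )
        # Steps 3-4: feed the output to the judge and parse its score.
        scores.append(json.loads(call_llm(prompt))["score"])
    # Step 5: act on the results by selecting the top-scoring output.
    return outputs[scores.index(max(scores))]
```

In practice you would also retry on malformed JSON and log the judge's reason alongside the score.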

📋 Do's & Don'ts

✅ Use clear, objective evaluation criteria
✅ Implement a panel of judges for controversial decisions
✅ Provide rubrics and examples in judge prompts (see the prompt sketch after this list)
✅ Use structured output formats (JSON) for scores
✅ Validate judge consistency against known test cases
❌ Use vague criteria like "good" or "better"
❌ Rely on a single judge for critical decisions
❌ Skip calibrating the judge against human evaluations
❌ Use the same model for generation and judging (self-preference bias)
❌ Ignore judge uncertainty or confidence scores

🚦 When to Use

Use When

  • Evaluating multiple outputs
  • Quality assurance at scale
  • A/B testing LLM responses
  • Automated content moderation

Avoid When

  • Subjective creative tasks
  • Single output generation
  • Real-time critical decisions
  • Legal/medical assessments

📊 Key Metrics

Judge Agreement: Correlation with human raters (see the sketch after this list)
Consistency: Same judgment on the same input
Discrimination: Ability to rank quality differences
False Positive Rate: Good content marked as bad
Coverage: % of edge cases handled correctly
Speed: Evaluations per second
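A sketch of the first two metrics, assuming you have human ratings for a calibration set and can re-run the judge on the same items; SciPy's `spearmanr` is one reasonable choice for measuring agreement.

```python
from scipy.stats import spearmanr  # assumption: SciPy is available

def judge_agreement(judge_scores: list[float], human_scores: list[float]) -> float:
    """Rank correlation between judge and human ratings on the same calibration items."""
    rho, _ = spearmanr(judge_scores, human_scores)
    return rho

def judge_consistency(score_runs: list[list[int]]) -> float:
    """Fraction of items that receive the same score in every repeated judging run.
    score_runs[i] holds run i's scores, aligned by item index."""
    per_item = list(zip(*score_runs))
    return sum(len(set(scores)) == 1 for scores in per_item) / len(per_item)
```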

💡 Top Use Cases

Output Selection: Generate 5 responses → Judge ranks them → Return the best to the user
Quality Gates: Check whether generated code/content meets standards before deployment
A/B Testing: Compare model versions by judging outputs on the same prompts
Safety Filtering: Evaluate outputs for harmful/biased content before serving (a panel-vote sketch follows this list)
RAG Evaluation: Judge the relevance of retrieved documents to the query
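For safety filtering and other controversial calls, a panel of judges backed by different models limits single-judge bias. A minimal majority-vote sketch, reusing the `call_llm`-style callables assumed earlier; the prompt wording is an assumption.

```python
from typing import Callable

# Prompt wording is an assumption; tighten it for your own safety policy.
SAFETY_PROMPT = (
    "You are a content safety reviewer. Answer with exactly one word, safe or unsafe, "
    "for the following output:\n\n{output}"
)

def is_safe(output: str, judges: list[Callable[[str], str]]) -> bool:
    """Majority vote over a panel of judge callables, ideally backed by different models."""
    votes = [
        call(SAFETY_PROMPT.format(output=output)).strip().lower() == "safe"
        for call in judges
    ]
    return sum(votes) > len(votes) / 2
```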

