LLM as Judge (LJ)
A specific Producer-Critic implementation in which an LLM acts as the critic that evaluates outputs
🎯 30-Second Overview
Pattern: Use LLMs to evaluate and score other LLM outputs at scale
Why: Automates quality assessment, enables best-of-N selection, and scales evaluation far beyond manual review
Key Insight: A specific implementation of the Producer-Critic pattern in which the critic is an LLM guided by an evaluation prompt or rubric
⚡ Quick Implementation
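A minimal sketch of the judge loop, assuming a generic `call_llm(prompt) -> str` helper that wraps whatever provider SDK you use; the prompt wording and the `judge` / `best_of_n` helpers are illustrative, not taken from a specific library.

```python
# Minimal LLM-as-Judge sketch. `call_llm(prompt) -> str` is an assumed helper that
# wraps your provider's SDK; swap in your own client call.
import json
import re
from typing import Callable

JUDGE_PROMPT = """You are an impartial judge. Evaluate the response below against the task.

Task:
{task}

Response:
{response}

Score the response from 1 (poor) to 10 (excellent) on correctness, relevance, and clarity.
Reply with JSON only: {{"score": <int>, "rationale": "<one sentence>"}}"""


def judge(task: str, response: str, call_llm: Callable[[str], str]) -> dict:
    """Ask the judge LLM to score a single producer output."""
    raw = call_llm(JUDGE_PROMPT.format(task=task, response=response))
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # tolerate extra prose around the JSON
    return json.loads(match.group(0)) if match else {"score": 0, "rationale": "unparseable"}


def best_of_n(task: str, candidates: list[str], call_llm: Callable[[str], str]) -> str:
    """Best-of-N selection: judge every candidate and keep the highest-scoring one."""
    scored = [(judge(task, c, call_llm)["score"], c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```

Returning structured JSON keeps the scores machine-readable, and `best_of_n` shows the pattern's most common use: generate several producer outputs, then keep the one the judge rates highest.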
📋 Do's & Don'ts
🚦 When to Use
Use When
- Evaluating multiple outputs
- Quality assurance at scale
- A/B testing LLM responses (see the pairwise sketch after this list)
- Automated content moderation
Avoid When
- Subjective creative tasks
- Single output generation
- Real-time critical decisions
- Legal/medical assessments
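For the A/B-testing case above, a pairwise-comparison sketch under the same assumptions (hypothetical `call_llm` helper, illustrative prompt). The comparison is run twice with the candidate order swapped, since LLM judges can show position bias (Wang et al., 2023); only a consistent verdict counts as a win.

```python
# Pairwise A/B judging sketch with position-swap debiasing.
# `call_llm(prompt) -> str` is the same assumed helper as in Quick Implementation.
from typing import Callable

PAIRWISE_PROMPT = """You are an impartial judge. Given the task, decide which response is better.

Task:
{task}

Response A:
{a}

Response B:
{b}

Answer with exactly one letter: A or B."""


def pairwise_judge(task: str, a: str, b: str, call_llm: Callable[[str], str]) -> str:
    """Return 'A', 'B', or 'tie'. The comparison runs twice with the order swapped;
    a candidate wins only if it is preferred in both orderings."""
    first = call_llm(PAIRWISE_PROMPT.format(task=task, a=a, b=b)).strip().upper()
    second = call_llm(PAIRWISE_PROMPT.format(task=task, a=b, b=a)).strip().upper()
    if first.startswith("A") and second.startswith("B"):
        return "A"
    if first.startswith("B") and second.startswith("A"):
        return "B"
    return "tie"
```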
📊 Key Metrics
💡 Top Use Cases
References & Further Reading
Deepen your understanding with these curated resources
Academic Papers
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al., 2023)
Large Language Models are not Fair Evaluators (Wang et al., 2023)
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models (Kim et al., 2024)
Contribute to this collection
Know a great resource? Submit a pull request to add it.