⚡

Test-Time Scaling (TTS)

Improving model performance by spending more computation at inference time rather than by using a larger model

Complexity: high

Category: Learning and Adaptation

🎯 30-Second Overview

Pattern: Allocate additional compute at inference time to improve performance through multiple attempts and verification

Why: Enables better accuracy on complex tasks, flexible compute allocation, and performance scaling without retraining

Key Insight: On reasoning tasks, more inference-time compute can partly substitute for a larger model or more training data, provided candidate answers can be verified or aggregated reliably
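
A rough way to see why extra attempts help is the worked equation below. It rests on two simplifying assumptions that are not part of the original text: each attempt is independently correct with probability p, and a perfect verifier picks a correct attempt whenever one exists. The symbols p, K, and P_solve are introduced here for illustration only.

```latex
\[
P_{\text{solve}}(K) = 1 - (1 - p)^{K},
\qquad
p = 0.3,\; K = 8 \;\Rightarrow\; P_{\text{solve}}(8) = 1 - 0.7^{8} \approx 0.942 .
\]
```

Real attempts are correlated and real verifiers make mistakes, so this should be read as an optimistic estimate rather than a guarantee.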

⚡ Quick Implementation

1. Generate: Create multiple reasoning paths or solutions

2. Verify: Use a verification model or self-consistency checks

3. Rank: Score and rank solutions by quality/confidence

4. Select: Choose the best solution or aggregate the top candidates

5. Scale: Increase the compute budget for better performance

Example: query → multiple_attempts → verification/ranking → best_solution + increased_compute
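
The generate, verify, rank, and select steps above can be sketched as a best-of-N loop. This is a minimal sketch, not a reference implementation: `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names, and the toy generator and verifier stand in for a real model call and a real verification model.

```python
import itertools
from typing import Callable, List, Tuple


def best_of_n(
    query: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 8,
) -> Tuple[str, float]:
    """Generate n candidates, score each one, and return the top (answer, score)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(query) for _ in range(n)]  # 1. Generate
    scored = [(c, score(query, c)) for c in candidates]          # 2. Verify
    scored.sort(key=lambda pair: pair[1], reverse=True)          # 3. Rank
    return scored[0]                                             # 4. Select


# Hypothetical stand-ins: a "model" that cycles through fixed guesses and a
# verifier that checks the arithmetic exactly.
_guesses = itertools.cycle(["41", "43", "42", "40"])


def toy_generate(query: str) -> str:
    return next(_guesses)


def toy_score(query: str, answer: str) -> float:
    return 1.0 if int(answer) == 17 + 25 else 0.0


answer, confidence = best_of_n("What is 17 + 25?", toy_generate, toy_score, n=8)
print(answer, confidence)  # prints: 42 1.0
```

Step 5 (Scale) corresponds to raising `n`; whether that helps depends on how reliable the scoring function is.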

📋 Do's & Don'ts

✅ Use process-based verification over outcome-only evaluation

✅ Implement majority voting and self-consistency checks

✅ Scale compute allocation based on problem difficulty

✅ Use search algorithms and tree-based reasoning

✅ Monitor latency vs accuracy trade-offs carefully

✅ Implement early stopping for confident solutions

❌ Apply uniform compute to all problems regardless of difficulty

❌ Ignore verification quality and just generate more samples

❌ Use test-time scaling without proper evaluation frameworks

❌ Scale compute without considering inference cost budgets

❌ Rely solely on quantity without improving reasoning quality
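
Several of the points above (majority voting, difficulty-dependent compute, early stopping) can be combined in one loop. The sketch below is one possible arrangement under stated assumptions: `adaptive_majority_vote` and its parameters are hypothetical, the 0.8 agreement threshold is an arbitrary illustrative value, and `sample_answer` stands in for a real model call that returns a final answer string.

```python
from collections import Counter
from typing import Callable, Tuple


def adaptive_majority_vote(
    query: str,
    sample_answer: Callable[[str], str],
    min_samples: int = 3,
    max_samples: int = 16,
    agreement: float = 0.8,
) -> Tuple[str, int]:
    """Sample answers until the leading answer's vote share reaches `agreement`
    (after at least `min_samples`) or `max_samples` is spent.
    Returns (majority answer, number of samples used)."""
    if not 1 <= min_samples <= max_samples:
        raise ValueError("require 1 <= min_samples <= max_samples")
    votes: Counter = Counter()
    for used in range(1, max_samples + 1):
        votes[sample_answer(query)] += 1
        leader, count = votes.most_common(1)[0]
        if used >= min_samples and count / used >= agreement:
            return leader, used  # early stop: samples agree, so stop spending
    return leader, max_samples   # budget exhausted: fall back to plain majority


# Hypothetical "easy" problem: every sample agrees, so only 3 samples are used.
easy = iter(["7"] * 16)
print(adaptive_majority_vote("easy", lambda q: next(easy)))  # ('7', 3)

# Hypothetical "hard" problem: samples disagree, so the full budget is spent.
hard = iter(["7", "9", "7", "9", "7", "9"])
print(adaptive_majority_vote("hard", lambda q: next(hard), max_samples=6))  # ('7', 6)
```

Easy queries exit after a few samples while hard ones consume the full budget, which is the opposite of applying uniform compute to every problem.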

🚦 When to Use

Use When

• Complex reasoning tasks requiring multiple solution paths
• High-stakes decisions where accuracy is prioritized over speed
• Problems where verification is easier than generation
• Tasks with clear objective evaluation criteria
• Applications with flexible inference time budgets

Avoid When

• Simple tasks with straightforward solutions
• Real-time applications with strict latency requirements
• Problems without reliable verification methods
• Cost-sensitive applications with tight budgets
• Tasks where first attempt is typically sufficient

📊 Key Metrics

Accuracy@K: Accuracy of the best answer among K attempts

Compute Efficiency: Performance gain per unit of additional inference compute

Verification Accuracy: How often ranking/selection picks a correct solution

Latency Scaling: How inference time grows with the compute allocated

Pass@K Rate: Fraction of problems solved by at least one of K attempts

Cost-Performance Ratio: Accuracy improvement per dollar spent
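
As an illustration of how the Pass@K and efficiency metrics might be computed, the sketch below uses the combinatorial pass@k estimator that is commonly used in code-generation evaluation (it is not taken from this page): given n samples per problem of which c are correct, it estimates the chance that at least one of k drawn samples is correct. The function names and the per-sample efficiency definition are assumptions made for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct) from n samples, c correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets that contain no correct sample;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def gain_per_extra_sample(acc_single: float, acc_at_k: float, k: int) -> float:
    """Accuracy gained per additional sample beyond the first (assumed definition)."""
    if k < 2:
        raise ValueError("k must be at least 2")
    return (acc_at_k - acc_single) / (k - 1)


# 10 samples for one problem, 3 of them correct.
for k in (1, 5, 10):
    print(k, round(pass_at_k(n=10, c=3, k=k), 3))  # 0.3, 0.917, 1.0
```

Dividing the same accuracy gain by the added dollar cost instead of the sample count gives a cost-performance figure in the sense listed above.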

💡 Top Use Cases

Mathematical Reasoning: Multiple solution paths for complex proofs and problem solving

Code Generation: Generate and verify multiple implementations to find optimal solutions

Scientific Discovery: Explore multiple hypotheses and experimental designs

Strategic Planning: Evaluate multiple scenarios and decision pathways

Creative Problem Solving: Generate diverse solutions and select most promising approaches

Competitive Programming: Systematic solution exploration with verification
