Adaptive Context Sizing (ACS)
Dynamically adjusts context window size and content selection based on task requirements and resource constraints
Core Mechanism
Adjusts how much context the model consumes per task by selecting, compressing, and budgeting only the most relevant information. Combines retrieval→rerank→compress pipelines, adaptive-k selection, and token-level mechanisms (e.g., learned or dynamic attention spans, KV-cache selection) to balance quality with latency, token cost, and memory.
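A minimal sketch of the adaptive-k idea, assuming chunks already carry reranker scores and token counts; the `Chunk` fields, threshold, and budget values are illustrative rather than a prescribed API:

```python
# Minimal sketch of adaptive-k selection under a token budget.
# Assumes chunks have already been retrieved and scored by a reranker.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float      # reranker relevance score, higher is better
    tokens: int       # token count of the chunk

def adaptive_k_select(chunks, max_tokens=2000, min_rel_score=0.5):
    """Pick a variable number of chunks instead of a fixed k.

    Stops early when (a) the token budget is exhausted or
    (b) a chunk's score falls below min_rel_score * top score.
    """
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    if not ranked:
        return []
    top = ranked[0].score
    selected, used = [], 0
    for c in ranked:
        if c.score < min_rel_score * top:   # relevance cliff: stop adding context
            break
        if used + c.tokens > max_tokens:    # hard input-token cap
            break
        selected.append(c)
        used += c.tokens
    return selected
```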
Workflow / Steps
- Assess task: estimate difficulty, novelty, and information needs from the query.
- Candidate context: retrieve passages/snippets; consider prior turns and cached artifacts.
- Adaptive selection: apply reranking and adaptive-k to choose how much to include.
- Compression: summarize/deduplicate; keep citations and salient spans within token budget.
- Assemble prompt: structure sections and provenance; respect hard token/latency budgets.
- Generate + monitor: track utilization (lost-in-the-middle, citations, attention concentration).
- Adapt/iterate: expand or contract context on uncertainty, evaluator failures, or gaps.
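The steps above can be tied together in one control loop. This is a sketch only: `retrieve`, `rerank`, `compress`, `generate`, and `evaluate` stand in for whatever retriever, reranker, summarizer, LLM client, and evaluator a given stack provides; only the control flow is the point.

```python
# One pass of the workflow above, with context expansion on evaluator failure.
def answer(query, retrieve, rerank, compress, generate, evaluate,
           base_budget=1500, max_budget=6000):
    budget = base_budget
    while budget <= max_budget:
        candidates = retrieve(query)                   # candidate context
        ranked = rerank(query, candidates)             # input to adaptive selection
        packed, provenance = compress(ranked, budget)  # dedupe/summarize, keep citations
        prompt = "\n\n".join(
            f"[{src}] {text}" for text, src in zip(packed, provenance)
        ) + f"\n\nQuestion: {query}"
        draft = generate(prompt)                       # generate + monitor
        if evaluate(query, draft, provenance):         # evaluator pass -> done
            return draft
        budget *= 2                                    # adapt: expand context and retry
    return draft                                       # best effort after final attempt
```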
Best Practices
When NOT to Use
- Uniform, simple queries where a fixed, short prompt meets quality and SLOs.
- Hard real-time paths where retrieval/rerank/compression overhead breaks latency budgets.
- Strictly deterministic/audited flows requiring fixed inputs and reproducibility.
- Very long contexts without reranking where models under-utilize middle tokens.
Common Pitfalls
- Context stuffing without reranking → token blowups with marginal quality gains.
- Dropping provenance during compression → harder auditing and lower trust.
- Uncapped k/overlap; no early-exit → cost and latency spikes.
- Ignoring tokenizer/position effects → lost-in-the-middle and degraded utility.
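One common mitigation for the last pitfall is to reorder selected chunks so the strongest evidence sits at the edges of the prompt rather than the middle. A small illustrative helper (not tied to any particular library):

```python
# Interleave ranked chunks so the most relevant land at the prompt's edges
# and the weakest end up in the middle, where models attend least reliably.
def edge_first_order(ranked_chunks):
    """ranked_chunks: best-first list. Rank 1 goes first, rank 2 goes last,
    rank 3 second, rank 4 second-to-last, and so on."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Example: ["A", "B", "C", "D", "E"] -> ["A", "C", "E", "D", "B"]
```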
Key Features
KPIs / Success Metrics
- Answer quality/evaluator pass vs. fixed-k baseline.
- Tokens and cost per successful answer; compression ratio vs. retention.
- Latency P50/P95; throughput under load; cache hit rates.
- Reranker MRR/NDCG; retrieval hit rate; citation accuracy.
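A possible roll-up of these metrics from per-request logs; the field names and log shape are assumptions, only the metric definitions mirror the list above:

```python
# Illustrative KPI summary for an ACS deployment vs. a fixed-k baseline.
def summarize(runs):
    """runs: list of dicts with keys 'passed' (bool), 'tokens' (int),
    'raw_tokens' (int, pre-compression), 'cost_usd' (float), 'latency_ms'."""
    passed = [r for r in runs if r["passed"]]
    lat = sorted(r["latency_ms"] for r in runs)
    return {
        "pass_rate": len(passed) / len(runs),
        "cost_per_success": sum(r["cost_usd"] for r in runs) / max(len(passed), 1),
        "tokens_per_success": sum(r["tokens"] for r in runs) / max(len(passed), 1),
        "compression_ratio": sum(r["tokens"] for r in runs) / sum(r["raw_tokens"] for r in runs),
        "latency_p50_ms": lat[len(lat) // 2],
        "latency_p95_ms": lat[int(len(lat) * 0.95)],
    }
```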
Token / Resource Usage
- Total ≈ retrieval + rerank + compression + packed context + generation.
- Budget using hard caps (input/output tokens, max k, max overlap) and early-exit policies.
- Leverage KV/prefix caches; prefer paged/flash attention for long contexts.
- Use small models for scoring/reranking; reserve large models for final generation.
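A minimal guardrail config reflecting these caps; field names and defaults are illustrative, the point is that every knob has a hard ceiling plus an early-exit check rather than growing open-endedly:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextBudget:
    max_input_tokens: int = 6000    # ceiling on packed context
    max_output_tokens: int = 800
    max_k: int = 20                 # never rerank more than this many chunks
    max_chunk_overlap: float = 0.2  # fraction of duplicated tokens tolerated
    early_exit_score: float = 0.9   # stop retrieving once the top score clears this

def should_stop(selected_tokens: int, top_score: float, budget: ContextBudget) -> bool:
    """Early-exit check run after each retrieval/selection round."""
    return (selected_tokens >= budget.max_input_tokens
            or top_score >= budget.early_exit_score)
```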
Best Use Cases
- Long-document QA and synthesis with strong provenance requirements.
- Conversational assistants balancing history vs. fresh retrieval.
- Cost/latency-sensitive production RAG with variable difficulty distribution.
- On-device or memory-constrained deployments requiring aggressive compression.
References & Further Reading
Academic Papers
- Adaptive Attention Span in Transformers (Sukhbaatar et al., 2019)
- Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023)
- TokenSelect: Dynamic Token-Level KV Cache Selection (Wu et al., 2024)
- Efficient Context Selection for Long-Context QA: Adaptive-k (Taguchi et al., 2025)
- LongRoPE: Extending LLMs to Millions of Tokens (Sun et al., 2024)
Implementation Guides
Tools & Libraries
Community & Discussions