Adaptive Context Sizing (ACS)

Dynamically adjusts context window size and content selection based on task requirements and resource constraints

Complexity: medium

Core Mechanism

Adjusts how much context the model consumes per task by selecting, compressing, and budgeting only the most relevant information. Combines retrieval→rerank→compress pipelines, adaptive-k selection, and token-level mechanisms (e.g., learned or dynamic attention spans, KV-cache selection) to balance quality with latency, token cost, and memory.
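
As a minimal illustration of the selection side of this mechanism, the sketch below picks passages from reranker scores using a relative-score cutoff plus a token budget. The `Passage` fields, the `min_score_ratio` threshold, and the budget value are illustrative assumptions, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    source: str   # provenance identifier (URL, doc ID, ...)
    score: float  # reranker relevance score, assumed normalized to [0, 1]
    tokens: int   # token count under the serving model's tokenizer


def adaptive_k_select(passages: list[Passage],
                      min_score_ratio: float = 0.6,
                      token_budget: int = 2000) -> list[Passage]:
    """Choose passages by score distribution rather than a fixed k.

    Keeps passages whose score stays within `min_score_ratio` of the top
    score, and stops once the packed-context token budget is exhausted.
    """
    if not passages:
        return []
    ranked = sorted(passages, key=lambda p: p.score, reverse=True)
    cutoff = ranked[0].score * min_score_ratio
    selected, used = [], 0
    for p in ranked:
        if p.score < cutoff or used + p.tokens > token_budget:
            break
        selected.append(p)
        used += p.tokens
    return selected
```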

Workflow / Steps

  1. Assess task: estimate difficulty, novelty, and information needs from the query.
  2. Candidate context: retrieve passages/snippets; consider prior turns and cached artifacts.
  3. Adaptive selection: apply reranking and adaptive-k to choose how much to include.
  4. Compression: summarize/deduplicate; keep citations and salient spans within token budget.
  5. Assemble prompt: structure sections and provenance; respect hard token/latency budgets.
  6. Generate + monitor: track utilization (lost-in-the-middle, citations, attention concentration).
  7. Adapt/iterate: expand or contract context on uncertainty, evaluator failures, or gaps (a condensed orchestration of these steps is sketched below).
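
A condensed orchestration of these steps might look like the following. The `retriever`, `reranker`, `compressor`, `generate`, and `is_good_enough` arguments are hypothetical stand-ins for whatever components a given stack provides, and `adaptive_k_select` refers to the selection sketch under Core Mechanism; this is a sketch of the control flow, not a definitive implementation.

```python
def answer_with_adaptive_context(query: str, retriever, reranker, compressor,
                                 generate, is_good_enough,
                                 token_budget: int = 3000,
                                 max_rounds: int = 2) -> str:
    """Retrieve -> rerank -> adaptively select -> compress -> generate,
    growing the budget and retrying if the first answer looks weak."""
    answer = ""
    for _ in range(max_rounds):
        candidates = retriever.search(query, top_n=50)                   # step 2: candidate context
        scored = reranker.score(query, candidates)                       # rerank before packing
        selected = adaptive_k_select(scored, token_budget=token_budget)  # step 3: adaptive-k
        packed = compressor.compress(selected, keep_citations=True)      # step 4: compress
        prompt = (f"Answer using only the sources below.\n\n{packed}\n\n"
                  f"Question: {query}")                                  # step 5: assemble
        answer = generate(prompt)                                        # step 6: generate
        if is_good_enough(answer):                                       # step 7: adapt on failure
            break
        token_budget = int(token_budget * 1.5)                           # expand context and retry
    return answer
```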

Best Practices

  • Use rerankers (cross-encoders) before packing; avoid naive long-context stuffing.
  • Apply adaptive-k: choose passages by score distribution rather than a fixed k.
  • Compress with salience-aware summaries; preserve key entities, numbers, and citations.
  • Structure prompts with sections and provenance; mitigate "lost in the middle" by ordering.
  • Tune chunk sizes/overlap by tokenizer; measure long-context effectiveness, not just window size.
  • Log token budgets per stage; enforce hard caps and early-exit to protect p95 latency (a budgeting sketch follows this list).
  • Cache retrievals, summaries, and KV; deduplicate across turns/sessions.
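
One possible shape for the per-stage budgeting practice: clamp each prompt section to a hard cap and signal when the caller should early-exit to a shorter fallback prompt. The cap values and section names below are placeholders to tune against real latency and cost targets.

```python
# Illustrative per-stage input-token caps; tune against measured p95 latency and cost.
STAGE_CAPS = {
    "history": 800,     # carried-over conversation turns
    "retrieved": 2000,  # packed passages after rerank + adaptive-k
    "summaries": 600,   # compressed / cached artifacts
}
MAX_INPUT_TOKENS = 3000  # hard cap on the assembled prompt


def enforce_budget(section_tokens: dict[str, int]) -> tuple[dict[str, int], bool]:
    """Clamp each prompt section to its cap and report whether to early-exit.

    Returns (clamped token counts, fallback flag). The flag is set when even
    the clamped sections exceed the overall input cap, in which case the
    caller should skip further expansion and fall back to a minimal prompt.
    """
    clamped = {name: min(used, STAGE_CAPS.get(name, 0))
               for name, used in section_tokens.items()}
    return clamped, sum(clamped.values()) > MAX_INPUT_TOKENS
```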

When NOT to Use

  • Uniform, simple queries where a fixed, short prompt meets quality and SLOs.
  • Hard real-time paths where retrieval/rerank/compression overhead breaks latency budgets.
  • Strictly deterministic/audited flows requiring fixed inputs and reproducibility.
  • Very long contexts without reranking where models under-utilize middle tokens.

Common Pitfalls

  • Context stuffing without reranking → token blowups with marginal quality gains.
  • Dropping provenance during compression → harder auditing and lower trust.
  • Uncapped k/overlap; no early-exit → cost and latency spikes.
  • Ignoring tokenizer/position effects → lost-in-the-middle and degraded utility.

Key Features

  • Adaptive-k passage selection and calibrated reranking
  • Contextual compression with salience and citation preservation
  • Token and latency budgeting with early-exit and fallbacks
  • Attention/KV optimization (learned spans, token-level KV selection)
  • Provenance-aware packing and ordering to reduce middle-token decay (see the packing sketch after this list)
  • Observability: per-stage tokens, costs, and utilization metrics
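
One simple way to realize provenance-aware packing is to tag every chunk with its source and place the strongest evidence at the edges of the context, a common heuristic against "lost in the middle". The ordering rule below is an assumption rather than a prescribed algorithm; it reuses the `Passage` type from the earlier selection sketch.

```python
def pack_with_provenance(passages: list[Passage]) -> str:
    """Order passages so top-ranked evidence lands at the start and end of
    the packed context, and keep a source tag on every chunk so citations
    survive compression and can be audited later."""
    ranked = sorted(passages, key=lambda p: p.score, reverse=True)
    front, back = [], []
    for i, p in enumerate(ranked):
        (front if i % 2 == 0 else back).append(p)   # alternate edge placement
    ordered = front + back[::-1]                    # best ... weakest ... second-best
    return "\n\n".join(f"[source: {p.source}] {p.text}" for p in ordered)
```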

KPIs / Success Metrics

  • Answer quality / evaluator pass rate vs. a fixed-k baseline.
  • Tokens and cost per successful answer; compression ratio vs. retention (see the metric helpers after this list).
  • Latency P50/P95; throughput under load; cache hit rates.
  • Reranker MRR/NDCG; retrieval hit rate; citation accuracy.
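
For concreteness, a few of these metrics reduce to small formulas. The helpers below are illustrative definitions (e.g., MRR computed from the 1-based rank of the first relevant passage per query) and are not tied to any particular evaluation framework.

```python
def compression_ratio(candidate_tokens: int, packed_tokens: int) -> float:
    """Fraction of candidate-context tokens removed before packing."""
    return 1.0 - packed_tokens / max(candidate_tokens, 1)


def cost_per_success(total_cost_usd: float, passed_answers: int) -> float:
    """Spend divided by answers that passed the evaluator."""
    return total_cost_usd / max(passed_answers, 1)


def mean_reciprocal_rank(first_relevant_rank: list[int]) -> float:
    """MRR over queries; each entry is the 1-based rank of the first
    relevant passage, or 0 when nothing relevant was retrieved."""
    return sum(1.0 / r for r in first_relevant_rank if r > 0) / max(len(first_relevant_rank), 1)
```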

Token / Resource Usage

  • Total ≈ retrieval + rerank + compression + packed context + generation (a per-stage accounting sketch follows this list).
  • Budget using hard caps (input/output tokens, max k, max overlap) and early-exit policies.
  • Leverage KV/prefix caches; prefer paged/flash attention for long contexts.
  • Use small models for scoring/reranking; reserve large models for final generation.
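
A per-stage accounting of that total, with the cheap-model / expensive-model split from the last point, might look like the sketch below; the per-1k prices are placeholders, not real vendor rates.

```python
from dataclasses import dataclass


@dataclass
class StageTokens:
    retrieval: int       # query expansion / embedding calls
    rerank: int          # cross-encoder scoring (small model)
    compression: int     # summarization passes (small model)
    packed_context: int  # final context placed in the prompt
    generation: int      # output tokens from the large model

    def total(self) -> int:
        # Total ≈ retrieval + rerank + compression + packed context + generation
        return (self.retrieval + self.rerank + self.compression
                + self.packed_context + self.generation)


def estimate_cost(t: StageTokens,
                  small_usd_per_1k: float = 0.0005,
                  large_usd_per_1k: float = 0.01) -> float:
    """Rough cost split: small models handle scoring and compression;
    the large model only sees the packed context and produces the output."""
    small = (t.retrieval + t.rerank + t.compression) / 1000 * small_usd_per_1k
    large = (t.packed_context + t.generation) / 1000 * large_usd_per_1k
    return small + large
```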

Best Use Cases

  • Long-document QA and synthesis with strong provenance requirements.
  • Conversational assistants balancing history vs. fresh retrieval.
  • Cost/latency-sensitive production RAG with variable difficulty distribution.
  • On-device or memory-constrained deployments requiring aggressive compression.
