Agentic Design Patterns

💰 Cost-Aware Model Selection (CAMS)

Intelligently selects AI models based on cost-performance trade-offs for specific tasks

Complexity: medium

Core Mechanism

Cost-Aware Model Selection dynamically routes requests across multiple models (and providers) to optimize the cost–quality–latency trade-off per task. It typically combines lightweight heuristics or learned routers with cascades: attempt a low-cost model first, escalate to higher-capability models only when confidence is low, quality thresholds are not met, or SLAs require it. Budgets, quality gates, and per-tenant policies govern real-time decisions with continuous feedback from evaluation data.
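
The cascade loop fits in a few lines. Below is a minimal Python sketch, assuming a hypothetical call_model client that returns an answer plus a confidence score in [0, 1] (e.g. derived from logprobs or a verifier); tier names, prices, and thresholds are illustrative, not any specific provider's API.

    from dataclasses import dataclass

    @dataclass
    class ModelTier:
        name: str            # illustrative label, e.g. "small" or "premium"
        cost_per_1k: float   # blended $ per 1K tokens (assumed pricing)

    def call_model(tier: ModelTier, prompt: str) -> tuple[str, float]:
        # Placeholder for a real provider SDK call. Assumption: a confidence
        # score in [0, 1] is available (e.g. from logprobs or a verifier).
        confidence = 0.6 if tier.name == "small" else 0.9
        return f"[{tier.name}] answer", confidence

    def cascade(prompt: str, tiers: list[ModelTier], min_confidence: float = 0.8) -> str:
        """Try the cheapest tier first; escalate only if the quality gate fails."""
        answer = ""
        for tier in sorted(tiers, key=lambda t: t.cost_per_1k):
            answer, confidence = call_model(tier, prompt)
            if confidence >= min_confidence:
                return answer        # gate met: no further escalation
        return answer                # best effort from the most capable tier

Ordering tiers by price means easy requests stop at the cheap model and only hard ones pay for escalation.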

Workflow / Steps

  1. Define objectives: target quality (e.g., pass rate on a golden set), latency SLOs, and budget caps.
  2. Inventory models: capabilities, pricing per 1K tokens (input/output), context limits, regions, reliability.
  3. Build an evaluation set: representative prompts with ground truth or human-rated quality rubrics.
  4. Design routing policy: rules or ML router using features (task type, length, uncertainty/confidence, user tier); a rules-based sketch follows this list.
  5. Implement cascades: start with cheaper/faster models; escalate on low confidence, safety triggers, or quality shortfall.
  6. Add governance: per-tenant budgets, regional/compliance routing, allow/deny lists, kill switches.
  7. Monitor and learn: log tokens, cost, latency, quality; periodically retrain routers and refresh pricing tables.
  8. Release safely: canary new routes/models, run A/B against baseline, roll back on KPI regression.
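
Step 4 can start as plain rules before graduating to a learned router. A minimal sketch, with assumed route names, tiers, and thresholds:

    def choose_route(task_type: str, prompt_tokens: int, user_tier: str) -> str:
        """Rules-based router; route names, tiers, and thresholds are illustrative."""
        if user_tier == "enterprise":
            return "premium"         # strict-SLA traffic skips the cascade
        if prompt_tokens > 30_000:
            return "long-context"    # fit the context window before anything else
        if task_type in {"extraction", "classification"} and prompt_tokens < 2_000:
            return "small"           # short, well-structured tasks go cheap
        return "medium"              # default mid-tier; the cascade handles the rest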

Best Practices

Segment traffic by priority/tier; reserve premium models for high-value or strict-SLA requests.
Maintain up-to-date pricing tables per provider and account for tokenization differences; verify input vs. output token costs.
Use measurable quality thresholds (rubrics, automated checks) and confidence/uncertainty gating.
Cache frequent results and partial reasoning; reuse embeddings/summaries to reduce repeated tokens (see the caching sketch after this list).
Design for graceful degradation and explicit fallbacks across providers/regions.
Continuously evaluate with a golden set; monitor drift and re-tune router thresholds regularly.
Log granular telemetry: tokens, cost, latency percentiles, route decisions and reasons.
Isolate tenants with per-tenant budgets/quotas; protect critical paths with strict SLAs.
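
For the caching practice above, a minimal sketch: a content-addressed cache keyed on model plus normalized prompt, so repeated requests cost zero tokens. The helper names are hypothetical.

    import hashlib

    _cache: dict[str, str] = {}

    def _key(model: str, prompt: str) -> str:
        # Hash model + normalized prompt so identical requests share one entry.
        return hashlib.sha256(f"{model}\x00{prompt.strip()}".encode()).hexdigest()

    def cached_call(model: str, prompt: str, call) -> str:
        """`call` is any (model, prompt) -> str function; tokens are paid only on a miss."""
        key = _key(model, prompt)
        if key not in _cache:
            _cache[key] = call(model, prompt)
        return _cache[key]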

When NOT to Use

  • Single-model deployments that already meet cost, quality, and latency targets with low variance.
  • Regulatory/contractual constraints forbidding cross-region/provider routing or quality variance.
  • Very low traffic where router complexity and observability overhead outweigh cost savings.

Common Pitfalls

  • Stale pricing/capabilities tables leading to suboptimal or non-compliant routing.
  • Missing fallbacks and budget guards; outages or spikes cause failures and runaway spend.
  • Quality regressions from uncalibrated confidence thresholds or evaluation drift.
  • Ignoring tokenization differences (providers count tokens differently) and output token multipliers.
  • Mixing sensitive data with cheaper models that lack regional/compliance guarantees.

Key Features

Multi-model pool with provider failover and regional routing
Quality gates with confidence/uncertainty thresholds and automated checks
Budget caps per tenant/route with real-time enforcement and alerts (sketched below)
Dynamic cascades and progressive escalation
Offline evaluation sets and continuous learning routers
Canary releases, policy-based overrides, and kill switches
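
A minimal sketch of per-tenant budget enforcement; the BudgetGuard class and caps are illustrative, and a production version would persist spend and emit alerts.

    from collections import defaultdict

    class BudgetGuard:
        """Per-tenant spend tracking with a hard cap; caps are illustrative."""
        def __init__(self, caps_usd: dict[str, float]):
            self.caps = caps_usd
            self.spent: dict[str, float] = defaultdict(float)

        def allow(self, tenant: str, est_cost_usd: float) -> bool:
            # Deny (or downgrade to a cheaper route) once the cap would be exceeded.
            return self.spent[tenant] + est_cost_usd <= self.caps.get(tenant, 0.0)

        def charge(self, tenant: str, cost_usd: float) -> None:
            self.spent[tenant] += cost_usd

    guard = BudgetGuard({"acme": 100.0})
    if guard.allow("acme", est_cost_usd=0.02):
        guard.charge("acme", 0.02)   # record actual spend after the call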

KPIs / Success Metrics

  • Cost efficiency: $/request, tokens per dollar, budget adherence rate (computed in the sketch after this list).
  • Quality: pass rate vs. golden set, human rating uplift, escalation rate.
  • Routing accuracy: agreement with an oracle/baseline router; share of avoidable escalations prevented.
  • Latency: p50/p95 end-to-end and per-route; escalation overhead.
  • Reliability: timeout/error rate, failover success, on-call incidents avoided.
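
Two of these fall out of per-request route logs directly. A sketch under an assumed log schema ('cost_usd' and 'escalated' per record):

    def kpis(logs: list[dict]) -> dict[str, float]:
        """Derive two KPIs from per-request route logs. Assumed schema:
        each record has 'cost_usd' (float) and 'escalated' (bool)."""
        n = len(logs) or 1           # avoid division by zero on empty logs
        return {
            "cost_per_request": sum(r["cost_usd"] for r in logs) / n,
            "escalation_rate": sum(r["escalated"] for r in logs) / n,
        }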

Token / Resource Usage

  • Track input/output tokens separately; many providers price them differently (see the cost sketch after this list).
  • Measure cascade overhead (multiple calls on one request); gate by confidence to avoid unnecessary escalations.
  • Constrain context length; apply compression/summarization and retrieval planning to control tokens.
  • Use caching for frequent prompts/intermediates; persist embeddings/summaries for reuse.
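
A minimal cost-accounting sketch for the first point; the price table is illustrative and must be kept in sync with each provider's real rates.

    # Illustrative price table ($ per 1K tokens); not real provider prices.
    PRICES = {
        "small":   {"in": 0.0005, "out": 0.0015},
        "premium": {"in": 0.0100, "out": 0.0300},
    }

    def request_cost(model: str, tokens_in: int, tokens_out: int) -> float:
        p = PRICES[model]
        # Output tokens often cost a multiple of input tokens, so the two
        # counts must be metered and priced separately.
        return tokens_in / 1_000 * p["in"] + tokens_out / 1_000 * p["out"]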

Best Use Cases

  • High-volume support automation with strict budgets and variable difficulty.
  • Summarization, extraction, and Q&A where many requests are easy but some need escalation.
  • Multi-tenant platforms offering cost/quality tiers and enterprise SLAs.
  • Global deployments requiring regional routing and compliance-aware fallback.

References & Further Reading

Patterns

closed

Loading...