Agentic Design Patterns

💰 Cost-Aware Model Selection (CAMS)

Intelligently selects AI models based on cost-performance trade-offs for specific tasks

Complexity: medium

Core Mechanism

Cost-Aware Model Selection dynamically routes requests across multiple models (and providers) to optimize the cost–quality–latency trade-off per task. It typically combines lightweight heuristics or learned routers with cascades: attempt a low-cost model first, escalate to higher-capability models only when confidence is low, quality thresholds are not met, or SLAs require it. Budgets, quality gates, and per-tenant policies govern real-time decisions with continuous feedback from evaluation data.
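
The cascade loop fits in a few lines. Below is a minimal Python sketch, assuming a hypothetical call_model client that returns an answer plus a confidence score in [0, 1] (e.g. derived from logprobs or a verifier); tier names, prices, and thresholds are illustrative, not any specific provider's API.

    from dataclasses import dataclass

    @dataclass
    class ModelTier:
        name: str            # illustrative label, e.g. "small" or "premium"
        cost_per_1k: float   # blended $ per 1K tokens (assumed pricing)

    def call_model(tier: ModelTier, prompt: str) -> tuple[str, float]:
        # Placeholder for a real provider SDK call. Assumption: a confidence
        # score in [0, 1] is available (e.g. from logprobs or a verifier).
        confidence = 0.6 if tier.name == "small" else 0.9
        return f"[{tier.name}] answer", confidence

    def cascade(prompt: str, tiers: list[ModelTier], min_confidence: float = 0.8) -> str:
        """Try the cheapest tier first; escalate only if the quality gate fails."""
        answer = ""
        for tier in sorted(tiers, key=lambda t: t.cost_per_1k):
            answer, confidence = call_model(tier, prompt)
            if confidence >= min_confidence:
                return answer        # gate met: no further escalation
        return answer                # best effort from the most capable tier

Ordering tiers by price means easy requests stop at the cheap model and only hard ones pay for escalation.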

Workflow / Steps

  1. Define objectives: target quality (e.g., pass rate on a golden set), latency SLOs, and budget caps.
  2. Inventory models: capabilities, pricing per 1K tokens (input/output), context limits, regions, reliability.
  3. Build an evaluation set: representative prompts with ground truth or human-rated quality rubrics.
  4. Design routing policy: rules or ML router using features (task type, length, uncertainty/confidence, user tier); a rules-based sketch follows this list.
  5. Implement cascades: start with cheaper/faster models; escalate on low confidence, safety triggers, or quality shortfall.
  6. Add governance: per-tenant budgets, regional/compliance routing, allow/deny lists, kill switches.
  7. Monitor and learn: log tokens, cost, latency, quality; periodically retrain routers and refresh pricing tables.
  8. Release safely: canary new routes/models, run A/B against baseline, roll back on KPI regression.
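
Step 4 can start as plain rules before graduating to a learned router. A minimal sketch, with assumed route names, tiers, and thresholds:

    def choose_route(task_type: str, prompt_tokens: int, user_tier: str) -> str:
        """Rules-based router; route names, tiers, and thresholds are illustrative."""
        if user_tier == "enterprise":
            return "premium"         # strict-SLA traffic skips the cascade
        if prompt_tokens > 30_000:
            return "long-context"    # fit the context window before anything else
        if task_type in {"extraction", "classification"} and prompt_tokens < 2_000:
            return "small"           # short, well-structured tasks go cheap
        return "medium"              # default mid-tier; the cascade handles the rest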

Best Practices

Segment traffic by priority/tier; reserve premium models for high-value or strict-SLA requests.
Maintain up-to-date pricing tables per provider and account for tokenization differences; verify input vs. output token costs.
Use measurable quality thresholds (rubrics, automated checks) and confidence/uncertainty gating.
Cache frequent results and partial reasoning; reuse embeddings/summaries to reduce repeated tokens (see the caching sketch after this list).
Design for graceful degradation and explicit fallbacks across providers/regions.
Continuously evaluate with a golden set; monitor drift and re-tune router thresholds regularly.
Log granular telemetry: tokens, cost, latency percentiles, route decisions and reasons.
Isolate tenants with per-tenant budgets/quotas; protect critical paths with strict SLAs.
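
For the caching practice above, a minimal sketch: a content-addressed cache keyed on model plus normalized prompt, so repeated requests cost zero tokens. The helper names are hypothetical.

    import hashlib

    _cache: dict[str, str] = {}

    def _key(model: str, prompt: str) -> str:
        # Hash model + normalized prompt so identical requests share one entry.
        return hashlib.sha256(f"{model}\x00{prompt.strip()}".encode()).hexdigest()

    def cached_call(model: str, prompt: str, call) -> str:
        """`call` is any (model, prompt) -> str function; tokens are paid only on a miss."""
        key = _key(model, prompt)
        if key not in _cache:
            _cache[key] = call(model, prompt)
        return _cache[key]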

When NOT to Use

  • Single-model deployments that already meet cost, quality, and latency targets with low variance.
  • Regulatory/contractual constraints forbidding cross-region/provider routing or quality variance.
  • Very low traffic where router complexity and observability overhead outweigh cost savings.

Common Pitfalls

  • Stale pricing/capabilities tables leading to suboptimal or non-compliant routing.
  • Missing fallbacks and budget guards; outages or spikes cause failures and runaway spend.
  • Quality regressions from uncalibrated confidence thresholds or evaluation drift.
  • Ignoring tokenization differences (providers count tokens differently) and output token multipliers.
  • Mixing sensitive data with cheaper models that lack regional/compliance guarantees.

Key Features

Multi-model pool with provider failover and regional routing
Quality gates with confidence/uncertainty thresholds and automated checks
Budget caps per tenant/route with real-time enforcement and alerts (sketched below)
Dynamic cascades and progressive escalation
Offline evaluation sets and continuous learning routers
Canary releases, policy-based overrides, and kill switches
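
A minimal sketch of per-tenant budget enforcement; the BudgetGuard class and caps are illustrative, and a production version would persist spend and emit alerts.

    from collections import defaultdict

    class BudgetGuard:
        """Per-tenant spend tracking with a hard cap; caps are illustrative."""
        def __init__(self, caps_usd: dict[str, float]):
            self.caps = caps_usd
            self.spent: dict[str, float] = defaultdict(float)

        def allow(self, tenant: str, est_cost_usd: float) -> bool:
            # Deny (or downgrade to a cheaper route) once the cap would be exceeded.
            return self.spent[tenant] + est_cost_usd <= self.caps.get(tenant, 0.0)

        def charge(self, tenant: str, cost_usd: float) -> None:
            self.spent[tenant] += cost_usd

    guard = BudgetGuard({"acme": 100.0})
    if guard.allow("acme", est_cost_usd=0.02):
        guard.charge("acme", 0.02)   # record actual spend after the call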

KPIs / Success Metrics

  • Cost efficiency: $/request, tokens per dollar, budget adherence rate (computed in the sketch after this list).
  • Quality: pass rate vs. golden set, human rating uplift, escalation rate.
  • Routing accuracy: agreement with an oracle/baseline router; share of avoidable escalations prevented.
  • Latency: p50/p95 end-to-end and per-route; escalation overhead.
  • Reliability: timeout/error rate, failover success, on-call incidents avoided.
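
Two of these fall out of per-request route logs directly. A sketch under an assumed log schema ('cost_usd' and 'escalated' per record):

    def kpis(logs: list[dict]) -> dict[str, float]:
        """Derive two KPIs from per-request route logs. Assumed schema:
        each record has 'cost_usd' (float) and 'escalated' (bool)."""
        n = len(logs) or 1           # avoid division by zero on empty logs
        return {
            "cost_per_request": sum(r["cost_usd"] for r in logs) / n,
            "escalation_rate": sum(r["escalated"] for r in logs) / n,
        }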

Token / Resource Usage

  • Track input/output tokens separately; many providers price them differently (see the cost sketch after this list).
  • Measure cascade overhead (multiple calls on one request); gate by confidence to avoid unnecessary escalations.
  • Constrain context length; apply compression/summarization and retrieval planning to control tokens.
  • Use caching for frequent prompts/intermediates; persist embeddings/summaries for reuse.
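
A minimal cost-accounting sketch for the first point; the price table is illustrative and must be kept in sync with each provider's real rates.

    # Illustrative price table ($ per 1K tokens); not real provider prices.
    PRICES = {
        "small":   {"in": 0.0005, "out": 0.0015},
        "premium": {"in": 0.0100, "out": 0.0300},
    }

    def request_cost(model: str, tokens_in: int, tokens_out: int) -> float:
        p = PRICES[model]
        # Output tokens often cost a multiple of input tokens, so the two
        # counts must be metered and priced separately.
        return tokens_in / 1_000 * p["in"] + tokens_out / 1_000 * p["out"]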

Best Use Cases

  • High-volume support automation with strict budgets and variable difficulty.
  • Summarization, extraction, and Q&A where many requests are easy but some need escalation.
  • Multi-tenant platforms offering cost/quality tiers and enterprise SLAs.
  • Global deployments requiring regional routing and compliance-aware fallback.

References & Further Reading

Patterns

closed

Loading...