Resource-Aware Scheduling (RAS)
Dynamically schedules AI tasks based on available computational resources and constraints
Core Mechanism
Resource-Aware Scheduling dynamically matches AI tasks with compute resources (CPU/GPU/TPU, memory, bandwidth) using live telemetry and SLO/cost constraints. It combines techniques like dynamic batching, queueing, priority and preemption, model placement, and autoscaling to meet latency/throughput targets while maximizing utilization and respecting power/thermal and budget limits.
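As a concrete illustration of this matching step, the Python sketch below picks a worker from live telemetry, subject to a memory-headroom constraint and a simple congestion score. The WorkerTelemetry and Task fields, the headroom default, and the scoring weight are illustrative assumptions, not the API of any particular scheduler.

```python
# Minimal sketch of a resource-aware placement decision (hypothetical data model;
# real schedulers pull these signals from telemetry such as DCGM/Prometheus).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class WorkerTelemetry:
    name: str
    gpu_util: float         # SM utilization, 0.0-1.0
    free_gpu_mem_gb: float  # free device memory in GiB
    queue_depth: int        # requests already waiting on this worker

@dataclass
class Task:
    est_mem_gb: float       # estimated weights + KV-cache footprint
    slo_ms: float           # end-to-end latency target

def place(task: Task, workers: List[WorkerTelemetry],
          mem_headroom_gb: float = 2.0) -> Optional[WorkerTelemetry]:
    """Pick a worker that fits the task's memory needs (plus headroom) and
    minimizes a simple congestion score; None means queue, shed, or scale out."""
    feasible = [w for w in workers
                if w.free_gpu_mem_gb >= task.est_mem_gb + mem_headroom_gb]
    if not feasible:
        return None  # admission control takes over upstream
    # Score blends queue depth and utilization; the weight is illustrative.
    return min(feasible, key=lambda w: w.queue_depth + 2.0 * w.gpu_util)

workers = [WorkerTelemetry("gpu-0", 0.92, 3.0, 8),
           WorkerTelemetry("gpu-1", 0.40, 12.0, 2)]
choice = place(Task(est_mem_gb=6.0, slo_ms=500), workers)
print(choice.name if choice else "no capacity")  # gpu-1
```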
Workflow / Steps
- Profile tasks and models: latency SLOs, batchability, memory footprints (KV cache/weights), throughput targets.
- Instrument resources: collect CPU/GPU utilization, memory, queue depth, temperature, power, and cost signals.
- Admission control & routing: classify request priority/tier; route to suitable model/instance/region.
- Scheduling & batching: form dynamic batches within SLO; apply priority, preemption, or isolation where needed.
- Autoscaling & placement: scale replicas by queue/latency; place models to satisfy memory/affinity constraints.
- Feedback control: monitor SLO attainment; adapt batch size/concurrency and adjust priorities/budgets (a control-loop sketch follows this list).
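The feedback-control step can be as simple as a hysteresis controller over recent latencies. The sketch below adjusts a dynamic-batching cap against a p95 latency target; the function name, thresholds, and step sizes are assumptions for illustration, not a serving framework's built-in API.

```python
import statistics

def adjust_max_batch(latency_samples_ms: list, slo_ms: float,
                     current_max_batch: int, lo: int = 1, hi: int = 64) -> int:
    """Hysteresis controller: shrink the batching cap when p95 latency exceeds
    the SLO, grow it cautiously when there is clear slack, otherwise hold."""
    if len(latency_samples_ms) < 2:
        return current_max_batch
    p95 = statistics.quantiles(latency_samples_ms, n=20)[18]  # ~95th percentile
    if p95 > slo_ms:                 # over budget: cut batching aggressively
        return max(lo, current_max_batch // 2)
    if p95 < 0.7 * slo_ms:           # comfortable slack: grow slowly
        return min(hi, current_max_batch + 1)
    return current_max_batch

# One control tick over recent end-to-end latencies (ms) against a 400 ms SLO.
recent = [180, 220, 260, 300, 350, 390, 410, 450, 470, 495, 505, 520]
print(adjust_max_batch(recent, slo_ms=400, current_max_batch=32))  # 16
```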
Best Practices
When NOT to Use
- Simple, homogeneous workloads where static round-robin meets SLOs cheaply.
- Hard real-time control loops where any batching/queuing overhead risks deadline misses.
- Single-tenant, overprovisioned environments with negligible contention and stable load.
Common Pitfalls
- Over-batching causing p95/p99 latency regressions; ignoring per-tier SLO budgets.
- Insufficient GPU memory accounting for KV cache/token growth, leading to OOMs and restarts (see the sizing sketch after this list).
- Hotspot queues and head-of-line blocking; lack of isolation between tiers/tenants.
- Reactive-only autoscaling with cold start penalties; no warm capacity for bursts.
- Poor observability: missing queue time vs. inference time breakdown; blind tuning.
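The memory-accounting pitfall is easiest to avoid with an explicit KV-cache estimate per batch. The sketch below uses the standard per-token formula (2 x layers x KV heads x head dim x dtype bytes); the example model dimensions are placeholders, so substitute your model's actual layer count, head count, and head dimension.

```python
def kv_cache_bytes(batch_size: int, seq_len: int, n_layers: int,
                   n_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Per-token KV cache = 2 (K and V) * layers * KV heads * head dim * dtype size."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return batch_size * seq_len * per_token

# E.g. a 32-layer model with 8 KV heads of dim 128 in fp16, batch 16, 4k context:
gib = kv_cache_bytes(16, 4096, 32, 8, 128, 2) / 2**30
print(f"KV cache ~= {gib:.1f} GiB")  # grows linearly with batch size and sequence length
```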
Key Features
KPIs / Success Metrics
- Latency: p50/p95 end-to-end latency and queue-time share; SLO attainment rate per tier (see the sketch after this list).
- Throughput: requests/s and tokens/s per instance and fleet; batch effectiveness.
- Utilization: GPU SM/memory utilization, CPU utilization, saturation indicators.
- Cost and efficiency: cost/request, tokens per dollar, energy per token.
- Stability: error rate, OOM/restart rate, autoscale convergence time.
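A minimal way to compute two of these KPIs from per-request records, assuming each record carries its tier, queue time, and inference time; the tiers and SLO values shown are illustrative.

```python
# Sketch: per-tier SLO attainment and queue-time share of end-to-end latency.
from collections import defaultdict

requests = [
    # (tier, queue_ms, inference_ms)
    ("premium", 12, 180), ("premium", 95, 210), ("standard", 300, 450),
    ("standard", 40, 380), ("premium", 8, 150), ("standard", 520, 600),
]
slo_ms = {"premium": 300, "standard": 1200}

totals = defaultdict(lambda: {"n": 0, "met": 0, "queue": 0.0, "e2e": 0.0})
for tier, q, inf in requests:
    t = totals[tier]
    e2e = q + inf
    t["n"] += 1
    t["met"] += e2e <= slo_ms[tier]
    t["queue"] += q
    t["e2e"] += e2e

for tier, t in totals.items():
    print(f"{tier}: SLO attainment {t['met'] / t['n']:.0%}, "
          f"queue share {t['queue'] / t['e2e']:.0%} of end-to-end latency")
```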
Token / Resource Usage
- Budget input/output tokens per tier; enforce max context and apply compression/summarization.
- Track KV cache growth vs. batch size and sequence length; reserve headroom to prevent OOM.
- Use continuous/dynamic batching to improve tokens/s while honoring per-request deadlines.
- Prefer streaming for tight SLOs; truncate long generations or fall back to a smaller model under budget pressure (a budget-guard sketch follows this list).
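A budget-guard sketch tying these points together, under assumed tier limits and placeholder model names rather than real endpoints:

```python
# Illustrative token-budget guard: cap output tokens per tier and fall back to a
# cheaper model when the remaining context budget is tight.
TIER_LIMITS = {"premium":  {"max_output": 2048, "model": "large-model"},
               "standard": {"max_output": 512,  "model": "small-model"}}

def plan_generation(tier: str, prompt_tokens: int, context_window: int = 8192,
                    kv_headroom_tokens: int = 512) -> dict:
    """Pick the model and output-token cap for a request, leaving headroom so
    the KV cache never overruns the context window."""
    cfg = TIER_LIMITS[tier]
    budget = context_window - prompt_tokens - kv_headroom_tokens
    if budget <= 0:
        raise ValueError("prompt exceeds context budget; compress or summarize input")
    max_output = min(cfg["max_output"], budget)
    # Under budget pressure (little room left), fall back to the cheaper model.
    model = cfg["model"] if budget >= 256 else TIER_LIMITS["standard"]["model"]
    return {"model": model, "max_output_tokens": max_output, "stream": True}

print(plan_generation("premium", prompt_tokens=1000))  # ample budget: large model, full cap
print(plan_generation("premium", prompt_tokens=7600))  # tight budget: down-tiered, capped output
```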
Best Use Cases
- Multi-tenant LLM APIs with tiered latency/cost SLOs and bursty demand.
- Real-time assistants and chat where TTFT and p95 latency are critical.
- Batch/stream inference services in cloud and Kubernetes with autoscaling.
- Edge/mobile deployments balancing battery, thermal limits, and offline modes.
- Autonomous systems requiring deadline-aware perception/planning pipelines.
References & Further Reading
Tools & Libraries
- NVIDIA Triton Inference Server, vLLM, Ray Serve, TorchServe
- Prometheus / Grafana, NVIDIA DCGM for telemetry
- Kubernetes, Istio for traffic shaping