Resource-Aware Scheduling (RAS)
Dynamically schedules AI tasks based on available computational resources and constraints
Core Mechanism
Resource-Aware Scheduling dynamically matches AI tasks with compute resources (CPU/GPU/TPU, memory, bandwidth) using live telemetry and SLO/cost constraints. It combines techniques like dynamic batching, queueing, priority and preemption, model placement, and autoscaling to meet latency/throughput targets while maximizing utilization and respecting power/thermal and budget limits.
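As a concrete illustration of this matching step, the Python sketch below picks a worker from live telemetry, subject to a memory-headroom constraint and a simple congestion score. The WorkerTelemetry and Task fields, the headroom default, and the scoring weight are illustrative assumptions, not the API of any particular scheduler.

```python
# Minimal sketch of a resource-aware placement decision (hypothetical data model;
# real schedulers pull these signals from telemetry such as DCGM/Prometheus).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class WorkerTelemetry:
    name: str
    gpu_util: float         # SM utilization, 0.0-1.0
    free_gpu_mem_gb: float  # free device memory in GiB
    queue_depth: int        # requests already waiting on this worker

@dataclass
class Task:
    est_mem_gb: float       # estimated weights + KV-cache footprint
    slo_ms: float           # end-to-end latency target

def place(task: Task, workers: List[WorkerTelemetry],
          mem_headroom_gb: float = 2.0) -> Optional[WorkerTelemetry]:
    """Pick a worker that fits the task's memory needs (plus headroom) and
    minimizes a simple congestion score; None means queue, shed, or scale out."""
    feasible = [w for w in workers
                if w.free_gpu_mem_gb >= task.est_mem_gb + mem_headroom_gb]
    if not feasible:
        return None  # admission control takes over upstream
    # Score blends queue depth and utilization; the weight is illustrative.
    return min(feasible, key=lambda w: w.queue_depth + 2.0 * w.gpu_util)

workers = [WorkerTelemetry("gpu-0", 0.92, 3.0, 8),
           WorkerTelemetry("gpu-1", 0.40, 12.0, 2)]
choice = place(Task(est_mem_gb=6.0, slo_ms=500), workers)
print(choice.name if choice else "no capacity")  # gpu-1
```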
Workflow / Steps
- Profile tasks and models: latency SLOs, batchability, memory footprints (KV cache/weights), throughput targets.
- Instrument resources: collect CPU/GPU utilization, memory, queue depth, temperature, power, and cost signals.
- Admission control & routing: classify request priority/tier; route to suitable model/instance/region.
- Scheduling & batching: form dynamic batches within SLO; apply priority, preemption, or isolation where needed.
- Autoscaling & placement: scale replicas by queue/latency; place models to satisfy memory/affinity constraints.
- Feedback control: monitor SLO attainment; adapt batch size/concurrency and adjust priorities/budgets (a control-loop sketch follows this list).
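The feedback-control step can be as simple as a hysteresis controller over recent latencies. The sketch below adjusts a dynamic-batching cap against a p95 latency target; the function name, thresholds, and step sizes are assumptions for illustration, not a serving framework's built-in API.

```python
import statistics

def adjust_max_batch(latency_samples_ms: list, slo_ms: float,
                     current_max_batch: int, lo: int = 1, hi: int = 64) -> int:
    """Hysteresis controller: shrink the batching cap when p95 latency exceeds
    the SLO, grow it cautiously when there is clear slack, otherwise hold."""
    if len(latency_samples_ms) < 2:
        return current_max_batch
    p95 = statistics.quantiles(latency_samples_ms, n=20)[18]  # ~95th percentile
    if p95 > slo_ms:                 # over budget: cut batching aggressively
        return max(lo, current_max_batch // 2)
    if p95 < 0.7 * slo_ms:           # comfortable slack: grow slowly
        return min(hi, current_max_batch + 1)
    return current_max_batch

# One control tick over recent end-to-end latencies (ms) against a 400 ms SLO.
recent = [180, 220, 260, 300, 350, 390, 410, 450, 470, 495, 505, 520]
print(adjust_max_batch(recent, slo_ms=400, current_max_batch=32))  # 16
```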
Best Practices
When NOT to Use
- Simple, homogeneous workloads where static round-robin meets SLOs cheaply.
- Hard real-time control loops where any batching/queuing overhead risks deadline misses.
- Single-tenant, overprovisioned environments with negligible contention and stable load.
Common Pitfalls
- Over-batching causing p95/p99 latency regressions; ignoring per-tier SLO budgets.
- Insufficient GPU memory accounting for KV cache/token growth, leading to OOMs and restarts (see the sizing sketch after this list).
- Hotspot queues and head-of-line blocking; lack of isolation between tiers/tenants.
- Reactive-only autoscaling with cold start penalties; no warm capacity for bursts.
- Poor observability: missing queue time vs. inference time breakdown; blind tuning.
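The memory-accounting pitfall is easiest to avoid with an explicit KV-cache estimate per batch. The sketch below uses the standard per-token formula (2 x layers x KV heads x head dim x dtype bytes); the example model dimensions are placeholders, so substitute your model's actual layer count, head count, and head dimension.

```python
def kv_cache_bytes(batch_size: int, seq_len: int, n_layers: int,
                   n_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Per-token KV cache = 2 (K and V) * layers * KV heads * head dim * dtype size."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return batch_size * seq_len * per_token

# E.g. a 32-layer model with 8 KV heads of dim 128 in fp16, batch 16, 4k context:
gib = kv_cache_bytes(16, 4096, 32, 8, 128, 2) / 2**30
print(f"KV cache ~= {gib:.1f} GiB")  # grows linearly with batch size and sequence length
```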
Key Features
KPIs / Success Metrics
- Latency: p50/p95 end-to-end latency and queue-time share; SLO attainment rate per tier (see the sketch after this list).
- Throughput: requests/s and tokens/s per instance and fleet; batch effectiveness.
- Utilization: GPU SM/memory utilization, CPU utilization, saturation indicators.
- Cost and efficiency: cost/request, tokens per dollar, energy per token.
- Stability: error rate, OOM/restart rate, autoscale convergence time.
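A minimal way to compute two of these KPIs from per-request records, assuming each record carries its tier, queue time, and inference time; the tiers and SLO values shown are illustrative.

```python
# Sketch: per-tier SLO attainment and queue-time share of end-to-end latency.
from collections import defaultdict

requests = [
    # (tier, queue_ms, inference_ms)
    ("premium", 12, 180), ("premium", 95, 210), ("standard", 300, 450),
    ("standard", 40, 380), ("premium", 8, 150), ("standard", 520, 600),
]
slo_ms = {"premium": 300, "standard": 1200}

totals = defaultdict(lambda: {"n": 0, "met": 0, "queue": 0.0, "e2e": 0.0})
for tier, q, inf in requests:
    t = totals[tier]
    e2e = q + inf
    t["n"] += 1
    t["met"] += e2e <= slo_ms[tier]
    t["queue"] += q
    t["e2e"] += e2e

for tier, t in totals.items():
    print(f"{tier}: SLO attainment {t['met'] / t['n']:.0%}, "
          f"queue share {t['queue'] / t['e2e']:.0%} of end-to-end latency")
```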
Token / Resource Usage
- Budget input/output tokens per tier; enforce max context and apply compression/summarization.
- Track KV cache growth vs. batch size and sequence length; reserve headroom to prevent OOM.
- Use continuous/dynamic batching to improve tokens/s while honoring per-request deadlines.
- Prefer streaming for tight SLOs; truncate long generations or fall back to a smaller model under budget pressure (a budget-guard sketch follows this list).
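A budget-guard sketch tying these points together, under assumed tier limits and placeholder model names rather than real endpoints:

```python
# Illustrative token-budget guard: cap output tokens per tier and fall back to a
# cheaper model when the remaining context budget is tight.
TIER_LIMITS = {"premium":  {"max_output": 2048, "model": "large-model"},
               "standard": {"max_output": 512,  "model": "small-model"}}

def plan_generation(tier: str, prompt_tokens: int, context_window: int = 8192,
                    kv_headroom_tokens: int = 512) -> dict:
    """Pick the model and output-token cap for a request, leaving headroom so
    the KV cache never overruns the context window."""
    cfg = TIER_LIMITS[tier]
    budget = context_window - prompt_tokens - kv_headroom_tokens
    if budget <= 0:
        raise ValueError("prompt exceeds context budget; compress or summarize input")
    max_output = min(cfg["max_output"], budget)
    # Under budget pressure (little room left), fall back to the cheaper model.
    model = cfg["model"] if budget >= 256 else TIER_LIMITS["standard"]["model"]
    return {"model": model, "max_output_tokens": max_output, "stream": True}

print(plan_generation("premium", prompt_tokens=1000))  # ample budget: large model, full cap
print(plan_generation("premium", prompt_tokens=7600))  # tight budget: down-tiered, capped output
```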
Best Use Cases
- Multi-tenant LLM APIs with tiered latency/cost SLOs and bursty demand.
- Real-time assistants and chat where TTFT and p95 latency are critical.
- Batch/stream inference services in cloud and Kubernetes with autoscaling.
- Edge/mobile deployments balancing battery, thermal limits, and offline modes.
- Autonomous systems requiring deadline-aware perception/planning pipelines.
References & Further Reading
Tools & Libraries
- NVIDIA Triton Inference Server, vLLM, Ray Serve, TorchServe
- Prometheus / Grafana, NVIDIA DCGM for telemetry
- Kubernetes, Istio for traffic shaping