Agentic Design Patterns

🧠 Memory Optimization (MO)

Efficiently manages memory usage through caching, compression, and garbage collection strategies

Complexity: High

Core Mechanism

Memory Optimization for LLMs reduces peak and steady-state memory by combining quantization (weights/KV cache), efficient attention (paged KV cache), activation/gradient checkpointing, sharding/offloading, and context compression. The goal is to fit larger models and longer sequences on limited hardware while maintaining throughput, latency, and accuracy targets.
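
As a concrete illustration, below is a minimal serving sketch that combines 4-bit (AWQ) weight quantization with vLLM's paged KV cache, a bounded context window, and capped GPU memory share. The model id is a placeholder and parameter names assume a recent vLLM version; treat this as a sketch, not a canonical configuration.

    # Serving sketch: quantized weights + paged KV cache (vLLM).
    # Assumes a vLLM install and an AWQ-quantized checkpoint (placeholder id).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder AWQ checkpoint
        quantization="awq",             # weight quantization scheme
        max_model_len=8192,             # cap context to bound KV-cache growth
        gpu_memory_utilization=0.90,    # leave headroom against OOM/fragmentation
        enable_prefix_caching=True,     # reuse KV blocks for shared prompt prefixes
    )

    params = SamplingParams(max_tokens=256, temperature=0.2)
    outputs = llm.generate(["Summarize the benefits of paged KV caches."], params)
    print(outputs[0].outputs[0].text)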

Workflow / Steps

  1. Profile memory: weights, optimizer states, activations, and KV cache growth vs. sequence length and batch size (a profiling sketch follows this list).
  2. Right-size precision: apply weight quantization (e.g., 8-bit/4-bit) and mixed precision for compute.
  3. Optimize attention memory: enable paged KV cache and sliding windows; cap max generation/context.
  4. Shard and offload: use ZeRO/FSDP, tensor/pipeline parallelism; offload weights/optimizer/KV to CPU/NVMe when needed.
  5. Reduce activation pressure: gradient checkpointing, recomputation, micro-batching.
  6. Compress context: deduplicate, rerank, summarize; prefer retrieval of spans over full documents.
  7. Validate accuracy and latency; tune batch size, draft length (if speculative), and cache policies.
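
To make step 1 concrete, here is a minimal profiling sketch using PyTorch's CUDA memory counters. `generate(seq_len)` is a hypothetical wrapper around your own model or generation call, and the sequence lengths are arbitrary.

    # Peak-memory sweep to observe KV-cache growth vs. context length.
    # Assumes PyTorch with a CUDA device available.
    import torch

    def profile_peak_memory(step_fn, label: str) -> float:
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        step_fn()                                              # run one workload step
        torch.cuda.synchronize()
        peak = torch.cuda.max_memory_allocated() / 2**30       # GiB actually used
        reserved = torch.cuda.memory_reserved() / 2**30        # GiB held by the allocator
        print(f"{label}: peak={peak:.2f} GiB, reserved={reserved:.2f} GiB")
        return peak

    # Example sweep (hypothetical `generate` wrapper around your model call):
    # for seq_len in (1024, 4096, 16384):
    #     profile_peak_memory(lambda: generate(seq_len), f"ctx={seq_len}")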

Best Practices

  • Use paged KV cache for near-zero memory waste and stable throughput at long sequence lengths.
  • Prefer 4-bit/8-bit quantization for weights; test AWQ/GPTQ for inference and QLoRA for fine-tuning (see the loading sketch after this list).
  • Quantize the KV cache where supported (TensorRT-LLM, vLLM options) when the quality impact is acceptable.
  • Enable gradient checkpointing and activation recomputation for training/fine-tuning.
  • Adopt FSDP/ZeRO to shard weights, gradients, and optimizer state; overlap communication with compute.
  • Bound sequence lengths; apply sliding windows and summaries to control context growth.
  • Instrument GPU memory, fragmentation, and cache hit rates; alert on OOM and page thrash.
  • Store large blobs outside messages; pass references; stream outputs for tight latency budgets.
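
The sketch below combines two of these practices, 4-bit (QLoRA-style) weight loading and gradient checkpointing, using Hugging Face transformers with bitsandbytes. The model id is a placeholder and the flags assume recent library versions.

    # 4-bit NF4 weight loading (QLoRA-style) plus gradient checkpointing.
    # Assumes transformers + bitsandbytes are installed; placeholder model id.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",               # normal-float 4-bit quantization
        bnb_4bit_compute_dtype=torch.bfloat16,   # mixed-precision compute
        bnb_4bit_use_double_quant=True,          # also quantize quantization constants
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B",               # placeholder model id
        quantization_config=bnb_config,
        device_map="auto",                       # place/offload layers automatically
    )

    model.gradient_checkpointing_enable()        # trade recompute for activation memory
    model.config.use_cache = False               # KV cache is not needed during training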

When NOT to Use

  • Ultra-high-accuracy scenarios where aggressive quantization harms quality beyond acceptable thresholds.
  • Hard real-time systems where recomputation/offloading introduces unacceptable latency jitter.
  • Simple/small models already fitting with ample headroom, where added complexity yields little benefit.

Common Pitfalls

  • Ignoring KV cache growth with long generations → OOM despite small batch sizes.
  • Applying quantization without calibration/evaluation โ†’ silent accuracy regression.
  • Over-aggressive offloading causing PCIe/NVMe bottlenecks and latency spikes.
  • Fragmentation from frequent alloc/free of variable-length KV blocks without paging.
  • Under-instrumented systems: no visibility into per-request memory budgets or cache hit/miss (a minimal logging sketch follows this list).
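
One low-effort way to address the last pitfall is to log allocator statistics per request or per batch. The sketch below uses PyTorch's memory_stats counters; the allocated/reserved gap is only a rough proxy for fragmentation.

    # Lightweight GPU-memory health logging (PyTorch allocator statistics).
    import torch

    def log_gpu_memory_health(tag: str) -> None:
        stats = torch.cuda.memory_stats()
        allocated = stats["allocated_bytes.all.current"] / 2**30
        reserved = stats["reserved_bytes.all.current"] / 2**30
        print(f"[{tag}] allocated={allocated:.2f} GiB "
              f"reserved={reserved:.2f} GiB "
              f"gap~{reserved - allocated:.2f} GiB "
              f"ooms={stats['num_ooms']}")

    # Call per request/batch, e.g. log_gpu_memory_health("req-1234"),
    # and alert when the gap or the OOM count keeps growing.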

Key Features

  • Paged KV cache with block-level reuse and eviction
  • Weight and KV quantization (8-bit / 4-bit options)
  • Activation/gradient checkpointing and recomputation
  • Model sharding: tensor, pipeline, and ZeRO/FSDP
  • Context compression and sliding-window attention
  • CPU/NVMe offload with prefetch and overlap

KPIs / Success Metrics

  • Peak and average GPU memory usage; fragmentation and headroom.
  • Tokens per second at target context lengths; p50/p95 TTFT and latency.
  • OOM/restart rate; cache hit rate; paging/eviction rate.
  • Quality delta vs. FP16/BF16 baselines after quantization/compression.
  • Cost efficiency: tokens per dollar; energy per token.

Token / Resource Usage

  • Budget max input and output tokens; enforce sliding windows and truncation policies.
  • Track KV cache bytes/token (a sizing estimate follows this list); prefer paged allocations to avoid large contiguous blocks.
  • Use continuous batching and context packing to raise utilization with bounded memory.
  • Favor streaming responses to reduce peak memory and improve perceived latency.
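
For the KV-cache budgeting item above, a back-of-the-envelope estimate is usually enough. The shape below is Llama-3.1-8B-like (32 layers, 8 KV heads via GQA, head dimension 128) and is used purely as an illustration.

    # Approximate KV-cache footprint: 2 tensors (K and V) per layer per token.
    def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                                 head_dim: int, dtype_bytes: int = 2) -> int:
        return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

    # Llama-3.1-8B-like shape with an FP16 cache: 128 KiB per token,
    # ~15.6 GiB for a single 128k-token sequence.
    per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
    print(per_token / 2**10, "KiB per token")
    print(per_token * 128_000 / 2**30, "GiB for a 128k-token context")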

Best Use Cases

  • LLM serving with long contexts and high concurrency on limited-GPU fleets.
  • On-prem/edge deployments with strict VRAM limits requiring quantization and paging.
  • Fine-tuning/LoRA on consumer GPUs using QLoRA, checkpointing, and micro-batching.
  • Multi-tenant platforms balancing cost, quality, and latency via memory-aware scheduling.
