Mnemosyne: LLM Checkpoint Recovery (LCR)

Lightweight device proxy architecture for LLM fault recovery with just-in-time checkpointing and partial topology reconstruction

Complexity: High | Fault Tolerance Infrastructure

🎯 30-Second Overview

Pattern: Device proxy architecture with just-in-time checkpointing for LLM fault recovery

Why: Reduces recovery overhead by 58.8% vs traditional approaches, enables partial topology reconstruction

Key Insight: Lightweight device proxies + a flexible collective communication library (CCL) + incremental communication reinitialization = 3.6% daily overhead

⚡ Quick Implementation

1. Set up proxy: Deploy the device proxy layer for error interception
2. Configure CCL: Initialize the flexible collective communication library
3. Enable JIT: Activate just-in-time checkpointing triggers
4. Partial recovery: Implement incremental topology reconstruction
5. Monitor: Track failure patterns and recovery times

Example (see the sketch below): device_proxy → failure_detection → checkpoint_trigger → partial_reconstruction → resume_training
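
The sketch below illustrates the proxy → detect → checkpoint part of that flow in PyTorch-style Python. The names (DeviceProxy, checkpoint_fn) are illustrative placeholders, not the Mnemosyne API; treat it as a minimal sketch of the idea, not an implementation.

```python
# Minimal sketch of the device-proxy flow: intercept a device/collective error
# on a rank and fire a just-in-time checkpoint before handing off to recovery.
# Names (DeviceProxy, checkpoint_fn) are illustrative, not the Mnemosyne API.
import torch


class DeviceProxy:
    """Thin per-rank wrapper that intercepts failures during a training step."""

    def __init__(self, model, optimizer, checkpoint_fn):
        self.model = model
        self.optimizer = optimizer
        self.checkpoint_fn = checkpoint_fn  # invoked just-in-time on failure

    def step(self, batch, loss_fn):
        try:
            loss = loss_fn(self.model(batch["x"]), batch["y"])
            loss.backward()
            self.optimizer.step()
            self.optimizer.zero_grad()
            return loss.item()
        except RuntimeError:
            # NCCL/CUDA failures typically surface as RuntimeError. Snapshot
            # whatever state is still reachable (in practice, healthy ranks
            # checkpoint when a peer fails), then let the recovery layer
            # rebuild the topology and resume.
            self.checkpoint_fn(self.model, self.optimizer)
            raise
```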

📋 Do's & Don'ts

✅ Use lightweight device proxies optimized for fault tolerance
✅ Implement partial topology reconstruction around failed nodes
✅ Cache gradient states and optimizer checkpoints separately
✅ Use incremental communication reinitialization rather than a full restart (see the sketch after this list)
✅ Monitor memory usage patterns to predict failures
❌ Rely on elastic training features for pure fault tolerance
❌ Perform global communication reinitialization on single-node failures
❌ Store massive checkpoints to slow storage during training
❌ Ignore the temporal dominance of communication overhead
❌ Use generic checkpointing for 70B+ parameter models

🚦 When to Use

Use When

  • Distributed LLM training (7B+ parameters)
  • Multi-week training cycles
  • High failure rate environments
  • Limited checkpoint storage bandwidth

Avoid When

  • Small model training (<1B parameters)
  • Single-node deployments
  • Stable hardware environments
  • Short training cycles (<24h)

📊 Key Metrics

  • Recovery Time: seconds to resume training
  • Overhead: % of daily training time lost (see the sketch below)
  • Checkpoint Size: GB per model snapshot
  • Failure Detection: time to detect a node failure
  • Memory Utilization: % of peak memory preserved
  • Communication Cost: bandwidth used for reconstruction
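
For concreteness, here is a hypothetical helper that derives the overhead metric from recovery time and failure frequency. The formula is the obvious ratio; the example numbers are illustrative, not figures from Mnemosyne.

```python
# Hypothetical metrics helper: derives daily overhead from recovery time and
# failure frequency. Example values are illustrative, not measured results.
from dataclasses import dataclass


@dataclass
class RecoveryMetrics:
    detection_s: float    # Failure Detection: time to notice the failed node
    recovery_s: float     # Recovery Time: detection + reconstruction + resume
    checkpoint_gb: float  # Checkpoint Size: GB per model snapshot

    def daily_overhead_pct(self, failures_per_day: float) -> float:
        """Overhead: % of a 24h training day lost to recovery."""
        lost_s = failures_per_day * self.recovery_s
        return 100.0 * lost_s / (24 * 3600)


# Example: 2 failures/day at ~650 s each -> roughly 1.5% of the day lost.
m = RecoveryMetrics(detection_s=30.0, recovery_s=650.0, checkpoint_gb=140.0)
print(f"{m.daily_overhead_pct(failures_per_day=2):.2f}% daily overhead")
```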

💡 Top Use Cases

LLM Pre-training: 70B+ parameter models with week-long training cycles
Distributed Fine-tuning: Multi-node adaptation with frequent hardware failures
MoE Training: Sparse mixture-of-experts with massive checkpoint sizes
Hybrid Parallelism: Data/tensor/pipeline parallel combinations with complex failure modes
Cloud Training: Spot instance training with predictable interruptions
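
For the spot-instance case, the predictable interruption itself can serve as a just-in-time checkpoint trigger. The sketch below assumes the preemption notice reaches the training process as SIGTERM (delivery varies by cloud provider) and that save_checkpoint is your own checkpoint routine.

```python
# Sketch: treat a spot/preemptible termination notice (assumed here to arrive
# as SIGTERM) as a just-in-time checkpoint trigger. save_checkpoint is a
# placeholder for your own checkpoint logic.
import signal

preempted = False


def _on_sigterm(signum, frame):
    global preempted
    preempted = True  # defer the checkpoint to a safe point in the loop


signal.signal(signal.SIGTERM, _on_sigterm)


def training_loop(num_steps, run_step, save_checkpoint):
    for step in range(num_steps):
        run_step(step)
        if preempted:              # checkpoint between steps, not mid-kernel
            save_checkpoint(step)
            break
```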
