LLM Checkpoint Recovery (Mnemosyne) (LCR)
Lightweight device proxy architecture for LLM fault recovery with just-in-time checkpointing and partial topology reconstruction
🎯 30-Second Overview
Pattern: Device proxy architecture with just-in-time checkpointing for LLM fault recovery
Why: Reduces recovery overhead by 58.8% versus traditional checkpoint-and-restart recovery, and enables partial topology reconstruction instead of restarting the whole job
Key Insight: Lightweight device proxies + flexible collective communication library (CCL) + incremental communication reinitialization = 3.6% daily overhead
⚡ Quick Implementation
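A minimal sketch of the idea, assuming a PyTorch-style distributed training job. `DeviceProxy`, `jit_checkpoint`, and `on_peer_failure` are hypothetical names used for illustration, not the Mnemosyne API; the only real library calls used are `torch.distributed.new_group` and `torch.save`.

```python
# Illustrative sketch only -- class and method names are hypothetical,
# not the Mnemosyne API. Assumes torch.distributed is already initialized
# (e.g. via torchrun) so that new_group() can be called collectively.
import os

import torch
import torch.distributed as dist


class DeviceProxy:
    """Lightweight per-rank proxy that owns its collective-communication
    group and can snapshot/rebuild state without restarting the job."""

    def __init__(self, rank: int, ckpt_dir: str = "/tmp/lcr_ckpt"):
        self.rank = rank
        self.ckpt_dir = ckpt_dir
        self.group = None  # current CCL group for this proxy

    def init_comm(self, ranks: list[int]) -> None:
        # Incremental communication reinitialization: (re)build a group
        # covering only `ranks` instead of tearing down the global
        # communicator. Must be called by all processes collectively.
        self.group = dist.new_group(ranks=ranks)

    def jit_checkpoint(self, model: torch.nn.Module, step: int) -> str:
        # Just-in-time checkpoint: written only when a fault is detected,
        # not on a fixed periodic schedule, keeping steady-state cost low.
        os.makedirs(self.ckpt_dir, exist_ok=True)
        path = os.path.join(self.ckpt_dir, f"rank{self.rank}_step{step}.pt")
        torch.save({"step": step, "model": model.state_dict()}, path)
        return path

    def on_peer_failure(self, model: torch.nn.Module, step: int,
                        surviving_ranks: list[int]) -> None:
        # 1) Snapshot local state just in time on the surviving ranks.
        self.jit_checkpoint(model, step)
        # 2) Partial topology reconstruction: re-form the group from the
        #    surviving ranks only; replacement ranks rejoin later and
        #    restore from these checkpoints.
        self.init_comm(surviving_ranks)
```

Because checkpoints are written only when a fault is detected and only the affected communication group is rebuilt, steady-state cost stays near the ~3.6% daily overhead quoted above rather than paying for frequent full-job checkpoints and restarts.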
📋 Do's & Don'ts
🚦 When to Use
Use When
- Distributed LLM training (7B+ parameters)
- Multi-week training cycles
- High failure rate environments
- Limited checkpoint storage bandwidth
Avoid When
- Small model training (<1B parameters)
- Single-node deployments
- Stable hardware environments
- Short training cycles (<24h)
📊 Key Metrics
💡 Top Use Cases
References & Further Reading
Deepen your understanding with these curated resources
Core Academic Research (2024)
Mnemosyne: Lightweight and Fast Error Recovery for LLM Training (Asia-Pacific Workshop on Networking 2024)
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey (July 2024)
Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing (August 2024)
MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training (2024)
Technical Foundations
Contribute to this collection
Know a great resource? Submit a pull request to add it.