Fine-Tuning Cheatsheet
Comprehensive quick reference for fine-tuning LLMs with state-of-the-art techniques, parameters, and best practices for 2025.
Data Requirements
Dataset Size Guidelines
- • Minimum: 100-1000 examples
- • Optimal: 1000-10000 examples
- • Quality > Quantity: 50 high-quality examples can outperform 1000s
- • NLFT breakthrough: 219% improvement with 50 examples
Data Quality Checklist
- • Diverse, representative samples
- • Consistent formatting
- • Proper deduplication
- • Domain-specific relevance
- • Error-free labeling
Data Preprocessing
- • Tokenization normalization
- • Quality filtering pipeline
- • Prompt structure consistency
- • Train/validation split (80/20)
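A minimal sketch of the 80/20 split above, assuming the Hugging Face datasets library and a placeholder JSON file name:
from datasets import load_dataset

# Load a local JSON dataset (file name is a placeholder)
dataset = load_dataset("json", data_files="your_dataset.json", split="train")
splits = dataset.train_test_split(test_size=0.2, seed=42)  # 80% train / 20% validation
train_ds, val_ds = splits["train"], splits["test"]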
Key Parameters
Core Training Parameters
- • Learning Rate: 1e-5 to 5e-4
- • Batch Size: 1-16 (memory dependent)
- • Epochs: 1-5 (avoid overfitting)
- • Warmup Steps: 10% of total steps
- • Weight Decay: 0.01-0.1
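The ranges above map onto Hugging Face TrainingArguments roughly as follows; this is a sketch with mid-range placeholder values, not tuned settings:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    learning_rate=2e-4,                 # within the 1e-5 to 5e-4 range (upper end suits LoRA)
    per_device_train_batch_size=4,      # 1-16 depending on available memory
    num_train_epochs=3,                 # 1-5; fewer epochs reduce overfitting risk
    warmup_ratio=0.1,                   # ~10% of total steps used as warmup
    weight_decay=0.01,                  # 0.01-0.1
    lr_scheduler_type="cosine",         # cosine annealing, as recommended under Best Practices
)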
LoRA Parameters
- • LoRA Rank (r): 8-64
- • LoRA Alpha: 16-32
- • LoRA Dropout: 0.1
- • Target Modules: q_proj, v_proj, k_proj, o_proj
- • Use RSLoRA: True (better stability)
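These settings expressed as a PEFT LoraConfig, as a sketch; use_rslora requires a recent PEFT release and enables rank-stabilized scaling (alpha / sqrt(r)):
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                               # rank, typically 8-64
    lora_alpha=32,                      # scaling factor, typically 16-32
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_rslora=True,                    # rank-stabilized LoRA for better stability
    task_type="CAUSAL_LM",
)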
QLoRA Quantization
- • 4-bit NormalFloat (NF4): Information-theoretically optimal for normally distributed weights
- • Double Quantization: True
- • Compute Type: bfloat16
- • Paged Optimizers: AdamW 8-bit
Memory Optimization
Memory Efficiency Techniques
- • Gradient Checkpointing: Enable
- • Mixed Precision (FP16): Enable
- • QLoRA 4-bit: Up to 75% memory reduction
- • Gradient Accumulation: 4-8 steps
- • DeepSpeed ZeRO: Stage 2 or 3
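A sketch of the switches above expressed as TrainingArguments; the values are illustrative, and the paged 8-bit optimizer corresponds to the QLoRA list earlier:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    gradient_checkpointing=True,        # trade extra compute for activation memory
    fp16=True,                          # mixed precision (use bf16=True on Ampere or newer GPUs)
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,      # effective batch size = 2 x 8 = 16
    optim="paged_adamw_8bit",           # paged 8-bit AdamW for QLoRA-style setups
)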
Hardware Requirements
- • 7B Model (QLoRA): 12-16GB VRAM
- • 13B Model (QLoRA): 24GB VRAM
- • 70B Model (QLoRA): 48GB VRAM
- • Full Fine-tuning 7B: 60-80GB VRAM
Cost Optimization
- • LoRA: ~$13 vs. ~$322 for full fine-tuning
- • Train only 0.19-1.16% of parameters
- • Use model merging for combining capabilities (see the adapter-merge sketch after this list)
- • Spot instances for non-critical training
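One minimal form of merging, sketched below with placeholder paths, is folding a trained LoRA adapter back into the base weights with PEFT so the merged model can be served without adapter overhead; combining several specialized models (e.g. with mergekit) is a separate workflow.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "./lora-adapter")   # adapter path is a placeholder
merged = model.merge_and_unload()                           # returns a plain transformers model
merged.save_pretrained("./merged-model")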
Best Practices
Training Strategy
- • Start with a strong base model (Llama 2/3, Mistral)
- • Use validation sets to monitor overfitting
- • Implement early stopping (patience: 2-3; see the sketch after this list)
- • Save checkpoints regularly
- • Monitor loss curves and metrics continuously
- • Use learning rate scheduling (cosine annealing)
- • Preprocess data consistently
- • Test on held-out data
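A sketch of the early-stopping and cosine-scheduling items above using the Trainer API; model, train_ds, and val_ds are assumed from the earlier sketches:
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./checkpoints",
    evaluation_strategy="epoch",        # evaluate once per epoch (eval_strategy in newer releases)
    save_strategy="epoch",
    load_best_model_at_end=True,        # roll back to the best checkpoint
    metric_for_best_model="eval_loss",
    lr_scheduler_type="cosine",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)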
Production Ready
- • Version control datasets and models
- • Document hyperparameters and results
- • Use experiment tracking (MLflow, W&B; sketch after this list)
- • Implement proper evaluation metrics
- • Test edge cases and failure modes
- • Set up monitoring for model drift
- • Plan rollback strategies
- • Secure sensitive training data
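A minimal experiment-tracking sketch with MLflow; the run name, parameters, and metric values are placeholders for whatever your training loop produces:
import mlflow

with mlflow.start_run(run_name="llama2-lora-v1"):
    mlflow.log_params({"learning_rate": 2e-4, "lora_r": 16, "epochs": 3})
    mlflow.log_metric("eval_loss", 1.23)
    mlflow.log_artifact("axolotl.yml")   # keep the exact config alongside the run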
Common Pitfalls
Training Issues
- • Setting the learning rate too high (causes instability)
- • Training for too many epochs (overfitting)
- • Insufficient or biased training data
- • Ignoring data preprocessing quality
- • Not monitoring for overfitting
- • Inconsistent evaluation metrics
- • Forgetting to set random seeds (quick fix sketched after this list)
- • Not testing edge cases
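For the seeding pitfall above, transformers provides a one-line fix that seeds Python, NumPy, and PyTorch together:
from transformers import set_seed

set_seed(42)   # make training runs reproducible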
Production Failures
- • Catastrophic forgetting (losing base capabilities)
- • Model drift without monitoring
- • Security vulnerabilities (data poisoning)
- • Inadequate failure recovery plans
- • Poor documentation of changes
- • Missing evaluation on diverse test sets
- • Ignoring bias amplification
- • No performance degradation tracking
Framework Quick Commands
Hugging Face PEFT + LoRA
from peft import LoraConfig, get_peft_model

# base_model: a causal LM already loaded, e.g. with AutoModelForCausalLM
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, lora_config)
QLoRA with BitsAndBytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
# model_id: the Hub ID or local path of the base model
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config
)
Axolotl Configuration
# axolotl.yml
base_model: meta-llama/Llama-2-7b-hf
model_type: LlamaForCausalLM
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.1
load_in_4bit: true
strict: false
datasets:
  - path: your_dataset.json
    type: alpaca
Unsloth (2-5x Faster)
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-2-7b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,          # auto-detects bfloat16/float16 for the GPU
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"]
)
Quick Decision Tree
When to Fine-Tune
- • Domain-specific knowledge needed
- • Consistent tone/style required
- • High-quality labeled data available
- • Complex reasoning improvements needed
- • Cost-effective vs API calls
Consider Alternatives
- • RAG for knowledge integration
- • Prompt engineering for behavior
- • Few-shot learning for examples
- • Tool use for external capabilities
- • Ensemble methods for robustness
Avoid Fine-Tuning If
- • Data quality is poor
- • Dataset is too small (<100 examples)
- • Base model already performs well
- • Requirements change frequently
- • Limited computational resources
Monitoring & Evaluation
Key Metrics to Track
- • Training Loss: Should decrease consistently
- • Validation Loss: Watch for overfitting gap
- • Perplexity: Lower is better for language tasks (the exponential of the eval loss; see the sketch after this list)
- • BLEU/ROUGE: For generation quality
- • Accuracy: For classification tasks
- • F1 Score: For balanced evaluation
- • Inference Speed: Latency requirements
- • Memory Usage: Resource constraints
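Perplexity is the exponential of the mean cross-entropy loss, so it can be read straight off the Trainer's evaluation output; a sketch assuming the trainer from the earlier sketches:
import math

eval_metrics = trainer.evaluate()                 # returns a dict including "eval_loss"
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"perplexity: {perplexity:.2f}")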
Production Monitoring
- • Model Drift: Performance degradation over time
- • Data Drift: Input distribution changes
- • Bias Detection: Fairness across groups
- • Error Analysis: Failure pattern tracking
- • Resource Usage: GPU/CPU/memory utilization
- • Response Quality: Human evaluation scores
- • User Feedback: Satisfaction metrics
- • Security Alerts: Adversarial inputs
Key Research Sources
• arXiv:2408.13296: "The Ultimate Guide to Fine-Tuning LLMs"
• arXiv:2305.14314: "QLoRA: Efficient Finetuning of Quantized LLMs"
• Hugging Face PEFT Documentation: Parameter-efficient methods
• LoRA Paper: Low-Rank Adaptation of Large Language Models
• DeepSpeed Documentation: Memory optimization techniques
• MLflow Tutorials: Experiment tracking and model management
• Axolotl Framework: Production fine-tuning workflows
• Unsloth Research: 2-5x faster training optimizations