📱

Edge AI Optimization (EAO)

Optimizes AI workflows for resource-constrained edge devices and mobile environments

Complexity: high

Core Mechanism

Optimize models and inference pipelines to run within edge device constraints (CPU/GPU/NPU, memory, battery, thermals) using compression (quantization, pruning, distillation), hardware-specific compilation, and system techniques (operator fusion, memory planning, scheduling). Targets include Android NNAPI, Apple Core ML/Metal, NVIDIA TensorRT (Jetson), Intel OpenVINO, ONNX Runtime Mobile, TensorFlow Lite, TVM, and specialized NPUs (Google Edge TPU, Arm Ethos‑U).
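
A minimal post-training INT8 quantization sketch using the TensorFlow Lite converter; `model` (a trained Keras network) and `calibration_batches()` (a generator of representative input data) are assumed placeholders, not part of this pattern's reference implementation.

```python
import tensorflow as tf

# Assumed: `model` is a trained Keras model and `calibration_batches()` yields
# a few hundred representative batches shaped like the training inputs.
def representative_dataset():
    for batch in calibration_batches():          # hypothetical generator
        yield [tf.cast(batch, tf.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full-integer kernels so accelerators (NNAPI / Edge TPU) are not bypassed.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_int8 = converter.convert()
with open("model_int8.tflite", "wb") as f:       # illustrative artifact name
    f.write(tflite_int8)
```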

Workflow / Steps

  1. Define device targets and budgets (p95 latency, energy per inference, RAM peak, binary size, accuracy floor).
  2. Choose an edge-suitable baseline (MobileNet/EfficientNet-Lite, MobileViT, tiny transformer/CRNN; quant-friendly layers).
  3. Compress the model:
    • Post-training quantization (INT8/FP16) with proper calibration; use QAT when PTQ drop is high.
    • Structured pruning/sparsity to reduce MACs; leverage hardware sparsity where available.
    • Knowledge distillation to transfer from a large teacher to a compact student.
  4. Convert/compile for target: export to ONNX/TFLite/Core ML; compile with TensorRT/OpenVINO/TVM or NNAPI/Core ML backends.
  5. Integrate runtime: ExecuTorch/TFLite/ONNX Runtime Mobile/Core ML; enable accelerators/delegates and preferred precisions.
  6. Tune pipeline: fuse ops, minimize copies, pin buffers, batch where safe, and align pre/post-processing with training.
  7. On-device evaluation: measure p50/p95 latency, energy, RAM peak, and accuracy deltas under thermal load (see the latency sketch after this list).
  8. Deploy with OTA, feature flags, telemetry, and safe fallbacks (degrade quality or offload when needed).
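
An illustrative sketch for steps 5 and 7: load a compiled artifact with ONNX Runtime, prefer an accelerator execution provider, and capture p50/p95 latency. The file name, input shape, and availability of the NNAPI provider (Android builds of ONNX Runtime only) are assumptions.

```python
import time
import numpy as np
import onnxruntime as ort

# Prefer the NNAPI accelerator when present; ONNX Runtime falls back to the CPU
# provider for unsupported ops, so check which providers are actually active.
sess = ort.InferenceSession(
    "model_int8.onnx",                                   # illustrative artifact name
    providers=["NnapiExecutionProvider", "CPUExecutionProvider"],
)
print("active providers:", sess.get_providers())

inp = sess.get_inputs()[0]
x = np.random.rand(1, 3, 224, 224).astype(np.float32)   # illustrative input shape

# Warm up (first runs include allocation/compilation), then measure tail latency.
for _ in range(10):
    sess.run(None, {inp.name: x})

latencies_ms = []
for _ in range(200):
    t0 = time.perf_counter()
    sess.run(None, {inp.name: x})
    latencies_ms.append((time.perf_counter() - t0) * 1000)

print(f"p50={np.percentile(latencies_ms, 50):.2f} ms  "
      f"p95={np.percentile(latencies_ms, 95):.2f} ms")
```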

Best Practices

Start with PTQ INT8/FP16; adopt QAT selectively to recover accuracy for sensitive tasks.
Validate operator/kernel support to avoid CPU fallbacks; align opset with accelerator capabilities.
Benchmark on real devices; capture tail latency and thermal throttling over sustained runs.
Minimize memory traffic: fused ops, in-place tensors, arena allocators, zero-copy pre/post-processing.
Use execution providers/delegates (NNAPI, Core ML, TensorRT, OpenVINO) and enable FP16/INT8 kernels.
Thermal-aware scheduling: cap concurrency, dynamically adjust resolution/frame-rate.
Keep data transforms identical between training and inference; validate numerics across toolchains (see the comparison sketch below).
Version compiled artifacts per device/OS; include rollback and A/B canaries for OTA updates.
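
A sketch of cross-toolchain numerics validation, assuming FP32 and INT8 TFLite artifacts plus a `validation_batches()` generator of single-sample float32 batches; all names are illustrative.

```python
import numpy as np
import tensorflow as tf

def make_runner(path):
    """Build a reusable runner for one .tflite artifact."""
    interp = tf.lite.Interpreter(model_path=path)
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    out = interp.get_output_details()[0]

    def run(x):
        if inp["dtype"] == np.int8:                      # quantize input if required
            scale, zp = inp["quantization"]
            x = np.clip(np.round(x / scale + zp), -128, 127).astype(np.int8)
        interp.set_tensor(inp["index"], x)
        interp.invoke()
        y = interp.get_tensor(out["index"]).astype(np.float32)
        if out["dtype"] == np.int8:                      # dequantize output for comparison
            scale, zp = out["quantization"]
            y = (y - zp) * scale
        return y
    return run

run_fp32 = make_runner("model_fp32.tflite")              # illustrative artifact names
run_int8 = make_runner("model_int8.tflite")

agree, max_err, n = 0, 0.0, 0
for x in validation_batches():                           # hypothetical float32 batches (batch size 1)
    ref, qnt = run_fp32(x), run_int8(x)
    agree += int(ref.argmax() == qnt.argmax())
    max_err = max(max_err, float(np.abs(ref - qnt).max()))
    n += 1
print(f"top-1 agreement: {agree / n:.3f}, max abs error: {max_err:.4f}")
```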

When NOT to Use

  • Accuracy floors cannot be achieved post-compression within latency/power budgets.
  • Frequent large model updates exceed feasible OTA bandwidth or device storage.
  • Deterministic/precision-critical workloads unsupported by mobile accelerators.
  • Hard real-time constraints beyond device capability without unacceptable thermal impact.

Common Pitfalls

  • Non-representative INT8 calibration → large accuracy loss in production.
  • Silent CPU fallbacks due to unsupported ops → severe latency regressions.
  • Layout/precision mismatches between stages → extra copies and fragmentation.
  • Ignoring big.LITTLE scheduling and thread affinity → jitter under load.
  • Short synthetic benchmarks masking thermal throttling and GC pauses.

Key Features

On-device, low-latency inference with offline capability
INT8/FP16 quantization, pruning/sparsity, and knowledge distillation
Accelerated execution via NNAPI/Core ML/Metal, TensorRT, OpenVINO, TVM
Memory- and power-aware scheduling and buffer planning
Adaptive quality: early-exit, conditional compute, resolution/frame-rate scaling (see the gating sketch below)
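
A minimal sketch of conditional compute (adaptive quality): call the heavy model only when a lightweight model is not confident. The threshold and the `run_small`/`run_large` callables are hypothetical.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.85   # hypothetical; tune per task and device budget

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def classify(frame, run_small, run_large):
    """Conditional compute: invoke the heavy model only on low-confidence frames."""
    probs = softmax(run_small(frame))            # cheap student model
    if probs.max() >= CONFIDENCE_THRESHOLD:
        return int(probs.argmax()), "small"
    probs = softmax(run_large(frame))            # heavier model or offload path
    return int(probs.argmax()), "large"
```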

KPIs / Success Metrics

  • Latency p50/p95 and throughput (FPS/inferences/s).
  • Energy per inference (mJ) and average power (mW); thermal headroom.
  • Peak RAM and model binary size; storage footprint.
  • Accuracy delta vs. FP32 baseline on representative datasets.
  • Rate of accelerator coverage vs. CPU fallback; offline success rate.

Token / Resource Usage

Prioritize compute, memory, and energy budgets. For non-LLM models, focus on MACs and bandwidth; for on-device LLMs, track context length, KV-cache footprint, precision (fp16/int8), and batch effects (see the footprint estimate below).

  • Enable INT8/FP16 kernels and fused ops to reduce memory traffic.
  • Use lightweight gating to avoid invoking heavy models unnecessarily.
  • Adapt concurrency/frame-rate to maintain thermal and battery limits.
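
For on-device LLMs, the KV-cache footprint can be estimated directly from the architecture; a back-of-the-envelope sketch with illustrative hyperparameters (not tied to a specific model):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Keys + values cached for every layer and token: 2 * layers * kv_heads * head_dim * tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Illustrative small model: 22 layers, 4 KV heads (GQA), head_dim 64, fp16 cache.
mib = kv_cache_bytes(22, 4, 64, seq_len=4096) / 2**20
print(f"KV cache at 4k context: {mib:.0f} MiB")   # ~88 MiB
```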

Best Use Cases

✅ Smart cameras and video analytics (detection, tracking, analytics)
✅ On-device speech/keyword spotting and offline ASR/TTS
✅ Mobile AR and real-time segmentation/classification
✅ Industrial predictive maintenance and anomaly detection
✅ Wearables health monitoring with privacy preservation
✅ Retail edge analytics in bandwidth-constrained sites
