Inference Libraries & Frameworks

Essential tools and libraries for implementing AI inference in your applications, from low-level optimization libraries to high-level serving frameworks.

llama.cpp

C++ implementation of LLaMA inference with quantization support

Key Features
CPU optimized
Multiple quantization formats
Cross-platform
Memory efficient
Language: C++
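
A common way to drive llama.cpp from Python is the llama-cpp-python binding. The sketch below is a minimal example, assuming a quantized GGUF model file you have downloaded locally; the file path and generation settings are placeholders.

```python
# Minimal sketch using the llama-cpp-python binding to llama.cpp.
# The GGUF path is a placeholder; point it at any quantized model on disk.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # quantized GGUF file (placeholder)
    n_ctx=2048,   # context window size
    n_threads=8,  # CPU threads to use
)

output = llm(
    "Explain KV caching in one sentence.",
    max_tokens=64,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```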
Ollama

Easy-to-use local model serving built on llama.cpp

Key Features
Simple API
Model library
Docker support
REST API
Language: Go
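
Ollama exposes a local REST API, by default on port 11434. The sketch below assumes the server is running and a model has already been pulled (for example with `ollama pull llama3`); the model name is only an example.

```python
# Minimal sketch calling a locally running Ollama server over its REST API.
# Assumes `ollama pull llama3` has been run; swap in any model from your library.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Why is the sky blue?",
        "stream": False,  # return a single JSON object instead of a stream of chunks
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```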
vLLM

Fast and easy-to-use library for LLM inference and serving

Key Features
PagedAttention
Continuous batching
GPU acceleration
OpenAI-compatible API
Language: Python
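
For offline batch inference, vLLM also provides a Python API (it can additionally be launched as an OpenAI-compatible server). The sketch below uses that offline API with a Hugging Face model id as a placeholder.

```python
# Minimal sketch of vLLM's offline (batch) inference API.
# The model id is a placeholder; any supported Hugging Face causal LM works.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # loads the model onto the GPU

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Prompts in a batch are scheduled together via continuous batching for throughput.
outputs = llm.generate(
    ["What is PagedAttention?", "Summarize continuous batching in one line."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```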
Text Generation Inference

Hugging Face's toolkit for deploying and serving LLMs

Key Features
Production-ready
Optimized kernels
Streaming
Multi-GPU
Language: Python/Rust
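
TGI typically runs as a Docker container exposing an HTTP endpoint, which clients such as huggingface_hub can query. The sketch below assumes a server already listening on localhost:8080; adjust the URL to however you mapped the container's port.

```python
# Minimal sketch querying a running Text Generation Inference server.
# Assumes TGI was started (e.g. via its Docker image) and mapped to localhost:8080.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Non-streaming request
print(client.text_generation("What is speculative decoding?", max_new_tokens=64))

# Streaming request: tokens are yielded as the server generates them
for token in client.text_generation(
    "Explain tensor parallelism briefly.",
    max_new_tokens=64,
    stream=True,
):
    print(token, end="", flush=True)
```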

Choosing the Right Library

For Local Development
  • Ollama - Easiest setup and use
  • llama.cpp - Maximum control and optimization
  • LM Studio - GUI for beginners

For Production Serving
  • vLLM - High throughput, GPU optimization
  • TGI - Enterprise features, scalability
  • Provider APIs - Managed solutions

For Web Applications
  • WebLLM - Browser-based inference
  • BrowserAI - TypeScript support
  • Transformers.js - Hugging Face models
