AI Inference Guide
Code Examples
Getting Started Examples
Practical code examples to help you get started with different inference approaches. Copy and modify these examples for your own projects.
WebLLM Browser Example
import { CreateMLCEngine } from "@mlc-ai/web-llm";
// Initialize the engine
const engine = await CreateMLCEngine(
  "Llama-3.2-1B-Instruct-q4f32_1-MLC",
  { 
    initProgressCallback: (progress) => {
      // progress.progress is a fraction between 0 and 1
      console.log('Loading:', Math.round(progress.progress * 100) + '%');
    }
  }
);
// Generate text
const response = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Hello! How are you?" }
  ],
  temperature: 0.8,
  max_tokens: 100
});
console.log(response.choices[0].message.content);
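
WebLLM mirrors the OpenAI streaming interface as well. A minimal sketch, reusing the engine created above; with stream: true the call returns an async iterable of delta chunks:
// Stream tokens as they are generated
const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Tell me a short joke." }],
  stream: true
});
let reply = "";
for await (const chunk of stream) {
  reply += chunk.choices[0]?.delta?.content || "";
}
console.log(reply);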

BrowserAI Example
import { BrowserAI } from '@browserai/browserai';
const browserAI = new BrowserAI();
// Load model with progress tracking
await browserAI.loadModel('llama-3.2-1b-instruct', {
  quantization: 'q4f16_1',
  onProgress: (progress) => console.log('Loading:', progress.progress + '%')
});
// Generate text
const response = await browserAI.generateText('Hello, how are you?');
console.log(response.choices[0].message.content);
// Streaming example
const chunks = await browserAI.generateText('Write a story', {
  stream: true,
  temperature: 0.8
});
for await (const chunk of chunks) {
  console.log(chunk.choices[0]?.delta?.content || '');
}

Ollama Local Server
# Install and run Ollama
curl -fsSL https://ollama.ai/install.sh | sh
ollama serve
# Pull and run a model
ollama pull llama3.2:1b
ollama run llama3.2:1b "Hello, world!"

// Query the REST API from JavaScript
fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.2:1b',
    prompt: 'Hello!',
    stream: false
  })
}).then(r => r.json()).then(console.log);
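
Ollama also exposes a chat-style endpoint for multi-turn conversations. A sketch of /api/chat; with stream: false the reply arrives as a single JSON object whose message.content holds the generated text:
// Multi-turn chat via the /api/chat endpoint
fetch('http://localhost:11434/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.2:1b',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'Hello!' }
    ],
    stream: false
  })
}).then(r => r.json()).then(data => console.log(data.message.content));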

LM Studio Desktop App
// LM Studio provides a local server compatible with the OpenAI API
const response = await fetch('http://localhost:1234/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer lm-studio'
  },
  body: JSON.stringify({
    model: 'llama-3.2-1b-instruct',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'Hello!' }
    ],
    temperature: 0.7,
    max_tokens: 100
  })
});
const data = await response.json();
console.log(data.choices[0].message.content);
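
Because the server speaks the OpenAI wire format, you can also point the official openai npm client at it instead of hand-rolling fetch calls. A minimal sketch, assuming the server runs on its default port:
import OpenAI from 'openai';

// Reuse the OpenAI SDK against the local LM Studio server
const client = new OpenAI({
  baseURL: 'http://localhost:1234/v1',
  apiKey: 'lm-studio' // placeholder; local servers typically ignore the key
});

const completion = await client.chat.completions.create({
  model: 'llama-3.2-1b-instruct',
  messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(completion.choices[0].message.content);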

Provider API Example (Together AI)
import Together from "together-ai";
const together = new Together({
  apiKey: process.env.TOGETHER_API_KEY,
});
const response = await together.chat.completions.create({
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain quantum computing in simple terms." }
  ],
  model: "meta-llama/Llama-3.2-3B-Instruct-Turbo",
  max_tokens: 500,
  temperature: 0.7,
  stream: true,
});
// Handle streaming response
for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}

Vision Model Example
import { CreateMLCEngine } from "@mlc-ai/web-llm";
// Load a vision-language model (check WebLLM's prebuilt model list for currently available IDs)
const engine = await CreateMLCEngine("Llava-1.5-7B-q4f16_1-MLC");
// Process image and text
const response = await engine.chat.completions.create({
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What do you see in this image?" },
        {
          type: "image_url",
          image_url: { url: "data:image/jpeg;base64,..." }
        }
      ]
    }
  ]
});
console.log(response.choices[0].message.content);
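
The image_url block expects a complete data URL; the "..." above stands in for the base64 payload. A small browser-side helper using the standard FileReader API can produce one from a file input:
// Convert a user-selected image file into a base64 data URL
function fileToDataURL(file) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result);
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(file);
  });
}

// Usage: const url = await fileToDataURL(fileInput.files[0]);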

Best Practices
Performance Optimization
- Use quantized models for faster inference
- Implement proper caching strategies (see the sketch after this list)
- Optimize batch sizes for throughput
- Monitor memory usage and clean up unused models
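
Caching can start as simple memoization of repeated prompts. A minimal in-memory sketch; the key scheme and eviction policy here are illustrative, not prescriptive:
// Memoize completions keyed by prompt; evict the oldest entry past a size cap
const cache = new Map();
const MAX_ENTRIES = 100;

async function cachedGenerate(engine, prompt) {
  if (cache.has(prompt)) return cache.get(prompt);
  const response = await engine.chat.completions.create({
    messages: [{ role: 'user', content: prompt }]
  });
  const text = response.choices[0].message.content;
  if (cache.size >= MAX_ENTRIES) {
    cache.delete(cache.keys().next().value); // Maps iterate in insertion order
  }
  cache.set(prompt, text);
  return text;
}
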
User Experience
- Show loading progress for model downloads
- Implement streaming for long responses
- Provide fallback options when local inference fails (see the sketch after this list)
- Handle errors gracefully
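
Fallbacks and error handling fit naturally together: try local inference first and degrade to a hosted API. A sketch with hypothetical generateLocal and generateHosted helpers standing in for any of the approaches shown above:
// Try the local engine first; fall back to a hosted provider on failure
async function generateWithFallback(prompt) {
  try {
    return await generateLocal(prompt); // e.g. WebLLM or Ollama (hypothetical helper)
  } catch (err) {
    console.warn('Local inference failed, falling back:', err.message);
    return await generateHosted(prompt); // e.g. Together AI (hypothetical helper)
  }
}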