Load a local LLM and run your first prompt/response cycle.
This example introduces the fundamental concepts of working with a Large Language Model (LLM) running locally on your machine. It demonstrates the simplest possible interaction: loading a model and asking it a question.
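A local LLM is an AI language model that runs entirely on your own computer, with no internet connection or external API calls. The key benefits follow directly: prompts and data stay on your machine, the model works offline, and there are no per-request API fees.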
┌─────────────────────────────┐
│  Qwen3-1.7B-Q8_0.gguf       │
│  (Model Weights File)       │
│                             │
│  • Stores learned patterns  │
│  • Quantized for efficiency │
│  • Loaded into RAM/VRAM     │
└─────────────────────────────┘
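To put numbers on "quantized for efficiency" and "loaded into RAM/VRAM", here is a rough size estimate. It is only a sketch: the parameter count is read off the file name, the figure of about 8.5 bits per weight for Q8_0 assumes blocks of 32 8-bit weights plus one 16-bit scale, and the function name is made up for illustration. Real files also carry metadata and may store some tensors at other precisions.

```javascript
// Rough estimate of how much memory a model's weights occupy.
// Assumptions (not from the original example): the parameter count and
// bits-per-weight values are approximations; real GGUF files add metadata.
function estimateWeightBytes(parameterCount, bitsPerWeight) {
  return (parameterCount * bitsPerWeight) / 8;
}

const GIB = 1024 ** 3;
const params = 1.7e9; // "1.7B" in the file name

const q8 = estimateWeightBytes(params, 8.5); // Q8_0: about 8.5 bits per weight
const f16 = estimateWeightBytes(params, 16); // unquantized 16-bit floats

console.log(`Q8_0: ~${(q8 / GIB).toFixed(2)} GiB`);
console.log(`F16:  ~${(f16 / GIB).toFixed(2)} GiB`);
```

Under these assumptions the quantized weights need a little over half the memory of the 16-bit version, which is why quantized files are the usual choice for local use.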
User Input  →  Model  →  Generation  →  Response
    ↓            ↓           ↓             ↓
 "Hello"      Context     Sampling    "Hi there!"
Flow Diagram:
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Prompt  │ --> │ Context  │ --> │  Model   │ --> │ Response │
│          │     │ (Memory) │     │(Weights) │     │  (Text)  │
└──────────┘     └──────────┘     └──────────┘     └──────────┘
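The flow above can be sketched in code roughly as follows. This assumes the v3-style node-llama-cpp API (`getLlama`, `loadModel`, `createContext`, `LlamaChatSession`); the function name `askOnce` and the model path in the usage comment are hypothetical. The library module is passed in as a parameter so the function can be exercised with a stand-in object when no model is available.

```javascript
// Sketch of the Prompt -> Context -> Model -> Response flow.
// `api` is expected to be the imported node-llama-cpp module (a v3-style
// API is an assumption here); passing it in keeps the function testable.
async function askOnce(api, modelPath, prompt) {
  const llama = await api.getLlama();                 // load the llama.cpp bindings
  const model = await llama.loadModel({ modelPath }); // weights into RAM/VRAM
  const context = await model.createContext();        // the working memory
  const session = new api.LlamaChatSession({
    contextSequence: context.getSequence(),
  });
  try {
    return await session.prompt(prompt);              // generate the response text
  } finally {
    // Free native memory even if generation throws.
    await context.dispose();
    await model.dispose();
  }
}

// Usage (assumes node-llama-cpp is installed; the path is hypothetical):
//   import * as nodeLlamaCpp from "node-llama-cpp";
//   const answer = await askOnce(
//     nodeLlamaCpp, "./models/Qwen3-1.7B-Q8_0.gguf", "do you know node-llama?");
//   console.log(answer);
```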
The context is the model's working memory:
┌─────────────────────────────────────────┐
│             Context Window              │
│   ┌─────────────────────────────────┐   │
│   │ System Prompt (if any)          │   │
│   ├─────────────────────────────────┤   │
│   │ User: "do you know node-llama?" │   │
│   ├─────────────────────────────────┤   │
│   │ AI: "Yes, I'm familiar..."      │   │
│   ├─────────────────────────────────┤   │
│   │ (Space for more conversation)   │   │
│   └─────────────────────────────────┘   │
└─────────────────────────────────────────┘
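Because the context window has a fixed size, a long conversation eventually has to drop something. The sketch below models that with a token budget: it keeps the system prompt and then as many of the most recent messages as still fit. The one-token-per-four-characters estimate and both function names are assumptions for illustration; a real tokenizer gives different counts, and libraries may use other strategies.

```javascript
// Toy model of a context window: a fixed token budget shared by the
// system prompt and the conversation. Token counts are approximated as
// one token per ~4 characters, which is an assumption, not a real tokenizer.
function approxTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keep the system prompt, then keep the most recent messages that still fit.
function fitToContext(systemPrompt, messages, maxTokens) {
  let used = approxTokens(systemPrompt);
  const kept = [];
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = approxTokens(messages[i].text);
    if (used + cost > maxTokens) break; // oldest messages fall out first
    kept.unshift(messages[i]);
    used += cost;
  }
  return { kept, used, free: maxTokens - used };
}

const conversation = [
  { role: "user", text: "do you know node-llama?" },
  { role: "ai", text: "Yes, I'm familiar with it." },
  { role: "user", text: "Explain how it loads a model." },
];
console.log(fitToContext("You are a helpful assistant.", conversation, 24));
```

With a budget of 24 approximate tokens, the oldest user message no longer fits and only the two most recent messages are kept.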
LLMs don't generate entire sentences at once. They predict one token (word piece) at a time:
Prompt: "What is AI?"
Generation Process:
"What is AI?"         → [Model] → "AI"
"What is AI? AI"      → [Model] → "is"
"What is AI? AI is"   → [Model] → "a"
"What is AI? AI is a" → [Model] → "field"
... continues until stop condition
Visualization:
 Input Prompt
      ↓
┌────────────┐
│   Model    │ → Token 1: "AI"
│ Processes  │ → Token 2: "is"
│ & Predicts │ → Token 3: "a"
└────────────┘ → Token 4: "field"
               → ...
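The loop behind this picture can be sketched with a toy stand-in for the model. Here a small lookup table plays the role of the predictor and only looks at the latest token, whereas a real LLM scores every token in its vocabulary using the whole context; the table entries and the `<end>` marker are invented for illustration.

```javascript
// Toy next-token predictor: a lookup table stands in for the model.
// Real LLMs compute a probability for every token in the vocabulary;
// this table and its entries are made up for illustration.
const nextToken = {
  "What is AI?": "AI",
  "AI": "is",
  "is": "a",
  "a": "field",
  "field": "<end>",
};

function generate(prompt, maxTokens = 16) {
  const output = [];
  let last = prompt; // the toy model only looks at the latest token
  for (let i = 0; i < maxTokens; i++) {
    const token = nextToken[last] ?? "<end>";
    if (token === "<end>") break; // stop condition
    output.push(token);           // append, then feed back in
    last = token;
  }
  return output;
}

console.log(generate("What is AI?").join(" ")); // AI is a field
```

Generation stops either when the end marker is predicted or when `maxTokens` is reached, which mirrors the two usual stop conditions: an end-of-sequence token or a length limit.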
The way you phrase questions affects the response:
❌ Poor: "node-llama-cpp"
✅ Better: "do you know node-llama-cpp"
✅ Best: "Explain what node-llama-cpp is and how it works"
LLMs consume significant resources:
   Model Loading
         ↓
┌─────────────────┐
│ RAM/VRAM Usage  │ ← Models need gigabytes
│ CPU/GPU Time    │ ← Inference takes time
│ Memory Leaks?   │ ← Must clean up properly
└─────────────────┘
         ↓
  Proper Disposal
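One way to make cleanup hard to forget is to wrap acquisition, use, and disposal in a single helper. The sketch below is a hypothetical utility, not part of any library: it assumes the resource exposes an async `dispose()` method, and it reports elapsed time and the change in resident memory (RSS) using Node's `process.hrtime.bigint()` and `process.memoryUsage()`. RSS covers system RAM only, so memory held on a GPU would not show up in this figure.

```javascript
// Run `use(resource)` and always dispose the resource afterwards, while
// reporting elapsed time and the change in resident memory (RSS).
// The helper name and the shape of `resource` are hypothetical.
async function withResource(acquire, use) {
  const startRss = process.memoryUsage().rss;
  const startTime = process.hrtime.bigint();
  const resource = await acquire();
  try {
    const result = await use(resource);
    const elapsedMs = Number(process.hrtime.bigint() - startTime) / 1e6;
    const rssDeltaBytes = process.memoryUsage().rss - startRss;
    return { result, elapsedMs, rssDeltaBytes };
  } finally {
    await resource.dispose(); // runs on success and on error
  }
}
```

The `finally` block is the important part: disposal happens whether `use` returns normally or throws, so a failed prompt does not leave gigabytes of weights allocated.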
This basic example establishes the foundation for AI agents.
After understanding basic prompting, explore how the layers of the stack fit together:
┌──────────────────────────────────────────────────┐
│                 Your Application                 │
│  ┌────────────────────────────────────────────┐  │
│  │           node-llama-cpp Library           │  │
│  │  ┌──────────────────────────────────────┐  │  │
│  │  │       llama.cpp (C++ Runtime)        │  │  │
│  │  │  ┌────────────────────────────────┐  │  │  │
│  │  │  │       Model File (GGUF)        │  │  │  │
│  │  │  │     • Qwen3-1.7B-Q8_0.gguf     │  │  │  │
│  │  │  └────────────────────────────────┘  │  │  │
│  │  └──────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘
                         ↕
                  ┌──────────────┐
                  │  CPU / GPU   │
                  └──────────────┘
This layered architecture allows you to build sophisticated AI agents on top of basic LLM interactions.