Concept: Basic LLM Interaction

Overview

This example introduces the fundamental concepts of working with a Large Language Model (LLM) running locally on your machine. It demonstrates the simplest possible interaction: loading a model and asking it a question.

What is a Local LLM?

A local LLM is an AI language model that runs entirely on your own computer. Once the model file has been downloaded, it needs no internet connection and makes no external API calls. Key benefits:

Privacy: Your data never leaves your machine
Cost: No per-token API charges
Control: Full control over model selection and parameters
Offline: Works without internet connection

Core Components

1. Model Files (GGUF Format)

┌─────────────────────────────┐
│   Qwen3-1.7B-Q8_0.gguf      │
│   (Model Weights File)      │
│                             │
│  • Stores learned patterns  │
│  • Quantized for efficiency │
│  • Loaded into RAM/VRAM     │
└─────────────────────────────┘

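GGUF: Binary file format used by llama.cpp to store model weights and metadata in a single file
Quantization: Stores weights at lower precision to reduce model size (e.g., 8-bit instead of 16-bit)
Trade-off: Smaller files and faster inference in exchange for a slight loss in output quality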
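
To make the size trade-off concrete, the sketch below estimates file size as parameter count times bits per weight. It rests on two assumptions: a nominal 1.7 billion parameters (read off the file name) and roughly 8.5 bits per weight for Q8_0 (8-bit values plus a per-block scale). The function names are illustrative. Real GGUF files also carry metadata and may keep some tensors at other precisions, so actual sizes will differ somewhat.

```typescript
// Rough size estimate: parameters × bits per weight ÷ 8 bits per byte.
function estimateModelBytes(parameterCount: number, bitsPerWeight: number): number {
  return (parameterCount * bitsPerWeight) / 8;
}

// Format a byte count in decimal gigabytes (1 GB = 1e9 bytes).
function toGigabytes(bytes: number): string {
  return (bytes / 1e9).toFixed(2) + " GB";
}

// Assumption: nominal parameter count implied by "1.7B" in the file name.
const nominalParameters = 1.7e9;

// Assumption: Q8_0 packs 32 weights as 32 int8 values plus one 16-bit scale,
// i.e. 34 bytes per 32 weights = 8.5 bits per weight.
const q8BitsPerWeight = 8.5;

console.log("FP16:", toGigabytes(estimateModelBytes(nominalParameters, 16)));
console.log("Q8_0:", toGigabytes(estimateModelBytes(nominalParameters, q8BitsPerWeight)));
```

Under these assumptions the 8-bit file comes out at a little over half the size of the 16-bit one (about 1.81 GB versus 3.40 GB).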

2. The Inference Pipeline

User Input → Model → Generation → Response
    ↓          ↓          ↓           ↓
 "Hello"   Context   Sampling    "Hi there!"

Flow Diagram:

┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Prompt  │ --> │ Context  │ --> │  Model   │ --> │ Response │
│          │     │ (Memory) │     │(Weights) │     │  (Text)  │
└──────────┘     └──────────┘     └──────────┘     └──────────┘

3. Context Window

The context window is the model's working memory. It holds everything the model can see when it produces the next response:

┌─────────────────────────────────────────┐
│           Context Window                │
│  ┌─────────────────────────────────┐   │
│  │ System Prompt (if any)          │   │
│  ├─────────────────────────────────┤   │
│  │ User: "do you know node-llama?" │   │
│  ├─────────────────────────────────┤   │
│  │ AI: "Yes, I'm familiar..."      │   │
│  ├─────────────────────────────────┤   │
│  │ (Space for more conversation)   │   │
│  └─────────────────────────────────┘   │
└─────────────────────────────────────────┘

Limited size (e.g., 2048, 4096, or 8192 tokens)
When full, the oldest messages must be removed to make room for new ones
Every message still in the context influences the next response
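
The removal rule above can be sketched as a small trimming function. This is a simplified illustration, not how any particular library manages its context: the `ChatMessage` type, the function names, and the 16-token budget are hypothetical, and tokens are approximated by counting whitespace-separated words.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assumption: whitespace-separated words stand in for tokens.
// A real tokenizer splits text differently and usually yields more tokens.
function countTokens(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Drop the oldest non-system messages until the total fits the budget.
function fitToContext(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const kept = [...messages];
  const total = () => kept.reduce((sum, m) => sum + countTokens(m.content), 0);
  while (total() > maxTokens) {
    const oldest = kept.findIndex((m) => m.role !== "system");
    if (oldest === -1) break; // only system messages remain; nothing left to drop
    kept.splice(oldest, 1);
  }
  return kept;
}

const conversation: ChatMessage[] = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "do you know node-llama-cpp" },
  { role: "assistant", content: "Yes, I'm familiar with it." },
  { role: "user", content: "How do I load a model?" },
];

// The full conversation counts as 20 "tokens"; a budget of 16 forces
// the oldest user message out while the system prompt stays.
const trimmedConversation = fitToContext(conversation, 16);
console.log(trimmedConversation.map((m) => `${m.role}: ${m.content}`));
```

Keeping the system prompt and dropping the oldest turns first is only one possible policy, chosen here because it matches the diagram above.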

How LLMs Generate Responses

Token-by-Token Generation

LLMs don't generate entire sentences at once. They predict one token (a word or word piece) at a time, appending each prediction to the input before predicting the next:

Prompt: "What is AI?"

Generation Process:
"What is AI?" → [Model] → "AI"
"What is AI? AI" → [Model] → "is"
"What is AI? AI is" → [Model] → "a"
"What is AI? AI is a" → [Model] → "field"
... continues until stop condition

Visualization:

Input Prompt
     ↓
┌────────────┐
│   Model    │ → Token 1: "AI"
│ Processes  │ → Token 2: "is"
│ & Predicts │ → Token 3: "a"
└────────────┘ → Token 4: "field"
                → ...
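
The loop above can be imitated with a toy predictor. In this sketch the lookup table is a hypothetical stand-in for the model: it maps the last token to exactly one next token, whereas a real LLM scores every token in its vocabulary using the whole context and then samples from those scores. The names `nextTokenTable`, `predictNextToken`, and `generate` are illustrative only.

```typescript
// Toy stand-in for a model: last token in, one next token out.
const nextTokenTable: Record<string, string> = {
  "AI?": "AI",
  "AI": "is",
  "is": "a",
  "a": "field",
  "field": "<end>",
};

function predictNextToken(tokens: string[]): string {
  const last = tokens[tokens.length - 1] ?? "";
  return nextTokenTable[last] ?? "<end>";
}

// Generate until the end marker appears or the token limit is reached.
function generate(promptText: string, maxNewTokens: number): string[] {
  const tokens = promptText.split(" ");
  const generated: string[] = [];
  for (let i = 0; i < maxNewTokens; i++) {
    const next = predictNextToken(tokens);
    if (next === "<end>") break; // stop condition 1: end-of-sequence marker
    tokens.push(next); // the prediction becomes part of the next input
    generated.push(next);
  }
  return generated; // stop condition 2: maxNewTokens reached
}

console.log(generate("What is AI?", 10).join(" ")); // prints "AI is a field"
```

The two exits from the loop correspond to the two usual stop conditions: the model emits an end marker, or the caller's token limit is hit.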

Key Concepts for AI Agents

1. Stateless Processing

Each prompt is independent unless you maintain context
The model has no memory between different script runs
To build an "agent", you need to:
- Keep the context alive between prompts
- Maintain conversation history
- Add tools/functions (covered in later examples)
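
The difference between independent prompts and a maintained conversation can be sketched as follows. The `fakeModel` function is a hypothetical stand-in that only reports how many lines of context it received, and the `Conversation` class is illustrative rather than any library's API. The point is that the history lives in your application and is resent with every request.

```typescript
type ReplyFn = (fullPrompt: string) => string;

// Hypothetical stand-in for a model call: reports how much context it saw.
const fakeModel: ReplyFn = (fullPrompt) =>
  `context lines: ${fullPrompt.split("\n").length}`;

// Keeps the conversation history and sends all of it on every turn.
class Conversation {
  private readonly turns: string[] = [];
  private readonly reply: ReplyFn;

  constructor(reply: ReplyFn) {
    this.reply = reply;
  }

  ask(question: string): string {
    this.turns.push(`User: ${question}`);
    const answer = this.reply(this.turns.join("\n"));
    this.turns.push(`AI: ${answer}`);
    return answer;
  }
}

// Stateless: each call sees only its own prompt.
const firstStateless = fakeModel("User: My name is Ada");
const secondStateless = fakeModel("User: What is my name?");
console.log(firstStateless, "|", secondStateless);

// Stateful: the second question arrives together with the earlier turns.
const chat = new Conversation(fakeModel);
chat.ask("My name is Ada");
const withHistory = chat.ask("What is my name?");
console.log(withHistory);
```

In the stateless calls the stand-in sees one line each time; in the maintained conversation the second question arrives with the two earlier turns, so it sees three.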

2. Prompt Engineering Basics

The way you phrase questions affects the response:

❌ Poor: "node-llama-cpp"
✅ Better: "do you know node-llama-cpp"
✅ Best: "Explain what node-llama-cpp is and how it works"

3. Resource Management

LLMs consume significant resources:

Model Loading
     ↓
┌─────────────────┐
│  RAM/VRAM Usage │  ← Models need gigabytes
│  CPU/GPU Time   │  ← Inference takes time
│  Memory Leaks?  │  ← Must clean up properly
└─────────────────┘
     ↓
Proper Disposal
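
One common way to guarantee cleanup is a try/finally block that releases resources in reverse order of acquisition. The sketch below uses a hypothetical `Releasable` interface and an `acquire` helper in place of a real model and context. It is synchronous for brevity, whereas release calls in real libraries are often asynchronous and need to be awaited.

```typescript
// Hypothetical resource with an explicit release step.
interface Releasable {
  readonly label: string;
  dispose(): void;
}

// Records the order in which resources were released.
const disposalLog: string[] = [];

function acquire(label: string): Releasable {
  return {
    label,
    dispose: () => {
      disposalLog.push(label);
    },
  };
}

// Run a task with a "model" and a "context", always cleaning up afterwards.
function withResources<T>(task: () => T): T {
  const acquired: Releasable[] = [];
  try {
    acquired.push(acquire("model"));
    acquired.push(acquire("context"));
    return task();
  } finally {
    // Reverse order: the context depends on the model, so it goes first.
    for (const resource of [...acquired].reverse()) {
      resource.dispose();
    }
  }
}

const answer = withResources(() => "Hi there!");
console.log(answer, disposalLog);

try {
  withResources(() => {
    throw new Error("inference failed");
  });
} catch {
  // The error still propagates, but cleanup has already run.
}
console.log(disposalLog);
```

Because the release step sits in `finally`, it runs both when the task succeeds and when it throws, which is what prevents a failed inference from leaving gigabytes of memory allocated.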

Why This Matters for Agents

This basic example establishes the foundation for AI agents:

Agents need LLMs to "think": The model processes information and generates responses
Agents need context: To maintain state across interactions
Agents need structure: Later examples add tools, memory, and reasoning loops

Next Steps

After understanding basic prompting, explore:

System prompts: Giving the model a specific role or behavior
Function calling: Allowing the model to use tools
Memory: Persisting information across sessions
Reasoning patterns: Such as ReAct (Reasoning + Acting)

Diagram: Complete Architecture

┌──────────────────────────────────────────────────┐
│            Your Application                      │
│  ┌────────────────────────────────────────────┐ │
│  │         node-llama-cpp Library             │ │
│  │  ┌──────────────────────────────────────┐  │ │
│  │  │      llama.cpp (C++ Runtime)         │  │ │
│  │  │  ┌────────────────────────────────┐  │  │ │
│  │  │  │   Model File (GGUF)            │  │  │ │
│  │  │  │   • Qwen3-1.7B-Q8_0.gguf       │  │  │ │
│  │  │  └────────────────────────────────┘  │  │ │
│  │  └──────────────────────────────────────┘  │ │
│  └────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
           ↕
    ┌──────────────┐
    │  CPU / GPU   │
    └──────────────┘

This layered architecture allows you to build sophisticated AI agents on top of basic LLM interactions.