Lesson 1 of 14 · basic llm

Introduction

Load a local LLM and run your first prompt/response cycle.

model loading · context window · inference · token generation

Concept: Basic LLM Interaction

Overview

This example introduces the fundamental concepts of working with a Large Language Model (LLM) running locally on your machine. It demonstrates the simplest possible interaction: loading a model and asking it a question.

What is a Local LLM?

A Local LLM is an AI language model that runs entirely on your own computer, without requiring internet connectivity or external API calls. Key benefits:

  • Privacy: Your data never leaves your machine
  • Cost: No per-token API charges
  • Control: Full control over model selection and parameters
  • Offline: Works without an internet connection

Core Components

1. Model Files (GGUF Format)

┌─────────────────────────────┐
│   Qwen3-1.7B-Q8_0.gguf     │
│   (Model Weights File)      │
│                             │
│  • Stores learned patterns  │
│  • Quantized for efficiency │
│  • Loaded into RAM/VRAM     │
└─────────────────────────────┘
  • GGUF: File format optimized for llama.cpp
  • Quantization: Stores weights at lower precision to reduce model size (e.g., 8-bit instead of 16-bit)
  • Trade-off: Smaller size and faster inference vs. slight quality loss
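
As a rough illustration of the size trade-off, a model's footprint can be estimated as parameter count times bits per weight. This is a back-of-the-envelope sketch: `estimateModelBytes` is a hypothetical helper, the 1.7 billion parameter count is read off the file name, and real GGUF files are typically somewhat larger because they also store scaling factors and metadata.

```javascript
// Rough size estimate: parameters × bits per weight, converted to bytes.
// Ignores metadata and per-block scaling factors, so real files are larger.
function estimateModelBytes(paramCount, bitsPerWeight) {
  return (paramCount * bitsPerWeight) / 8;
}

const params = 1.7e9; // assumed from the "1.7B" in the file name
const GB = 1e9;

console.log("16-bit:", estimateModelBytes(params, 16) / GB, "GB");
console.log(" 8-bit:", estimateModelBytes(params, 8) / GB, "GB");
```

Halving the bits per weight halves the estimate, which is why an 8-bit file fits on machines where the 16-bit original would not.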

2. The Inference Pipeline

User Input → Model → Generation → Response
    ↓          ↓          ↓           ↓
 "Hello"   Context   Sampling    "Hi there!"

Flow Diagram:

┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Prompt  │ --> │ Context  │ --> │  Model   │ --> │ Response │
│          │     │ (Memory) │     │(Weights) │     │  (Text)  │
└──────────┘     └──────────┘     └──────────┘     └──────────┘
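
The pipeline above maps onto a handful of library calls. The following is a minimal sketch, assuming node-llama-cpp version 3; the function name `runFirstPrompt` and the model path in the usage comment are placeholders, so adjust them to your setup. The library is imported inside the function so that the file still loads on a machine where it is not installed.

```javascript
// Sketch of one prompt/response cycle, assuming node-llama-cpp v3.
// modelPath is a placeholder; point it at wherever your GGUF file lives.
async function runFirstPrompt(modelPath, prompt) {
  const { getLlama, LlamaChatSession } = await import("node-llama-cpp");

  const llama = await getLlama();                     // bind to the llama.cpp runtime
  const model = await llama.loadModel({ modelPath }); // load weights into RAM/VRAM
  const context = await model.createContext();        // allocate the context window
  const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
  });

  try {
    return await session.prompt(prompt);              // run inference on the prompt
  } finally {
    // Release resources in reverse order of creation.
    session.dispose();
    await context.dispose();
    await model.dispose();
    await llama.dispose();
  }
}

// Example call (not run here):
// runFirstPrompt("./models/Qwen3-1.7B-Q8_0.gguf", "do you know node-llama-cpp")
//   .then((answer) => console.log("AI: " + answer));
```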

3. Context Window

The context is the model's working memory:

┌─────────────────────────────────────────┐
│           Context Window                │
│  ┌─────────────────────────────────┐   │
│  │ System Prompt (if any)          │   │
│  ├─────────────────────────────────┤   │
│  │ User: "do you know node-llama?" │   │
│  ├─────────────────────────────────┤   │
│  │ AI: "Yes, I'm familiar..."      │   │
│  ├─────────────────────────────────┤   │
│  │ (Space for more conversation)   │   │
│  └─────────────────────────────────┘   │
└─────────────────────────────────────────┘
  • Limited size (e.g., 2048, 4096, or 8192 tokens)
  • When full, the oldest messages must be removed (or summarized) to make room for new ones
  • Every message still in the window influences the next response
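
One simple way to handle a full window is to drop the oldest messages while keeping the system prompt. The sketch below is illustrative only: `countTokens` and `trimToFit` are hypothetical helpers, and counting whitespace-separated words is a stand-in for a real tokenizer, which would usually report higher counts.

```javascript
// Crude token counter: splits on whitespace. Real tokenizers split text into
// word pieces, so actual counts are usually higher. Assumption for illustration.
function countTokens(text) {
  return text.split(/\s+/).filter(Boolean).length;
}

// Drop the oldest non-system messages until the history fits the budget.
function trimToFit(messages, maxTokens) {
  const kept = [...messages];
  const total = () => kept.reduce((sum, m) => sum + countTokens(m.content), 0);
  while (total() > maxTokens) {
    const oldest = kept.findIndex((m) => m.role !== "system");
    if (oldest === -1) break; // only the system prompt is left
    kept.splice(oldest, 1);
  }
  return kept;
}

const history = [
  { role: "system", content: "You are a helpful assistant." },  // 5 words
  { role: "user", content: "do you know node-llama-cpp" },      // 4 words
  { role: "assistant", content: "Yes, I'm familiar with it." }, // 5 words
  { role: "user", content: "How do I load a model?" },          // 6 words
];

// 20 words total; a budget of 16 forces the oldest user message out.
console.log(trimToFit(history, 16).map((m) => m.role));
```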

How LLMs Generate Responses

Token-by-Token Generation

LLMs don't generate entire sentences at once. They predict one token (a word or word piece) at a time, appending each new token to the input before predicting the next:

Prompt: "What is AI?"

Generation Process:
"What is AI?" → [Model] → "AI"
"What is AI? AI" → [Model] → "is"
"What is AI? AI is" → [Model] → "a"
"What is AI? AI is a" → [Model] → "field"
... continues until a stop condition (an end-of-sequence token or a token limit)

Visualization:

Input Prompt
     ↓
┌────────────┐
│   Model    │ → Token 1: "AI"
│ Processes  │ → Token 2: "is"
│ & Predicts │ → Token 3: "a"
└────────────┘ → Token 4: "field"
                → ...
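
The loop structure can be imitated with a toy "model". This is a sketch under heavy simplification: the lookup table `nextToken` and the `<eos>` marker are made up for illustration, whereas a real LLM computes a probability distribution over its whole vocabulary from the full context and samples from it.

```javascript
// Toy "model": maps the last token to a fixed next token.
// Only the loop structure resembles real generation.
const nextToken = { "AI?": "AI", AI: "is", is: "a", a: "field", field: "<eos>" };

function generate(promptTokens, maxNewTokens) {
  const tokens = [...promptTokens];
  const generated = [];
  for (let i = 0; i < maxNewTokens; i++) {
    const last = tokens[tokens.length - 1];
    const token = nextToken[last] ?? "<eos>";
    if (token === "<eos>") break; // stop condition: end-of-sequence
    tokens.push(token);           // each output is fed back in as input
    generated.push(token);
  }
  return generated;
}

console.log(generate(["What", "is", "AI?"], 10).join(" ")); // AI is a field
```

Both stop conditions appear here: the loop ends when the toy model emits `<eos>` or when `maxNewTokens` is reached, whichever comes first.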

Key Concepts for AI Agents

1. Stateless Processing

  • Each prompt is independent unless you maintain context
  • The model has no memory between different script runs
  • To build an "agent", you need to:
    • Keep the context alive between prompts
    • Maintain conversation history
    • Add tools/functions (covered in later examples)
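
The difference between a stateless call and a maintained conversation can be shown without a real model. In this sketch, `fakeModel` and `createChat` are hypothetical stand-ins: the stub can only "remember" what is present in the messages it receives, which mirrors how an LLM depends on its context.

```javascript
// Stand-in for a real model call. It knows only what is in `messages`.
function fakeModel(messages) {
  const prefix = "My name is ";
  const nameMsg = messages.find(
    (m) => m.role === "user" && m.content.startsWith(prefix)
  );
  const last = messages[messages.length - 1].content;
  if (last === "What is my name?") {
    return nameMsg ? nameMsg.content.slice(prefix.length) : "I don't know.";
  }
  return "OK.";
}

// Keeps the conversation history and sends all of it on every turn.
function createChat(model) {
  const history = [];
  return function send(userText) {
    history.push({ role: "user", content: userText });
    const reply = model(history);
    history.push({ role: "assistant", content: reply });
    return reply;
  };
}

const chat = createChat(fakeModel);
chat("My name is Ada");
console.log(chat("What is my name?"));                                   // Ada
console.log(fakeModel([{ role: "user", content: "What is my name?" }])); // I don't know.
```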

2. Prompt Engineering Basics

The way you phrase questions affects the response:

❌ Poor: "node-llama-cpp"
✅ Better: "do you know node-llama-cpp"
✅ Best: "Explain what node-llama-cpp is and how it works"

3. Resource Management

LLMs consume significant resources:

Model Loading
     ↓
┌─────────────────┐
│  RAM/VRAM Usage │  ← Models need gigabytes
│  CPU/GPU Time   │  ← Inference takes time
│  Memory Leaks?  │  ← Must clean up properly
└─────────────────┘
     ↓
Proper Disposal
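
A common way to guarantee cleanup is a try/finally wrapper that releases resources in reverse order of creation, even when the work fails. The helper below is a generic sketch: `withResources` is a hypothetical name, and it assumes each resource exposes a `dispose()` method.

```javascript
// Runs `work`, then disposes every resource in reverse order of creation,
// even if `work` throws. Each resource is assumed to expose dispose().
async function withResources(resources, work) {
  try {
    return await work(...resources);
  } finally {
    for (const resource of [...resources].reverse()) {
      await resource.dispose();
    }
  }
}

// Fake resources that record when they are released.
const released = [];
const fake = (name) => ({ name, dispose: async () => released.push(name) });

withResources([fake("model"), fake("context"), fake("session")], async () => "done")
  .then((result) => console.log(result, released));
```

Disposing in reverse order matters because later resources (a session) depend on earlier ones (the context and model).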

Why This Matters for Agents

This basic example establishes the foundation for AI agents:

  1. Agents need LLMs to "think": The model processes information and generates responses
  2. Agents need context: To maintain state across interactions
  3. Agents need structure: Later examples add tools, memory, and reasoning loops

Next Steps

After understanding basic prompting, explore:

  • System prompts: Giving the model a specific role or behavior
  • Function calling: Allowing the model to use tools
  • Memory: Persisting information across sessions
  • Reasoning patterns: Like ReAct (Reasoning + Acting)

Diagram: Complete Architecture

┌──────────────────────────────────────────────────┐
│            Your Application                      │
│  ┌────────────────────────────────────────────┐ │
│  │         node-llama-cpp Library             │ │
│  │  ┌──────────────────────────────────────┐  │ │
│  │  │      llama.cpp (C++ Runtime)         │  │ │
│  │  │  ┌────────────────────────────────┐  │  │ │
│  │  │  │   Model File (GGUF)            │  │  │ │
│  │  │  │   • Qwen3-1.7B-Q8_0.gguf       │  │  │ │
│  │  │  └────────────────────────────────┘  │  │ │
│  │  └──────────────────────────────────────┘  │ │
│  └────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
           ↕
    ┌──────────────┐
    │  CPU / GPU   │
    └──────────────┘

This layered architecture allows you to build sophisticated AI agents on top of basic LLM interactions.