Stream tokens in real time and enforce hard token budgets.
This example demonstrates streaming responses and token limits, two essential techniques for building responsive AI agents with controlled output.
User sends prompt
↓
[Wait 10 seconds...]
↓
Complete response appears all at once
Problems: the user waits with no feedback and cannot tell whether the request is still running.
User sends prompt
↓
"Hoisting" (0.1s) → User sees first word!
↓
"is a" (0.2s) → More text appears
↓
"JavaScript" (0.3s) → Continuous feedback
↓
[Continues token by token...]
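Benefits: the first words appear within a fraction of a second, and the user sees continuous progress while the rest is generated.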
LLMs generate one token at a time internally. Streaming exposes this:
Internal LLM Process:
┌─────────────────────────────────────┐
│ Token 1: "Hoisting"                 │
│ Token 2: "is"                       │
│ Token 3: "a"                        │
│ Token 4: "JavaScript"               │
│ Token 5: "mechanism"                │
│ ...                                 │
└─────────────────────────────────────┘
Without Streaming:           With Streaming:
Wait for all tokens          Emit each token immediately
└─→ Buffer → Return          └─→ Callback → Display
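The two columns above can be sketched in a few lines. This is a minimal illustration rather than the library's implementation: generateTokens is a hypothetical stand-in for the model, and respondBuffered and respondStreaming are names invented for this example.

```javascript
// Hypothetical stand-in for a model: yields one token at a time.
function* generateTokens() {
  yield* ["Hoisting", " is", " a", " JavaScript", " mechanism"];
}

// Without streaming: collect every token, then return the whole string.
function respondBuffered(tokens) {
  let buffer = "";
  for (const token of tokens) buffer += token;
  return buffer;
}

// With streaming: hand each token to the callback as soon as it exists.
function respondStreaming(tokens, onTextChunk) {
  let full = "";
  for (const token of tokens) {
    onTextChunk(token);
    full += token;
  }
  return full;
}

const seen = [];
const streamed = respondStreaming(generateTokens(), (chunk) => seen.push(chunk));
const buffered = respondBuffered(generateTokens());
```

Both paths end with the same text. The only difference is when the caller first gets to see any of it.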
┌────────────────────────────────────┐
│          Model Generation          │
└────────────┬───────────────────────┘
             │
    ┌────────┴─────────┐
    │  Each new token  │
    └────────┬─────────┘
             ↓
    ┌────────────────────┐
    │ onTextChunk(text)  │ ← Your callback
    └────────┬───────────┘
             ↓
    Your code processes it:
      • Display to user
      • Send over network
      • Log to file
      • Analyze content
Without limits, models might generate:
User: "Explain hoisting"
Model: [Generates 10,000 words including:
- Complete JavaScript history
- Every edge case
- Unrelated examples
- Never stops...]
With limits:
User: "Explain hoisting"
Model: [Generates ~1500 words
- Core concept
- Key examples
- Stops at 2000 tokens]
Context Window: 4096 tokens
├─ System Prompt: 200 tokens
├─ User Message: 100 tokens
├─ Response (maxTokens): 2000 tokens
└─ Remaining for history: 1796 tokens
Total used: 2300 tokens
Available: 1796 tokens for future conversation
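The arithmetic in this breakdown is easy to check in code. The sketch below uses a hypothetical helper, remainingForHistory, which is not part of any library, and reuses the numbers from the breakdown above.

```javascript
// Hypothetical helper: how many tokens are left for history after the
// fixed parts of the context window have been reserved.
function remainingForHistory({ contextSize, systemPrompt, userMessage, maxTokens }) {
  const used = systemPrompt + userMessage + maxTokens;
  if (used > contextSize) {
    throw new RangeError(`Budget of ${used} tokens exceeds context size ${contextSize}`);
  }
  return { used, remaining: contextSize - used };
}

const budget = remainingForHistory({
  contextSize: 4096,
  systemPrompt: 200,
  userMessage: 100,
  maxTokens: 2000,
});
// budget.used is 2300 and budget.remaining is 1796, matching the figures above.
```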
Token Limit      Output Quality         Use Case
───────────      ───────────────        ─────────────────
100              Brief, may be cut      Quick answers
500              Concise but complete   Short explanations
2000 (example)   Detailed               Full explanations
No limit         Risk of rambling       When length unknown
User: "Explain closures"
↓
Terminal: "A closure is a function..."
(Appears word by word, like typing)
↓
User sees progress, knows it's working
Browser              Server
│                         │
├─── Send prompt ────────→│
│                         │
│←── Chunk 1: "Closures"──┤
│  (Display immediately)  │
│                         │
│←── Chunk 2: "are"───────┤
│   (Append to display)   │
│                         │
│←── Chunk 3: "functions"─┤
│   (Keep appending...)   │
Implementation:
       onTextChunk(text)
               │
    ┌──────────┼──────────┐
    ↓          ↓          ↓
 Console   WebSocket   Log File
 Display →  Client  →  Storage
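One way to get this fan-out is to build a single callback from several sinks. The sketch below makes some assumptions: fanOut is a hypothetical helper, fakeSocket stands in for a real WebSocket connection, and an in-memory array stands in for the log file.

```javascript
// Hypothetical helper: combine several sinks into one onTextChunk callback.
// A failing sink is reported but does not stop the others.
function fanOut(...sinks) {
  return (text) => {
    for (const sink of sinks) {
      try {
        sink(text);
      } catch (err) {
        console.error("sink failed:", err.message);
      }
    }
  };
}

const sent = [];      // what the client would receive
const logLines = [];  // what would be written to storage
const fakeSocket = { send: (data) => sent.push(data) }; // stand-in for a WebSocket

const onTextChunk = fanOut(
  (text) => process.stdout.write(text),                               // console display
  (text) => fakeSocket.send(JSON.stringify({ type: "chunk", text })), // network
  (text) => logLines.push(text),                                      // log
);

onTextChunk("Closures");
onTextChunk(" are");
```

Keeping each sink independent means a dropped connection or a full disk does not interrupt what the user sees in the terminal.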
Time to First Token (TTFT):
├─ Small model (1.7B): ~100ms
├─ Medium model (8B): ~200ms
└─ Large model (20B): ~500ms
Tokens Per Second:
├─ Small model: 50-80 tok/s
├─ Medium model: 20-35 tok/s
└─ Large model: 10-15 tok/s
User Experience:
TTFT < 500ms → Feels instant
Tok/s > 20 → Reads naturally
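Both numbers can be measured from inside the streaming callback. The sketch below is one possible tracker. createStreamMetrics is a hypothetical name, the clock is injectable so the example is deterministic, and it counts chunks rather than tokens, which is only an approximation when a chunk holds more than one token.

```javascript
// Hypothetical tracker for time to first token and throughput.
function createStreamMetrics(now = Date.now) {
  const startedAt = now();
  let firstChunkAt = null;
  let chunks = 0;
  return {
    onTextChunk() {
      if (firstChunkAt === null) firstChunkAt = now(); // first output seen
      chunks += 1;
    },
    report() {
      const elapsedMs = now() - startedAt;
      return {
        ttftMs: firstChunkAt === null ? null : firstChunkAt - startedAt,
        chunksPerSecond: elapsedMs > 0 ? chunks / (elapsedMs / 1000) : 0,
      };
    },
  };
}

// Deterministic demo with a fake clock (milliseconds).
let fakeTime = 0;
const metrics = createStreamMetrics(() => fakeTime);
fakeTime = 120;
metrics.onTextChunk("Hoisting");
fakeTime = 1000;
metrics.onTextChunk(" is");
const report = metrics.report();
// report.ttftMs is 120 and report.chunksPerSecond is 2
```

In real use, call createStreamMetrics() with no arguments just before sending the prompt and pass metrics.onTextChunk as the streaming callback.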
Model Size   Memory   Speed    Quality
──────────   ──────   ──────   ───────
1.7B         ~2 GB    Fast     Good
8B           ~6 GB    Medium   Better
20B          ~12 GB   Slower   Best
No Buffer (Immediate)
Every token → callback → display
└─ Smoothest UX but more overhead
Line Buffer
Accumulate until newline → flush
└─ Better for paragraph-based output
Time Buffer
Accumulate for 50ms → flush batch
└─ Reduces callback frequency
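The line and time strategies can each be written as a small wrapper around the display function. This is a sketch under stated assumptions: createLineBuffer and createTimeBuffer are hypothetical names, and the time buffer only checks the clock when a new chunk arrives, so end() must be called when generation finishes to flush whatever is left.

```javascript
// Line buffer: hold text until a newline arrives, then flush complete lines.
function createLineBuffer(flush) {
  let pending = "";
  return {
    push(chunk) {
      pending += chunk;
      const lastNewline = pending.lastIndexOf("\n");
      if (lastNewline === -1) return; // no complete line yet
      flush(pending.slice(0, lastNewline + 1));
      pending = pending.slice(lastNewline + 1);
    },
    end() {
      if (pending) flush(pending);
      pending = "";
    },
  };
}

// Time buffer: flush at most once per interval (50 ms in the text above).
function createTimeBuffer(flush, intervalMs = 50, now = Date.now) {
  let pending = "";
  let lastFlush = now();
  const flushNow = () => {
    if (pending) flush(pending);
    pending = "";
    lastFlush = now();
  };
  return {
    push(chunk) {
      pending += chunk;
      if (now() - lastFlush >= intervalMs) flushNow();
    },
    end: flushNow,
  };
}

const lines = [];
const lineBuffer = createLineBuffer((text) => lines.push(text));
for (const chunk of ["Hoisting ", "moves\ndecl", "arations\n", "up"]) lineBuffer.push(chunk);
lineBuffer.end();
// lines: ["Hoisting moves\n", "declarations\n", "up"]

let clock = 0; // fake clock in milliseconds
const batches = [];
const timeBuffer = createTimeBuffer((text) => batches.push(text), 50, () => clock);
clock = 10; timeBuffer.push("a");
clock = 30; timeBuffer.push("b");
clock = 60; timeBuffer.push("c"); // 60 ms since the last flush: emits "abc"
clock = 70; timeBuffer.push("d");
timeBuffer.end();                 // emits the leftover "d"
```

The unbuffered strategy needs no wrapper: pass the display function directly as the callback.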
Generation in progress:
"The answer is clearly... wait, actually..."
↑
onTextChunk detects issue
↓
Stop generation
↓
"Let me reconsider"
Useful for:
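The early-stop flow in the diagram can be sketched with the standard AbortController. Everything model-related here is an assumption: runGeneration is a hypothetical stand-in for a model run that checks an abort signal between chunks, and the "wait, actually" rule is only an example detector. Whether and how a real library accepts an abort signal should be checked in its documentation.

```javascript
// Hypothetical stand-in for a model run: emits chunks until it is done
// or until the abort signal fires.
function runGeneration(chunks, { onTextChunk, signal }) {
  let output = "";
  for (const chunk of chunks) {
    if (signal.aborted) return { output, stopReason: "aborted" };
    output += chunk;
    onTextChunk(chunk);
  }
  return { output, stopReason: "finished" };
}

const controller = new AbortController();
let soFar = "";

const result = runGeneration(
  ["The answer", " is clearly...", " wait, actually", " it is 42", "."],
  {
    signal: controller.signal,
    onTextChunk(chunk) {
      soFar += chunk;
      // Example rule: treat a mid-answer reversal as the cue to stop.
      if (/wait, actually/i.test(soFar)) controller.abort();
    },
  },
);
// result.stopReason is "aborted"; the last two chunks were never generated.
```

The check runs against the accumulated text rather than the single chunk, because a phrase can be split across chunk boundaries.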
Partial Response Analysis:
┌──────────────────────────────────┐
│ "To implement this feature..."   │
│                                  │
│ ← Already useful information     │
│                                  │
│ "...you'll need: 1) Node.js"     │
│                                  │
│ ← Can start acting on this       │
│                                  │
│ "2) Express framework"           │
└──────────────────────────────────┘
Agent can begin working before response completes!
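As an illustration of acting on a partial response, the sketch below pulls numbered items such as "1) Node.js" out of the stream as soon as each one is complete. createStepExtractor is a hypothetical helper, and the "digits followed by a closing parenthesis" format is an assumption taken from the example above. Real output would need a more robust parser.

```javascript
// Hypothetical helper: emit each numbered item once the next marker shows
// that it is complete. The last item is held back until end().
function createStepExtractor(onStep) {
  let text = "";
  let emitted = 0;
  const extract = (final) => {
    // Split on markers such as "1) " or "2) "; the first piece is the preamble.
    const parts = text.split(/\s*\d+\)\s+/).slice(1);
    const complete = final ? parts.length : parts.length - 1;
    while (emitted < complete) {
      const step = parts[emitted].trim();
      if (step) onStep(step);
      emitted += 1;
    }
  };
  return {
    push(chunk) {
      text += chunk;
      extract(false);
    },
    end() {
      extract(true);
    },
  };
}

const steps = [];
const extractor = createStepExtractor((step) => steps.push(step));
for (const chunk of ["To implement this feature...", " you'll need: 1) Node", ".js 2) Exp", "ress framework"]) {
  extractor.push(chunk);
}
const earlySteps = steps.slice(); // ["Node.js"]: available before the stream ends
extractor.end();
// steps: ["Node.js", "Express framework"]
```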
┌──────────────────────────────────┐
│      Context Window (4096)       │
├──────────────────────────────────┤
│ System Prompt         200 tokens │
│ Conversation History 1000 tokens │
│ Current Prompt        100 tokens │
│ Response Space       2796 tokens │
└──────────────────────────────────┘
If maxTokens > 2796:
└─→ Error or truncation!
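// Tokens left for the response once everything else is counted
let available = contextSize - (systemPromptTokens + historyTokens + promptTokens);
if (maxTokens > available) {
  maxTokens = available; // or drop old history to free up space
}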
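The other option mentioned in the comment, clearing old history, might look like the sketch below. trimHistory is a hypothetical helper, and it assumes each history entry already carries a token count. How those counts are obtained depends on the tokenizer in use.

```javascript
// Hypothetical helper: drop the oldest messages until the history fits
// within the token allowance.
function trimHistory(history, maxHistoryTokens) {
  const kept = [...history];
  let total = kept.reduce((sum, message) => sum + message.tokens, 0);
  while (kept.length > 0 && total > maxHistoryTokens) {
    total -= kept.shift().tokens; // remove the oldest entry first
  }
  return { kept, total };
}

// Allowance for history: context size minus system prompt, prompt, and maxTokens.
const historyAllowance = 4096 - 200 - 100 - 2000; // 1796

const trimmed = trimHistory(
  [
    { id: 1, tokens: 900 },
    { id: 2, tokens: 700 },
    { id: 3, tokens: 600 },
  ],
  historyAllowance,
);
// The oldest message is dropped: trimmed.kept has ids 2 and 3, trimmed.total is 1300.
```

Trimming history keeps the full response budget intact, while lowering maxTokens keeps the full history. Which trade-off is better depends on the application.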
User → LLM (streaming) → Display
└─ onTextChunk shows progress
Step 1: Plan (stream) → Show thinking
Step 2: Act (stream) → Show action
Step 3: Result (stream) → Show outcome
└─ User sees agent's process
Agent A (streaming) ──┐
                      ├─→ Coordinator → User
Agent B (streaming) ──┘
└─ Both stream simultaneously
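A coordinator for two simultaneous streams can be sketched with Promise.all. The agents here are simulated: runAgent is a hypothetical stand-in that emits fixed chunks on a timer, and each chunk is tagged with the agent's name so the coordinator can tell the streams apart.

```javascript
// Hypothetical stand-in for a streaming agent: emits its chunks with a
// small delay between them.
async function runAgent(name, chunks, delayMs, onTextChunk) {
  let full = "";
  for (const chunk of chunks) {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    full += chunk;
    onTextChunk({ agent: name, text: chunk });
  }
  return full;
}

// Coordinator: start both agents at once and collect their tagged chunks
// in arrival order.
async function coordinate() {
  const events = [];
  const onTextChunk = (event) => events.push(event);
  const [a, b] = await Promise.all([
    runAgent("A", ["plan", " ready"], 5, onTextChunk),
    runAgent("B", ["data", " loaded"], 7, onTextChunk),
  ]);
  return { a, b, events };
}
```

Calling coordinate() resolves once both agents finish. The events array holds the interleaved chunks, which a UI could route to separate panels by the agent field.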
✓ Good:
session.prompt(query, { maxTokens: 2000 })
✗ Risky:
session.prompt(query)
└─ May use entire context!
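let fullResponse = '';
let logComplete = false; // stays false while chunks are still arriving
await session.prompt(query, {
  maxTokens: 2000,
  onTextChunk: (chunk) => {
    fullResponse += chunk; // accumulate the full text
    display(chunk); // show immediately
  },
});
// After completion:
logComplete = true;
saveToDatabase(fullResponse);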
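let firstChunk = true;
await session.prompt(query, {
  maxTokens: 2000,
  onTextChunk: (chunk) => {
    if (firstChunk) {
      showLoadingDone(); // hide the loading indicator once output starts
      firstChunk = false;
    }
    appendToDisplay(chunk);
  },
});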
const startTime = Date.now();
let tokenCount = 0;
await session.prompt(query, {
  maxTokens: 2000,
  onTextChunk: (chunk) => {
    tokenCount += estimateTokens(chunk); // estimateTokens is your own helper
    const elapsed = (Date.now() - startTime) / 1000; // seconds
    if (elapsed > 0) updateMetrics(tokenCount / elapsed); // avoid dividing by zero
  },
});
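The snippet above relies on an estimateTokens helper that is not defined in this document. One rough way to write it, assuming English text and the common rule of thumb of about four characters per token, is sketched below. The ratio is an approximation that varies by tokenizer and language, so use the model's own tokenizer when exact counts matter.

```javascript
// Rough token estimate: assumes about 4 characters per token, which is
// only a rule of thumb for English text, not an exact count.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text) {
  if (!text) return 0;
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

const estimate = estimateTokens("JavaScript"); // 10 characters, estimated as 3 tokens
```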
Feature            intro.js    coding.js (this)
────────────────   ─────────   ─────────────────
Streaming          ✗           ✓
Token limit        ✗           ✓ (2000)
Real-time output   ✗           ✓
Progress visible   ✗           ✓
User control       ✗           ✓
This pattern is foundational for building responsive, user-friendly AI agent interfaces.