Large Language Models
Transformers, tokens, context windows, and what LLMs can and cannot do
Overview
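Large language models (LLMs) predict the next token in a sequence. Trained on vast text corpora, they generate coherent prose, code, summaries, and structured outputs. Modern LLMs use the transformer architecture: self-attention lets each token attend to other tokens in the context (in decoder-only models, only to the tokens before it), capturing long-range dependencies without processing the sequence one step at a time. The cost of attention grows with context length, which is one reason context windows are limited.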
LLMs are general-purpose but not omniscient. They excel at language tasks, pattern completion, and reasoning over provided context. They do not browse the web by default, cannot verify facts against reality, and may confabulate plausible-sounding wrong answers.
Syntax / Usage
Key concepts:
| Concept | Description |
|---|---|
| Token | Subword unit (~4 chars English avg); billing and limits are token-based |
| Context window | Max tokens in one request (prompt + completion); e.g. 8K–200K tokens, depending on the model |
| Prompt | Input text/messages the model conditions on |
| Completion | Model-generated continuation |
| Temperature | Randomness (0 = greedy and near-deterministic, higher = more varied) |
| System message | High-level behavior instructions (where supported) |
Rough token estimate: len(text) / 4 for English prose. Code and non-Latin scripts vary.
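The rule of thumb above can be sketched as a small helper. This is only a heuristic: `estimateTokens` and `CHARS_PER_TOKEN` are illustrative names, not part of any tokenizer library, and a real tokenizer for your model will give different counts.

```typescript
// Rough heuristic from the text: about 4 characters per token for English prose.
// CHARS_PER_TOKEN and estimateTokens are hypothetical names for illustration.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string, charsPerToken: number = CHARS_PER_TOKEN): number {
  if (charsPerToken <= 0) {
    throw new RangeError("charsPerToken must be positive");
  }
  // text.length counts UTF-16 code units, so non-Latin scripts will be off.
  // Round up so the estimate errs toward using more of the budget.
  return Math.ceil(text.length / charsPerToken);
}
```

For code or non-Latin text, passing a smaller `charsPerToken` gives a more conservative estimate; for billing or hard limits, count with the provider's own tokenizer instead.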
Transformer flow (simplified):
Text → Tokenizer → Token IDs → Embedding layer
→ N × (Self-Attention + Feed-Forward) blocks
→ Output logits → Sample next token → repeat
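The last step of the flow, turning logits into a sampled token, is where temperature applies. The following is a minimal sketch of temperature-scaled softmax sampling, assuming logits arrive as a plain number array; `softmaxWithTemperature` and `sampleIndex` are hypothetical helper names, and production decoders usually add further steps such as top-k or top-p filtering.

```typescript
// Convert raw logits into a probability distribution at a given temperature.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (logits.length === 0) {
    throw new RangeError("logits must not be empty");
  }
  if (temperature <= 0) {
    // Treat temperature 0 as greedy decoding: all mass on the largest logit.
    const best = logits.indexOf(Math.max(...logits));
    return logits.map((_, i) => (i === best ? 1 : 0));
  }
  const scaled = logits.map((x) => x / temperature);
  const maxScaled = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((x) => Math.exp(x - maxScaled));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Draw an index from the distribution. `rand` is injectable so tests can be deterministic.
function sampleIndex(probs: number[], rand: () => number = Math.random): number {
  const r = rand();
  let cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (r < cumulative) return i;
  }
  return probs.length - 1; // guard against floating-point rounding
}
```

Lower temperatures concentrate probability on the top logit, while higher temperatures flatten the distribution, which is why outputs become more varied.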
Examples
Checking context budget before sending docs to an API:
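// Assumed limits for illustration; check your model's actual context window.
const MAX_CONTEXT = 128_000;
const RESERVED_FOR_REPLY = 4_000;

// True if the prompt and documents fit while leaving room for the reply.
function fitsInContext(promptTokens: number, docTokens: number): boolean {
  return promptTokens + docTokens + RESERVED_FOR_REPLY <= MAX_CONTEXT;
}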
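When the documents do not all fit, one option is to keep only as many as the budget allows. The sketch below is one simple greedy approach, not the only way to do it; `Doc` and `selectDocsWithinBudget` are hypothetical names, and the token counts are assumed to come from a tokenizer or estimate computed elsewhere.

```typescript
// Hypothetical document shape: an id plus a precomputed token count.
interface Doc {
  id: string;
  tokens: number;
}

// Greedily keep documents, in the given order, while the running total fits.
// Documents that are too large are skipped; later, smaller ones may still fit.
function selectDocsWithinBudget(
  docs: Doc[],
  promptTokens: number,
  maxContext: number,
  reservedForReply: number,
): Doc[] {
  let remaining = maxContext - reservedForReply - promptTokens;
  const selected: Doc[] = [];
  for (const doc of docs) {
    if (doc.tokens <= remaining) {
      selected.push(doc);
      remaining -= doc.tokens;
    }
  }
  return selected;
}
```

Sorting `docs` by relevance before calling this keeps the most useful material when the budget runs out.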
Capabilities vs limits in product design:
Good fit: draft emails, explain code, transform JSON, classify intent
Risky: legal/medical advice without review, math without verification
Poor fit: real-time facts, guaranteed factual accuracy, recalling secrets or private data the model was never given
Mitigate limits with RAG (retrieval-augmented generation), tool calls, and human review for high-stakes outputs.
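To make the RAG idea concrete, here is a toy sketch of its retrieval half: split a document into chunks, then pick the chunks most related to the query and send only those to the model. The word-overlap score is a deliberately crude stand-in for the embedding similarity a real system would use, and `chunkText`, `overlapScore`, and `topChunks` are hypothetical names.

```typescript
// Split text into chunks of at most `maxChars` characters on whitespace boundaries.
// A single word longer than maxChars is kept whole as its own chunk.
function chunkText(text: string, maxChars: number): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  let current = "";
  for (const word of words) {
    const candidate = current === "" ? word : current + " " + word;
    if (candidate.length > maxChars && current !== "") {
      chunks.push(current);
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

// Count distinct query words that appear in the chunk (toy relevance score).
function overlapScore(query: string, chunk: string): number {
  const chunkWords = new Set(chunk.toLowerCase().split(/\W+/).filter(Boolean));
  const queryWords = new Set(query.toLowerCase().split(/\W+/).filter(Boolean));
  let score = 0;
  queryWords.forEach((w) => {
    if (chunkWords.has(w)) score += 1;
  });
  return score;
}

// Return up to k chunks with a nonzero score, highest score first.
function topChunks(query: string, chunks: string[], k: number): string[] {
  return chunks
    .map((chunk) => ({ chunk, score: overlapScore(query, chunk) }))
    .filter((c) => c.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((c) => c.chunk);
}
```

The selected chunks would then be placed in the prompt alongside the user's question, so the model answers from provided context rather than from memory.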
Common Mistakes
- Stuffing entire codebases into one prompt instead of chunking or RAG
- Assuming the model "knows" your private data; it sees only what you send in the request
- Ignoring token costs at scale (long contexts × many users add up)
- Treating outputs as ground truth without validation
- Using the largest model when a smaller, fine-tuned model suffices
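The point about token costs at scale is easy to check with back-of-the-envelope arithmetic. The sketch below assumes per-million-token pricing supplied by the caller; `UsageProfile`, `Pricing`, and `estimateDailyCost` are hypothetical names, and no real provider's prices are implied.

```typescript
// Hypothetical usage profile; every field is supplied by the caller.
interface UsageProfile {
  promptTokensPerRequest: number;
  completionTokensPerRequest: number;
  requestsPerUserPerDay: number;
  users: number;
}

// Hypothetical pricing shape: price per one million tokens, in any currency.
interface Pricing {
  inputPricePerMillionTokens: number;
  outputPricePerMillionTokens: number;
}

// Estimated daily spend = requests × tokens per request × price per token,
// computed separately for input (prompt) and output (completion) tokens.
function estimateDailyCost(usage: UsageProfile, pricing: Pricing): number {
  const requests = usage.requestsPerUserPerDay * usage.users;
  const inputMillions = (requests * usage.promptTokensPerRequest) / 1_000_000;
  const outputMillions = (requests * usage.completionTokensPerRequest) / 1_000_000;
  return (
    inputMillions * pricing.inputPricePerMillionTokens +
    outputMillions * pricing.outputPricePerMillionTokens
  );
}
```

Because the prompt term scales with context length, trimming what is sent per request (for example, via chunking or retrieval) reduces cost for every user at once.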
See Also
prompt-engineering, embeddings, rag-basics, ai-apis