Large Language Models

Overview

Large language models (LLMs) predict the next token in a sequence. Trained on vast text corpora, they generate coherent prose, code, summaries, and structured outputs. Modern LLMs use the transformer architecture, whose self-attention mechanism lets each token attend directly to other tokens in the context (in decoder-only models, only to the tokens before it), so long-range dependencies are captured without passing information step by step through the sequence.

LLMs are general-purpose but not omniscient. They excel at language tasks, pattern completion, and reasoning over provided context. They do not browse the web by default, cannot verify facts against reality, and may confabulate plausible-sounding wrong answers.

Syntax / Usage

Key concepts:

Concept	Description
Token	Subword unit (~4 chars English avg); billing and limits are token-based
Context window	Max tokens in one request (prompt + completion); e.g. 8K–200K
Prompt	Input text/messages the model conditions on
Completion	Model-generated continuation
Temperature	Sampling randomness (0 = greedy and near-deterministic, higher = more varied)
System message	High-level behavior instructions (where supported)

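Rough token estimate: len(text) / 4 for English prose. Code and non-Latin scripts deviate from this ratio, often using more tokens per character; use the model's own tokenizer when an exact count matters.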
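
The rule of thumb above can be sketched as a small helper. This is a heuristic only, not a real tokenizer; the names estimateTokens and CHARS_PER_TOKEN are illustrative and not part of any library.

```typescript
// Assumption: ~4 characters per token, which only roughly holds for English prose.
// Real counts depend on the specific model's tokenizer.
const CHARS_PER_TOKEN = 4;

// Returns a rough upper-rounded token estimate for a piece of text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}
```

Rounding up keeps the estimate slightly conservative, which is the safer direction when budgeting against a hard context limit.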

Transformer flow (simplified):

Text → Tokenizer → Token IDs → Embedding layer
  → N × (Self-Attention + Feed-Forward) blocks
  → Output logits → Sample next token → repeat
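
The last step of the flow, turning output logits into a next token, is where temperature applies. The sketch below shows one common way to do it (a softmax over temperature-scaled logits, followed by sampling). The function names are illustrative, and real inference stacks typically add further steps such as top-k or top-p filtering.

```typescript
// Convert raw logits into probabilities, scaled by temperature.
// Lower temperature sharpens the distribution; higher temperature flattens it.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (temperature <= 0) {
    throw new RangeError("temperature must be > 0; use argmax for greedy decoding");
  }
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Greedy choice: index of the largest logit (what temperature 0 conventionally means).
function argmax(values: number[]): number {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

// Draw an index from a probability distribution.
// `rand` is injectable so the behaviour can be tested deterministically.
function sampleIndex(probs: number[], rand: () => number = Math.random): number {
  const r = rand();
  let cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (r < cumulative) return i;
  }
  return probs.length - 1; // guard against floating-point rounding
}
```

In a generation loop, the sampled index is mapped back to a token, appended to the sequence, and the model is run again on the extended sequence.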

Examples

Checking context budget before sending docs to an API:

const MAX_CONTEXT = 128_000;
const RESERVED_FOR_REPLY = 4_000;

function fitsInContext(promptTokens: number, docTokens: number): boolean {
  return promptTokens + docTokens + RESERVED_FOR_REPLY <= MAX_CONTEXT;
}
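
When the documents do not all fit, one simple option is to keep only those that do. The sketch below is one possible policy (first-fit in the given order), not a recommended standard; the Doc type and selectDocsWithinBudget are hypothetical names, and token counts are assumed to be computed beforehand.

```typescript
// Hypothetical document record with a precomputed token count.
interface Doc {
  id: string;
  tokens: number;
}

// Walk the documents in order and keep each one that still fits
// in the budget left after the prompt and the reserved reply space.
function selectDocsWithinBudget(
  docs: Doc[],
  promptTokens: number,
  maxContext: number,
  reservedForReply: number
): Doc[] {
  let remaining = maxContext - reservedForReply - promptTokens;
  const selected: Doc[] = [];
  for (const doc of docs) {
    if (doc.tokens > remaining) continue; // too large; a smaller later doc may still fit
    selected.push(doc);
    remaining -= doc.tokens;
  }
  return selected;
}
```

Sorting the documents by relevance before calling this function means the budget is spent on the most useful material first.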

Capabilities vs limits in product design:

Good fit:  draft emails, explain code, transform JSON, classify intent
Risky:     legal/medical advice without review, math without verification
Poor fit:  real-time facts, guaranteed factual accuracy, secret retrieval

Mitigate limits with RAG (retrieval-augmented generation), tool calls, and human review for high-stakes outputs.
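
To make the RAG idea concrete, here is a toy retrieval step: rank text chunks against a question and place the best ones in the prompt. This sketch scores by shared words purely for illustration; production systems usually rank with embedding similarity instead. All names here (uniqueWords, retrieveTopK, buildPrompt) and the prompt wording are hypothetical.

```typescript
// Split text into lowercase, de-duplicated words (English-oriented, illustrative only).
function uniqueWords(text: string): string[] {
  const all = text.toLowerCase().split(/\W+/).filter((w) => w.length > 0);
  return all.filter((w, i) => all.indexOf(w) === i);
}

// Return up to k chunks that share at least one word with the query,
// ordered by how many query words they contain.
function retrieveTopK(query: string, chunks: string[], k: number): string[] {
  const queryWords = uniqueWords(query);
  return chunks
    .map((chunk) => {
      const chunkWords = uniqueWords(chunk);
      const score = queryWords.filter((w) => chunkWords.indexOf(w) !== -1).length;
      return { chunk, score };
    })
    .filter((c) => c.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((c) => c.chunk);
}

// Assemble a prompt that asks the model to answer from the retrieved context.
function buildPrompt(question: string, context: string[]): string {
  return (
    "Answer using only the context below.\n\n" +
    "Context:\n" + context.join("\n---\n") + "\n\n" +
    "Question: " + question
  );
}
```

Because the model only sees the retrieved chunks, retrieval quality bounds answer quality, which is one reason human review remains useful for high-stakes outputs.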

Common Mistakes

Stuffing entire codebases into one prompt instead of chunking or using RAG
Assuming the model "knows" your private data; it only sees what you send
Ignoring token costs at scale (long contexts × many users add up)
Treating outputs as ground truth without validation
Using the largest model when a smaller, fine-tuned model suffices
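
The token-cost point is easy to underestimate, so a quick back-of-the-envelope calculation can help. The sketch below multiplies usage by per-token prices; the type and function names are illustrative, and any prices passed in are placeholders to be replaced with a provider's actual rates.

```typescript
// Hypothetical usage profile for a product feature.
interface UsageProfile {
  users: number;
  requestsPerUserPerDay: number;
  inputTokensPerRequest: number;
  outputTokensPerRequest: number;
}

// Prices in USD per one million tokens. Values are supplied by the caller;
// nothing here reflects any real provider's pricing.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Estimate the daily spend in USD for a given usage profile and price list.
function estimateDailyCostUsd(usage: UsageProfile, pricing: Pricing): number {
  const requests = usage.users * usage.requestsPerUserPerDay;
  const inputCost =
    ((requests * usage.inputTokensPerRequest) / 1_000_000) * pricing.inputPerMillion;
  const outputCost =
    ((requests * usage.outputTokensPerRequest) / 1_000_000) * pricing.outputPerMillion;
  return inputCost + outputCost;
}
```

With placeholder prices of 1 and 2 USD per million input and output tokens, 1,000 users making 10 requests a day at 50,000 input and 500 output tokens each comes to 510 USD per day, almost all of it from the long input context.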
