stackademic

Large Language Models

Transformers, tokens, context windows, and what LLMs can and cannot do

Overview

Large language models (LLMs) predict the next token in a sequence. Trained on vast text corpora, they generate coherent prose, code, summaries, and structured outputs. Modern LLMs use the transformer architecture: self-attention lets each token attend to other tokens in the context (in decoder-only models, only to the tokens before it), capturing long-range dependencies directly rather than through recurrence.
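
To make the attention step concrete, here is a minimal sketch of single-head scaled dot-product attention. It assumes a causal mask (each position sees only itself and earlier positions, as in decoder-only models) and takes the query, key, and value vectors as given; the function names are illustrative, and real models add learned projections, multiple heads, and batching.

```typescript
type Vec = number[];

function dot(a: Vec, b: Vec): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Numerically stable softmax over one row of attention scores.
function softmaxRow(scores: number[]): number[] {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Single-head attention with a causal mask: position i attends to 0..i only.
// q, k, v hold one vector per token (hypothetical inputs for illustration).
function causalSelfAttention(q: Vec[], k: Vec[], v: Vec[]): Vec[] {
  const scale = Math.sqrt(k[0].length);
  return q.map((qi, i) => {
    const scores = k.slice(0, i + 1).map((kj) => dot(qi, kj) / scale);
    const weights = softmaxRow(scores);
    // Output is the attention-weighted sum of the visible value vectors.
    return v[0].map((_, d) =>
      weights.reduce((sum, w, j) => sum + w * v[j][d], 0),
    );
  });
}
```

Each output row is a mixture of value vectors, so its weights sum to 1; the first token can only attend to itself.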

LLMs are general-purpose but not omniscient. They excel at language tasks, pattern completion, and reasoning over provided context. They do not browse the web by default, cannot verify facts against reality, and may confabulate plausible-sounding wrong answers.

Syntax / Usage

Key concepts:

Concept         Description
Token           Subword unit (~4 characters on average in English); billing and limits are token-based
Context window  Maximum tokens in one request (prompt + completion), e.g. 8K–200K depending on the model
Prompt          Input text or messages the model conditions on
Completion      Model-generated continuation
Temperature     Sampling randomness (0 = effectively greedy and near-deterministic; higher = more varied)
System message  High-level behavior instructions (where supported)

Rough token estimate: len(text) / 4 for English prose. Code and non-Latin scripts vary.
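
The rule of thumb above can be written as a small helper. This is a heuristic sketch only: `estimateTokens` and `CHARS_PER_TOKEN` are illustrative names, and a real tokenizer for your model will give different counts, especially for code and non-Latin scripts.

```typescript
// Assumption: ~4 characters per token, the rule of thumb for English prose.
const CHARS_PER_TOKEN = 4;

// Rough token count; rounds up so short strings are never estimated as 0 tokens.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}
```

Use it for budgeting and quick checks, and switch to the provider's tokenizer when the count has to be exact.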

Transformer flow (simplified):

Text → Tokenizer → Token IDs → Embedding layer
  → N × (Self-Attention + Feed-Forward) blocks
  → Output logits → Sample next token → repeat
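
The last step of the flow, turning output logits into a next token, might look like the sketch below. It assumes plain temperature sampling and treats a temperature of 0 as greedy argmax; the function names and the injectable `rand` parameter are illustrative, and production APIs typically layer top-k or top-p filtering on top.

```typescript
// Softmax with temperature: lower values sharpen the distribution.
function softmax(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Returns the index (token ID) of the sampled token.
// `rand` is injectable so the function can be exercised deterministically.
function sampleNextToken(
  logits: number[],
  temperature: number,
  rand: () => number = Math.random,
): number {
  if (logits.length === 0) throw new Error("logits must be non-empty");
  if (temperature <= 0) {
    // Greedy decoding: always pick the highest-scoring token.
    return logits.indexOf(Math.max(...logits));
  }
  const probs = softmax(logits, temperature);
  const r = rand();
  let cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (r < cumulative) return i;
  }
  return probs.length - 1; // guard against floating-point rounding
}
```

Generation repeats this step: append the sampled token to the input, run the model again, and stop at an end-of-sequence token or a length limit.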

Examples

Checking context budget before sending docs to an API:

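// Total context window of the target model, in tokens (model-specific).
const MAX_CONTEXT = 128_000;
// Tokens held back so the completion has room to be generated.
const RESERVED_FOR_REPLY = 4_000;

// True when the prompt, the documents, and the reserved reply budget all fit.
function fitsInContext(promptTokens: number, docTokens: number): boolean {
  return promptTokens + docTokens + RESERVED_FOR_REPLY <= MAX_CONTEXT;
}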
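
When the documents do not all fit, one option is to keep as many as the budget allows. The sketch below assumes documents arrive in priority order with precomputed token counts; the `Doc` shape and `selectDocsWithinBudget` are hypothetical names, and the defaults mirror the constants in the check above.

```typescript
interface Doc {
  id: string;
  tokens: number; // precomputed token count for this document
}

// Greedily keep documents, in the order given, while they still fit the budget.
function selectDocsWithinBudget(
  docs: Doc[],
  promptTokens: number,
  maxContext = 128_000,
  reservedForReply = 4_000,
): Doc[] {
  let remaining = maxContext - reservedForReply - promptTokens;
  const selected: Doc[] = [];
  for (const doc of docs) {
    if (doc.tokens <= remaining) {
      selected.push(doc);
      remaining -= doc.tokens;
    }
  }
  return selected;
}
```

A document that is too large is skipped rather than truncated, so a later, smaller document can still use the remaining space.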

Capabilities vs limits in product design:

Good fit:  draft emails, explain code, transform JSON, classify intent
Risky:     legal/medical advice without review, math without verification
Poor fit:  real-time facts, guaranteed factual accuracy, secret retrieval

Mitigate limits with RAG (retrieval-augmented generation), tool calls, and human review for high-stakes outputs.
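
As a rough illustration of the retrieval half of RAG, the sketch below ranks pre-embedded chunks by cosine similarity to a query embedding and builds a grounded prompt. It assumes the embeddings were already computed by some embedding model; the `Chunk` shape, function names, and prompt wording are all illustrative, not a specific library's API.

```typescript
interface Chunk {
  text: string;
  embedding: number[]; // assumed to come from an embedding model, not computed here
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Return the k chunks most similar to the query embedding.
function retrieveTopK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Put the retrieved text in the prompt so the model answers from it.
function buildPrompt(question: string, context: Chunk[]): string {
  const sources = context.map((c, i) => `[${i + 1}] ${c.text}`).join("\n");
  return `Answer using only the sources below.\n\n${sources}\n\nQuestion: ${question}`;
}
```

Numbering the sources makes it easier to ask the model to cite them, which in turn helps human reviewers check high-stakes answers.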

Common Mistakes

  • Stuffing entire codebases into one prompt instead of chunking or using RAG
  • Assuming the model "knows" your private data; it only sees what you send
  • Ignoring token costs at scale (long contexts × many users add up)
  • Treating outputs as ground truth without validation
  • Using the largest model when a smaller, fine-tuned model suffices
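
To see how token costs add up at scale, a back-of-the-envelope estimate can help. The sketch below is arithmetic only: every field is an assumption you supply, and the prices are placeholders to be replaced with your provider's actual per-million-token rates.

```typescript
interface UsageAssumptions {
  users: number;
  requestsPerUserPerDay: number;
  inputTokensPerRequest: number;
  outputTokensPerRequest: number;
  // Hypothetical prices in USD per million tokens; substitute real rates.
  inputPricePerMTok: number;
  outputPricePerMTok: number;
}

// Daily cost = requests × tokens per request × price per token, summed over input and output.
function estimateDailyCostUsd(u: UsageAssumptions): number {
  const requests = u.users * u.requestsPerUserPerDay;
  const inputCost =
    (requests * u.inputTokensPerRequest * u.inputPricePerMTok) / 1_000_000;
  const outputCost =
    (requests * u.outputTokensPerRequest * u.outputPricePerMTok) / 1_000_000;
  return inputCost + outputCost;
}
```

Because cost scales linearly with input tokens per request, trimming context (for example by retrieving only relevant chunks) usually has a larger effect than shortening replies.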

See Also

prompt-engineering, embeddings, rag-basics, ai-apis