Your AI Is Forgetting Things On Purpose — And That’s Kind of Genius

The hidden math trick that could make ChatGPT-style AI 10x cheaper to run, explained like you’re hearing about it over coffee

Here’s a weird question to start with: why does your AI assistant get slower and more expensive the longer you talk to it?

Ask it one question, it’s snappy. Paste in a 300-page contract and ask it to summarize, and suddenly it’s chugging, the bill goes up, and sometimes it just… refuses. “Context limit exceeded.”

That’s not a bug someone forgot to fix. It’s baked into the math of how almost every major AI model (GPT, Claude, Gemini, all of them) actually works under the hood. And for the last few years, a small but very serious group of researchers has been quietly building something that sidesteps the problem.

It’s not a new chatbot. It’s a different engine underneath the chatbot. Let’s talk about it like normal humans, no PhD required.

First, the “aha” moment: how does today’s AI even read a sentence?

Imagine you’re reading the sentence:

“The trophy didn’t fit in the suitcase because it was too big.”

To understand what “it” means, your brain quietly glances back at every other word in the sentence and decides: is “it” the trophy or the suitcase? (It’s the trophy, by the way — this is a classic linguistics example.)

The AI models you use today do something similar, and it’s called attention. When the model reads a new word, it looks back at every single word that came before it and asks, “how relevant are you to me right now?” Then it blends everything together based on that relevance.
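If you like seeing ideas as code, here’s a toy sketch of that “score, then blend” step. Everything in it is made up for illustration: the two-number “meanings” for each word, the query for “it”, and the function names. Real models learn these numbers and add extra steps (scaling, learned projections) that this sketch skips.

```python
import math

def softmax(scores):
    # Turn raw relevance scores into positive weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Score each earlier word: dot product of its key with the query.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # Blend the value vectors using those relevance weights.
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Invented 2-number "meanings" for three earlier words (illustrative only).
words = ["trophy", "suitcase", "big"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = keys
query_for_it = [2.0, 0.0]  # hypothetical: "it" points toward "trophy"

weights, blended = attend(query_for_it, keys, values)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

With these invented numbers, “trophy” ends up with the largest weight, which is the “how relevant are you to me right now?” question answered in arithmetic.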

This idea, popularized by the famous 2017 research paper “Attention Is All You Need,” which built an entire architecture (the Transformer) around it, is genuinely the reason modern AI got so good so fast. It’s why these models can hold a conversation, write code, and seem to actually “understand” context. Nobody’s arguing attention was a bad idea. It was a brilliant one.

But it has a catch. And the catch has a name mathematicians love to throw around: quadratic scaling.

The catch, explained without a single equation

Picture a group project meeting. If there are 4 people in the room, everyone can quickly chat with everyone else — that’s manageable, maybe 6 total conversations happening.

Now put 100 people in that room and ask everyone to individually talk to everyone else. That’s not 100 conversations. That’s nearly 5,000. Double the people, and the number of conversations doesn’t double — it roughly quadruples.
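For anyone who wants to check the meeting math, here’s a tiny sketch. The function name is just mine, and it counts each pair of people once.

```python
def conversations(people):
    # Every pair of people talks once: n * (n - 1) / 2 pairs.
    return people * (people - 1) // 2

for people in (4, 100, 200):
    print(people, "people ->", conversations(people), "conversations")

# Doubling the room from 100 to 200 multiplies the work by about four.
ratio = conversations(200) / conversations(100)
print(f"growth factor: {ratio:.2f}")
```

That gives 6 conversations for 4 people, 4,950 for 100, and 19,900 for 200, a growth factor of about 4.02.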

That’s exactly what happens inside a Transformer (the architecture behind GPT, Claude, Gemini, etc.) when you feed it more text. Every single word has to “compare notes” with every other word. Double your document length, and the computer doesn’t do twice the work — it does close to four times the work.

This is why:

A quick question feels instant.
A giant document feels sluggish and expensive.
There’s a hard ceiling on how much text you can even paste in at once (the “context window”).

Every time a company announces “now with a bigger context window!”, the underlying math hasn’t gone away. They’ve mostly thrown more powerful, more expensive hardware at it, plus engineering tricks that shrink the bill without changing how fast it grows. It’s a workaround, not a fix.

Enter the alternative: a model that never looks back

So what if, instead of re-reading everything every single time, the AI just… remembered the important stuff as it went along, like a person does?

That’s the intuition behind an approach called State Space Models, and specifically a version of them called Mamba. Yes, like the snake, a nod to how fast it is.

Here’s the difference in plain terms:

A Transformer is like a hiring committee that interviews all 50 candidates at once, in the same room, comparing everyone to everyone before making a decision.
A State Space Model is like a single hiring manager who interviews candidates one at a time, updates their mental notes after each interview, and never needs to go back and re-interview anyone. By candidate 50, they just know who was good, without re-checking the first 49.

That “mental notepad” the hiring manager keeps is called the hidden state — a small, fixed-size summary that gets updated with each new piece of information. It doesn’t grow as the conversation grows. That’s the whole trick. And because it doesn’t need to keep looking backward, the compute cost grows linearly — double the text, roughly double the work, not quadruple it.
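Here’s a minimal sketch of the notepad idea, assuming the simplest possible version: the notepad is a single number, and the `keep` and `write` settings are values I picked by hand. Real state space models use a larger (but still fixed-size) state and learn their settings during training.

```python
def run_notepad(inputs, keep=0.9, write=0.1):
    # The "notepad" is one number here; real models use a vector,
    # but its size stays fixed no matter how long the input is.
    state = 0.0
    steps = 0
    for x in inputs:
        # Fade the old notes a little, then add the new information.
        state = keep * state + write * x
        steps += 1
    return state, steps

short_text = [1.0] * 1000
long_text = [1.0] * 2000

_, short_steps = run_notepad(short_text)
_, long_steps = run_notepad(long_text)
print(short_steps, long_steps)  # twice the text, twice the update steps
```

Notice there is no “look back at everything” step anywhere in the loop. Each new item costs one update, so the work grows in step with the length of the text.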

State Space Models themselves aren’t new — the core math actually comes from 1960s control theory (think spacecraft trajectory calculations, not chatbots). What changed in 2023 was a breakthrough from researchers Albert Gu and Tri Dao: they made the model’s “notepad” selective — smart enough to decide what’s actually worth remembering and what’s safe to ignore, the same way you’d tune out filler words like “um” and “the” but lock onto a name or a date.

Think about how you read a mystery novel. You don’t consciously re-read every previous page every time you turn to a new one — but somehow, when the killer is revealed on page 300, your brain instantly recalls the clue from page 12. Your brain compressed the relevant stuff into memory and let the rest fade. That’s roughly the philosophy behind selective state space models.
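A rough sketch of what “selective” means in code, with one big assumption: I’m using a hand-written rule (`toy_importance`, a name I made up) to decide what matters. In the real models that decision is learned from data and depends on the input in a much richer way.

```python
def selective_notepad(tokens, importance):
    # importance(token) returns a number between 0 and 1:
    # 0 means "ignore this", 1 means "overwrite the notes with this".
    state = 0.0
    for token in tokens:
        gate = importance(token)
        state = (1.0 - gate) * state + gate * token
    return state

def toy_importance(token):
    # Hypothetical rule: large values are clues, small values are filler.
    return 1.0 if abs(token) >= 5.0 else 0.0

# One important value (7.0) buried in filler.
tokens = [0.1, 0.2, 7.0, 0.1, 0.3, 0.2]
print(selective_notepad(tokens, toy_importance))
```

The notepad ends up holding 7.0: the clue was written down once, and the filler that came before and after never touched it.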

So… does this mean Transformers are finished?

Here’s where a lot of hyped-up articles online get it wrong, so let’s be straight about it: no, not even close, at least not yet.

As of mid-2026, here’s the honest state of play:

For everyday chat, coding help, and typical AI-assistant tasks, Transformer-based models (like the ones powering most chatbots today) are still generally sharper and more capable. Attention keeps the full text on hand and can look up any exact detail, while a fixed-size notepad has to compress and can lose specifics. Right now nothing fully matches attention’s quality across the board.
Where State Space Models genuinely shine is in situations with enormous amounts of sequential information: analyzing entire genomes in biology research, processing long streams of audio in real time, running AI on small devices like phones or wearables where memory is tight, or handling agents that need to “remember” huge amounts of history without the cost exploding.
The most promising direction right now isn’t “replace Transformers entirely” — it’s hybrid models. Think of it like a hybrid car: use the efficient state-space engine for the long, steady highway stretches, and switch to the powerful (but thirstier) attention engine for the tricky parts that need sharp focus. Some research labs have already released hybrid architectures along these lines, mixing both approaches in the same model to get the best of both.
Newer generations of pure state-space models (researchers have continued iterating past the original Mamba, into “Mamba-2” and beyond through 2026) have closed a lot of the quality gap and shown real speed advantages on long-sequence tasks specifically — which is genuinely exciting and worth watching.
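To get a feel for why hybrids are appealing, here’s a deliberately crude cost model. The layer counts, the 3-to-1 mix, and the “cost units” are all invented for illustration. Real hybrid models pick their own ratios, and real costs include constants this ignores.

```python
def attention_cost(n):
    # Toy model: every token compares notes with every token.
    return n * n

def state_space_cost(n):
    # Toy model: one notepad update per token.
    return n

def stack_cost(layers, n):
    # layers is a list like ["ssm", "ssm", "ssm", "attention"].
    return sum(
        attention_cost(n) if kind == "attention" else state_space_cost(n)
        for kind in layers
    )

n = 10_000  # tokens in the input
pure_attention = ["attention"] * 8
hybrid = ["ssm", "ssm", "ssm", "attention"] * 2  # hypothetical 3-to-1 mix

print(stack_cost(pure_attention, n))
print(stack_cost(hybrid, n))
```

In this toy setup the hybrid stack does about a quarter of the work of the all-attention stack, and it still has a couple of attention layers for the parts that need sharp focus.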

So the honest takeaway isn’t “everything you know is obsolete by next week.” It’s: a serious, credible alternative to the dominant approach now exists, it’s improving fast, and it’s already changing how engineers think about designing AI systems for very long inputs. That’s a big deal even without the clickbait.

Why should a non-engineer even care about this?

Fair question. Here’s why this quietly affects you even if you never write a line of code:

1. Cheaper AI tools, eventually. A huge chunk of what you pay for (or what makes free AI tools unsustainable) is the compute cost of running these models. Architectures that use less compute for the same task mean cheaper subscriptions and more generous free tiers down the line.

2. AI that works without the internet. Because these leaner architectures can run efficiently on regular hardware, they open the door to genuinely useful AI running directly on your phone or laptop — no sending your data to a server, no lag, works on a plane.

3. AI that can actually read the whole thing. Ever had an AI “forget” something you told it earlier in a long chat? Architectures built for efficient long-range memory are a big part of the fix for that specific annoyance.

4. It’s a healthy sign for the field. For years, it felt like “bigger Transformer, more GPUs” was the only playbook anyone had. Serious alternatives emerging means researchers are still finding genuinely new ideas — not just scaling the same idea up. That’s good for innovation, and good for prices eventually coming down instead of only going up.

The one-paragraph version, if you skimmed all of that

Today’s AI reads text by comparing every word to every other word, which gets expensive fast as text gets longer — like a meeting where everyone has to individually talk to everyone else. A newer approach called State Space Models (with Mamba as the best-known example) instead keeps a compact running memory and updates it as it goes, like a person taking notes instead of re-reading everything from scratch. It’s not replacing today’s AI outright, but it’s fast, efficient, especially good at handling huge amounts of information, and it’s pushing the whole field toward smarter, cheaper hybrids. Keep an eye on this one.

Did this actually make sense?

If this helped the whole “attention vs. state space” thing finally click, drop a comment below. I genuinely want to know which part landed and which part I should explain differently next time. And if you’re into AI concepts explained without the jargon-wall, follow along; I’m writing more of these.
