Markov Decision Processes Explained With Games (Without Heavy Math)

If you have ever tried to make an agent “play well” in a game, you have already bumped into the core problem that Markov Decision Processes (MDPs) were built for: making good choices when the next moment depends on what you do now. That sounds obvious, but it becomes powerful once you treat it as a design pattern. You stop thinking in terms of “clever tricks” and start thinking in terms of state, action, feedback, and long-run payoff, a framing that matters not only in games but also in areas like corporate risk management practices.
The nice part is that you do not need heavy math to get real value from MDP thinking. You just need to be precise about what the agent knows, what it can do, what happens next, and what “good” means. Once those pieces are clear, most of the standard ideas, like policies, value, planning, and learning, start to feel natural.
A card-table view of “state, action, reward”
A good way to internalize an MDP is to pick a game where each decision is small, but the outcome is meaningful. A classic card table fits that perfectly because each choice is local, yet it changes the rest of the hand.
In Blackjack, whether you play at Thunderpick, where dozens of variations exist, or on simpler platforms, you can describe the situation at any moment as a “state” that captures what matters for the next decision. In a simple version, that state can be your current total, whether you have a usable ace, and the dealer’s visible card. From there, your “actions” are the choices you can take right now: draw another card (hit), stop (stand), or use special moves when available. The “reward” is the result at the end of the hand, expressed in whatever score you care about. Most of the time, intermediate steps have no reward at all, and that is fine. The payoff arrives later, which is exactly the kind of delay MDPs are designed to handle.

This “decision dashboard” shows a Markov Decision Process in one Blackjack moment: the state (player total, dealer upcard, usable ace) leads to an action (hit or stand), and the reward comes from the hand’s final outcome.
The image was created specifically for this article.
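To make that concrete, here is one way those three ingredients might be written down. This is only a sketch: the class and field names are ours, and the win/push/loss values of +1, 0, and -1 are just a common convention, not tied to any particular table or platform.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class BlackjackState:
    """What matters for the next decision in a simplified hand."""
    player_total: int    # current hand total
    usable_ace: bool     # an ace currently counted as 11
    dealer_upcard: int   # the dealer's visible card


class Action(Enum):
    HIT = "hit"      # draw another card
    STAND = "stand"  # stop and let the dealer play


# Rewards arrive only at the end of the hand; intermediate steps pay 0.
REWARD_WIN, REWARD_PUSH, REWARD_LOSS = 1.0, 0.0, -1.0
```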
The “Markov” part is the idea that the future depends on the present state, not the full history. In this card game, that is mostly true if your state is rich enough. If you include what is left in the shoe, then the next-card chances are fully determined by the current state. If you do not, you can still build a useful model by treating draws as coming from a stable distribution. That kind of choice, between a richer state and a simpler one, shows up everywhere in real systems.
This is also why the setup works so well in online blackjack. Many such experiences make the relevant information easy to read and consistent from hand to hand, which helps you decide what belongs in the state. Across many blackjack platforms, the core loop stays the same: observe the state, pick an action, transition to a new state, and eventually receive a final outcome.
Once you see it that way, a “policy” is just your decision rule: given this state, take that action. A “value” is the long-run quality of being in a state if you follow a policy from there. That is why the game of Blackjack is such a clean anchor example for MDPs. It turns “strategy” into a concrete mapping from situations to choices, with the payoff defined in a clear, repeatable way.
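Here is a minimal sketch of that loop in code. It assumes an infinite shoe (so draws come from a stable distribution), even-money payouts, and no splits or doubles, and the hit-below-17 rule is only an example of a decision rule, not a recommended strategy.

```python
import random

# Infinite-shoe assumption: every draw comes from the same distribution,
# so the current state fully determines the next-card chances.
CARDS = [2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11]  # 10/J/Q/K count as 10, ace as 11


def draw():
    return random.choice(CARDS)


def hand_total(cards):
    """Best total, counting aces as 11 where possible without busting."""
    total, aces = sum(cards), cards.count(11)
    while total > 21 and aces:
        total, aces = total - 10, aces - 1
    return total


def policy(player_total, dealer_upcard):
    """A decision rule: given the state, return an action."""
    return "hit" if player_total < 17 else "stand"


def play_hand():
    """Observe the state, pick an action, transition, until the final reward."""
    player, dealer = [draw(), draw()], [draw(), draw()]
    while policy(hand_total(player), dealer[0]) == "hit":
        player.append(draw())
        if hand_total(player) > 21:
            return -1.0                     # bust: the only reward arrives at the end
    while hand_total(dealer) < 17:          # the dealer follows a fixed rule
        dealer.append(draw())
    p, d = hand_total(player), hand_total(dealer)
    if d > 21 or p > d:
        return 1.0
    return 0.0 if p == d else -1.0


# "Value" here is the long-run average payoff of following this policy.
n = 100_000
print(f"Estimated value under this policy: {sum(play_hand() for _ in range(n)) / n:+.3f}")
```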
Why size changes everything, even when the idea stays the same
The basic MDP story does not change as you move from small games to large ones. What changes is whether you can afford to represent everything explicitly.
In small settings, you can often store a value for every state, update it, and converge to a strong policy with straightforward dynamic programming. But as soon as the number of distinct positions explodes, you have to compress, sample, or approximate.
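In that small-setting regime, the whole method fits in a few lines. The two-state MDP below is made up purely for illustration; the point is the “store a value for every state and keep backing it up” pattern, not any particular game.

```python
# Tabular value iteration on a tiny, made-up MDP (two states, two actions).
# transitions[state][action] -> list of (probability, next_state, reward)
transitions = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
gamma = 0.9                                    # how much future reward counts
values = {state: 0.0 for state in transitions}

# Repeatedly back up every state's value until the numbers stop moving.
for _ in range(10_000):
    new_values = {
        state: max(
            sum(p * (r + gamma * values[nxt]) for p, nxt, r in outcomes)
            for outcomes in actions.values()
        )
        for state, actions in transitions.items()
    }
    delta = max(abs(new_values[s] - values[s]) for s in values)
    values = new_values
    if delta < 1e-9:
        break

# The policy then just picks the best-looking action in every state.
policy = {
    state: max(
        actions,
        key=lambda a: sum(p * (r + gamma * values[nxt]) for p, nxt, r in actions[a]),
    )
    for state, actions in transitions.items()
}
print(values)
print(policy)
```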
Here is a sense of scale from well-studied games:
| Game | Rough number of distinct positions | What that implies for MDP-style methods |
|---|---|---|
| Connect Four | 4,531,985,219,092 | Too large for “store everything,” but structured enough for smart search and compact representations |
| Chess | ~10^43 positions | Brute-force planning is out of reach; value needs heuristics and learned approximations |
| Go | >10^170 positions | You must generalize from patterns; sampling and function approximation are central |
The Connect Four position count appears in modern systems work that uses it to reason about storage and hashing. And the Go scale is widely cited in technical writing about computer Go, including in Communications of the ACM.

The chess estimate is spelled out directly in Shannon’s early computer-chess paper, including the “roughly 10^43” scale for possible positions.
The takeaway is simple: MDPs give you a stable way to think, but the representation is the real fight. Once state counts get huge, you stop asking “What is the exact value of every state?” and start asking “What compact signal lets me choose well most of the time?”
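One common answer is to score states with a handful of features and learned weights instead of a table with one entry per state. The sketch below is deliberately bare-bones, and the specific features are invented for illustration, not taken from any real engine.

```python
import numpy as np


class LinearValue:
    """Score positions with a few features and weights, not one table entry per state."""

    def __init__(self, n_features: int):
        self.weights = np.zeros(n_features)

    def value(self, features: np.ndarray) -> float:
        # The "compact signal": a weighted sum of hand-picked or learned features.
        return float(self.weights @ features)

    def update(self, features: np.ndarray, observed_return: float, lr: float = 0.01) -> None:
        # Nudge the weights so the estimate moves toward what actually happened.
        error = observed_return - self.value(features)
        self.weights += lr * error * features


# Toy usage: three invented features of a position (say, material edge,
# mobility, king safety in a chess-like game), plus the game's final result.
v = LinearValue(n_features=3)
position_features = np.array([0.5, 0.2, -0.1])
v.update(position_features, observed_return=1.0)   # this position led to a win
print(v.value(position_features))
```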
Defining “state” so the future becomes predictable
If there is one practical lesson that transfers from games to everything else, it is that the hardest part is usually not the algorithm. It is deciding what the agent should treat as the state.
An MDP assumes that the state summarizes what matters for predicting what happens next and for scoring outcomes. In perfect-information board games, the full board is an obvious state. In many interactive problems, you only see part of what is going on, or the world is too detailed to track directly. Then you build a state out of features, short history windows, or summaries that are “good enough” for decision making.
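A simple version of that construction is to treat a short window of recent observations as the state. The sketch below is generic, and the observation values are made up for illustration.

```python
from collections import deque


class HistoryState:
    """Use the last k observations as the state when a single snapshot is not enough."""

    def __init__(self, k: int = 4):
        self.window = deque(maxlen=k)

    def observe(self, observation):
        self.window.append(observation)
        # The tuple of recent observations is what the agent conditions on.
        return tuple(self.window)


state = HistoryState(k=3)
for obs in ["enemy_left", "enemy_center", "enemy_right"]:
    current_state = state.observe(obs)
print(current_state)   # ('enemy_left', 'enemy_center', 'enemy_right')
```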
Checking whether your state really works
This is also why benchmarks often use many different environments: they stress-test whether your state design and learning method generalize, not just whether they solve one narrow task. For example, a common evaluation suite for classic video game tasks uses:
- A canonical set of 57 games.
- The same evaluation rules across that set.
- The idea that “works on one game” is a much weaker claim than “works across the set”.
A clean way to connect this back to the core idea is how classic texts describe the planning side of MDPs. Sutton and Barto put it bluntly when explaining what dynamic programming is doing: “compute optimal policies given a perfect model of the environment as a Markov decision process (MDP).”
In practice, you rarely have a perfect model, and your “state” is rarely perfect either. But games teach the right instinct:
- Make your state explicit.
- Ask whether it is informative enough that the next step feels predictable (a rough check is sketched after this list).
- If it is not, improve the state before you blame the learning algorithm.
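One rough way to run that check, assuming you have a log of (state, outcome) pairs from past episodes, is to group outcomes by state and see how much they disagree. The helper below is a sketch of that idea, with made-up data.

```python
from collections import defaultdict
from statistics import pstdev


def outcome_spread(logged_episodes):
    """For each state, how much do the observed final outcomes disagree?"""
    by_state = defaultdict(list)
    for state, outcome in logged_episodes:
        by_state[state].append(outcome)
    return {
        state: pstdev(outcomes) if len(outcomes) > 1 else 0.0
        for state, outcomes in by_state.items()
    }


# Made-up log of (state, final outcome) pairs from past hands.
log = [
    (("total=16", "dealer=10"), -1.0),
    (("total=16", "dealer=10"), +1.0),
    (("total=20", "dealer=6"), +1.0),
    (("total=20", "dealer=6"), +1.0),
]
print(outcome_spread(log))
```

In a genuinely random game, some spread is unavoidable even with a perfect state, so the useful comparison is between candidate state designs, not against zero.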
