
QA Teams: Before You Touch an LLM API, Check Your Foundation 🛑
Everywhere, teams are rushing into LLMs to generate tests. They spend months, budgets, and energy. And many fail.
Not because the AI is bad. But because they are trying to build a skyscraper on a shaky foundation. 🏚️
I’ve guided several teams through this exact rollout.
The playbook is always the same: you buy tokens, plug in a generator, pump out hundreds of tests, watch code coverage skyrocket, and celebrate on Slack.
Then the first critical bug hits production, and the AI-generated tests miss it entirely. Because they covered the code, not the business logic. 🎯
They learned the hard way what I am about to tell you here.
Before you augment your QA practice with anything — LLMs, codegen, semantic analysis — you must validate 7 structural invariants. It’s tedious, it’s unglamorous, and it won’t make for pretty slides for the C-suite.
But without it, AI will just amplify your existing problems. Faster, and at a higher cost. 💸
💡 AI amplifies what is already there, good or bad. Before you automate, check whether your test suite is actually worth building on. Be honest when you score yourself: below 5/7, you are not ready for the next level.
The 7 Structural Invariants 🛠️
1. A Written Test Strategy 📝
Not a slide deck. Not a shared Google Doc lost in a Drive folder. A versioned file sitting right in the repository that everyone can access and amend.
At a company I used to work for, the strategy only lived in the Lead QA’s head. He went on vacation, a critical PR landed, and nobody knew what needed to be tested at the integration vs. unit level. We merged blind, broke production, and spent Friday night pushing hotfixes.
- The Signal 🚨: If a new dev cannot find the test strategy in the repo within 5 minutes, it does not exist.
2. Requirement Traceability 🔗
In that same team, we boasted 94% code coverage. We felt invincible. Then a minor change in the tax calculation logic triggered a major failure in production. The code was executed by the tests (so coverage stayed green), but the actual expected behavior was never verified.
- The Signal 🚨: Without strict traceability between your specs and your tests, coverage metrics are just noise. Every test must trace back to a requirement; every bug must trace back to a test gap.
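To make the tax example concrete, here is a minimal Python sketch of the gap between executing code and verifying it. Everything named here is hypothetical and invented for illustration: the `compute_tax` functions, the 20% rate, and the requirement ID `REQ-TAX-001`.

```python
from decimal import Decimal

# Hypothetical rule standing in for requirement REQ-TAX-001: tax is 20% of the amount.
TAX_RATE = Decimal("0.20")


def compute_tax(amount: Decimal) -> Decimal:
    """Intended behavior: 20% of the amount, rounded to cents."""
    return (amount * TAX_RATE).quantize(Decimal("0.01"))


def compute_tax_regressed(amount: Decimal) -> Decimal:
    """A 'minor change' that silently breaks the rule (rate applied twice)."""
    return (amount * TAX_RATE * TAX_RATE).quantize(Decimal("0.01"))


def coverage_only_test(tax_fn) -> bool:
    """Executes the code under test but checks nothing about the result."""
    tax_fn(Decimal("100.00"))
    return True  # stays green whatever tax_fn returns


def traced_test(tax_fn) -> bool:
    """Traces to REQ-TAX-001: 100.00 must yield exactly 20.00 of tax."""
    return tax_fn(Decimal("100.00")) == Decimal("20.00")
```

Both functions are fully "covered" by `coverage_only_test`, yet only `traced_test` fails on the regressed version. That failing test is the one you can point to when someone asks which requirement broke.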
3. A Sustainable Test Pyramid 📐
I once saw a startup with 400 E2E tests and barely 12 unit tests. The classic inverted ice cream cone 🍦. The full suite took 45 minutes in CI. Devs would trigger the pipeline, grab a coffee, come back, see a random flaky failure, and hit retry. They eventually disabled half the tests because they were too unstable.
- The Signal 🚨: Your unit tests must outnumber your integration and E2E tests, and they must be the cheapest to run. If your suite is too slow and too expensive, it will die.
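A rough way to check the shape is to count tests per level and compare. The sketch below is an assumption about how you might do it, not a standard tool: `pyramid_problems` is a hypothetical helper, it treats "outnumber" as a separate comparison against each level, and it looks only at counts, not execution time.

```python
def pyramid_problems(unit: int, integration: int, e2e: int) -> list[str]:
    """Return shape problems; an empty list means unit tests dominate by count."""
    problems = []
    if unit <= integration:
        problems.append(f"unit ({unit}) does not outnumber integration ({integration})")
    if unit <= e2e:
        problems.append(f"unit ({unit}) does not outnumber E2E ({e2e})")
    return problems


# The startup from the story: 12 unit tests, 400 E2E tests.
# The integration count was not given, so 0 is a placeholder.
startup = pyramid_problems(unit=12, integration=0, e2e=400)
```

Run it against your own counts in CI and fail the job when the list is not empty, so the cone cannot invert quietly.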
4. A Shared Definition of Done (DoD) 🤝
If your DoD stops at “code is merged,” you have a systemic issue. Every release turns into a painful negotiation between dev and QA to figure out what was actually validated.
- The Signal 🚨: “Done” means the test criteria have passed. If Dev and QA cannot agree on what “Done” means, you cannot automate your merge gates cleanly.
5. CI Discipline 🏗️
A continuous integration pipeline that you bypass whenever “you’re in a rush” gives you only an illusion of security. I’ve seen teams look past red builds under the pretense that the failing test was a “known flake.” The result? Major regressions slipped through unnoticed amidst the noise.
- The Signal 🚨: A red build blocks the pipeline. A green build is mandatory to merge. CI is an engineering gate, not decoration.
6. Reproducible Test Data 🔄
Nothing kills confidence faster than a test that passes locally but fails in CI. Or fails depending on execution order. Or passes only because a shared staging database was modified behind the scenes by another script. 👻
- The Signal 🚨: Tests must own their data. No shared mutable state. If your suite cannot run identically in any random order, you cannot trust your tests.
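Here is a minimal Python sketch of what "tests own their data" can look like. The account factory and the two checks are hypothetical examples. The point is that each check builds its own fresh state, so a shuffled run gives the same results every time.

```python
import random


def make_account(balance: int = 100) -> dict:
    """Factory: every check gets its own fresh account, never a shared one."""
    return {"balance": balance}


def check_withdraw() -> bool:
    account = make_account()
    account["balance"] -= 30
    return account["balance"] == 70


def check_deposit() -> bool:
    account = make_account()
    account["balance"] += 50
    return account["balance"] == 150


def run_in_random_order(checks, seed: int) -> dict[str, bool]:
    """Run the checks in a seeded, shuffled order and collect results by name."""
    order = list(checks)
    random.Random(seed).shuffle(order)
    return {check.__name__: check() for check in order}


CHECKS = [check_withdraw, check_deposit]
```

If both checks mutated one module-level account instead, the outcome would depend on which ran first. That is exactly the ghost described above.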
7. Efficient Bug Triage 📋
An untriaged backlog of 300 bugs is invisible technical debt. You lose track of what is high priority, what is obsolete, and what is going to kill your business tomorrow.
- The Signal 🚨: Every open bug must have an owner, a clear severity, and an active status. If your backlog is a graveyard, you cannot effectively prioritize what deserves to be tested.
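The triage signal is easy to check mechanically once you export your backlog. The sketch below assumes a hypothetical `Bug` record, and the severity scale and status names are invented placeholders to replace with your tracker's own values.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values; substitute your tracker's own vocabulary.
SEVERITIES = {"critical", "major", "minor"}
ACTIVE_STATUSES = {"new", "in_progress", "blocked"}


@dataclass
class Bug:
    key: str
    owner: Optional[str]
    severity: Optional[str]
    status: Optional[str]


def triage_gaps(bug: Bug) -> list[str]:
    """List which of the three triage fields are missing or invalid."""
    gaps = []
    if not bug.owner:
        gaps.append("owner")
    if bug.severity not in SEVERITIES:
        gaps.append("severity")
    if bug.status not in ACTIVE_STATUSES:
        gaps.append("status")
    return gaps


def untriaged(bugs: list[Bug]) -> dict[str, list[str]]:
    """Map each incompletely triaged bug key to its missing fields."""
    report = {}
    for bug in bugs:
        gaps = triage_gaps(bug)
        if gaps:
            report[bug.key] = gaps
    return report
```

An empty report means every open bug has an owner, a severity, and an active status. Anything else is your graveyard, listed by name.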
The Remediation Track: What to do if you score < 5/7 🚑
If you don’t hit the baseline, forget about magic tools and autonomous agents for now. A messy suite will simply pollute your signal and waste your compute budget. You need to fix the foundations first.
We’ve battle-tested this approach with teams in crisis. It’s unglamorous, it takes time, but it works.
⚠️ The Central Anti-pattern: Jumping straight to Phase 4 (gap filling) without Phase 2 (risk mapping) means generating tests for dead features while completely missing the true high-risk breaking points.
- Phase 1: Inventory (Sequential, Mandatory) 🗂️
  - Map out everything you currently have: tests, requirements, recent bugs, hot zones. No action, no fixes yet, just a map. If you skip this, you are troubleshooting blind.
- Phase 2: Risk / Coverage Mapping (HARD GATE) 🗺️
  - Cross-reference your inventory with business risk to build a heatmap. Which zones are over-tested (candidates to prune)? Which critical zones are bare? This is the only hard gate of the track. Nothing downstream is valid until this map is locked down.
- Phase 3: Debt Reduction (Parallelizable) 🧹
  - Once the map is validated, clean up in parallel. Deduplicate identical tests, quarantine or fix flakes, refactor fragile selectors, and strengthen weak assertions.
- Phase 4: Targeted Gap Filling (Sequential, Requires P2 GREEN) 🎯
  - Write tests only for the gaps identified in Phase 2. This is where AI can help you write faster, but with one condition: generated tests are just stubs. Human review is mandatory before any merge. AI hallucinates specs; review saves you.
- Phase 5: Manual Offload (Ongoing, Parallelizable) 🤖
  - Identify high-value, repetitive manual regression tests and convert them into automated suites (via codegen or session recordings). This runs in the background without blocking regular sprint delivery.
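The Phase 2 cross-reference can be sketched in a few lines of Python. This is one possible shape for the heatmap under stated assumptions: the `Zone` record, the 1-to-5 risk scale, and every threshold below are hypothetical and should be set by your own team.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    business_risk: int  # assumed scale: 1 (low) to 5 (critical)
    test_count: int     # tests found during the Phase 1 inventory


def classify(zone: Zone, high_risk: int = 4, low_risk: int = 2, many_tests: int = 50) -> str:
    """Label a zone 'bare', 'over-tested', or 'ok' using illustrative thresholds."""
    if zone.business_risk >= high_risk and zone.test_count == 0:
        return "bare"
    if zone.business_risk <= low_risk and zone.test_count >= many_tests:
        return "over-tested"
    return "ok"


def heatmap(zones: list[Zone]) -> dict[str, str]:
    """Classify every zone, listing the riskiest zones first."""
    ranked = sorted(zones, key=lambda z: z.business_risk, reverse=True)
    return {z.name: classify(z) for z in ranked}
```

Zones labeled "bare" feed Phase 4, and zones labeled "over-tested" are pruning candidates for Phase 3. The labels are only as good as the risk scores, so those should come from people who know the business.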
Exit Condition 🚪: You leave the remediation tunnel and move into advanced automation (Level L1) only when your baseline check scores at least 5/7 on the core invariants.
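The exit check is simple arithmetic: one point per invariant, with a pass mark of 5 out of 7. A minimal Python sketch follows. The invariant keys and function names are hypothetical labels for the seven items above, not part of any published tool.

```python
INVARIANTS = (
    "written_test_strategy",
    "requirement_traceability",
    "sustainable_test_pyramid",
    "shared_definition_of_done",
    "ci_discipline",
    "reproducible_test_data",
    "efficient_bug_triage",
)
PASS_THRESHOLD = 5  # the post's bar: at least 5 of the 7 invariants


def baseline_score(answers: dict[str, bool]) -> int:
    """Count satisfied invariants; refuse to score an incomplete self-assessment."""
    missing = set(INVARIANTS) - set(answers)
    if missing:
        raise ValueError(f"unanswered invariants: {sorted(missing)}")
    return sum(bool(answers[name]) for name in INVARIANTS)


def ready_for_l1(answers: dict[str, bool]) -> bool:
    """True when the team may leave the remediation track."""
    return baseline_score(answers) >= PASS_THRESHOLD
```

Raising on unanswered items is a deliberate choice. An invariant nobody dared to answer should not quietly count as a pass or a fail.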
Conclusion 🏁
AI does not fix your foundations; it exposes them.
If your test suite is healthy, AI will give you velocity superpowers 🦸.
If it’s broken, you will just industrialize the production of false positives and developer frustration.
Level L0, the foundation work described here, is the most thankless tier of augmented QA. But it’s the only one that separates teams that ship with confidence from those that spend their weekends managing rollbacks. 🛑
We’ve mapped out this entire framework along with checklists for every phase in our public methodology at ia-qa.com/method#l0. You can use it to self-assess your team in 2 minutes and pinpoint your actual priorities. 👇
Shares, likes, and comments are always appreciated :)
