Spec-Driven Development with AI Coding Agents

For about a year, the demo that sold everyone on AI coding was the same: type a sentence, watch a working app appear. It was genuinely impressive, and it was also misleading. Generating a to-do app from a prompt is not the same skill as building and maintaining a real system, and the gap between the two is where most teams have been quietly struggling.

The honest version of the story is this. AI coding agents are extraordinary at producing code. They are far less reliable at producing the right code, consistently, across a large codebase, over time. When you prompt your way through a single function, a wrong guess costs you a minute. When you prompt your way through an entire system, wrong guesses compound. Each plausible but incorrect decision becomes a foundation the next decision is built on, and by the time you notice, the drift is structural rather than local.

Spec-driven development is the discipline that closes this gap. It is not a prompting trick and it is not limited to small features. It is a way of building complete software systems where a written specification, not the code, is the source of truth, and where AI agents derive the implementation from that spec. This idea moved from niche to mainstream over 2025 and 2026, and it is now the default methodology serious teams reach for when they want agents to build something larger than a snippet.

Vibe Coding Has a Ceiling

The unstructured approach to AI coding has a name now: vibe coding. You open a conversation with an agent, describe what you want, and iterate, prompt by prompt, until something works. For prototypes and throwaway scripts, this is fantastic. It is fast, it is fun, and it lowers the barrier to building almost to zero.

It also hits a wall, and the wall is predictable. As a project grows, three failure modes show up together.

The first is drift. The agent produces code that looks correct and reads correctly but quietly solves a slightly different problem than the one you had in mind. Nobody grounded the work in a real specification, so the agent filled the ambiguity with its best guess, and its best guess was confident enough to pass review.

The second is inconsistency. Ask an agent to build ten endpoints across ten separate conversations and you will get ten slightly different opinions about error handling, naming, validation, and structure. There is no shared standard the agent is held to, so each interaction reinvents conventions the last one already chose.

The third is decay. Without a durable artifact describing intent, the only record of what the system is supposed to do is the code itself. When requirements change, you patch the code, the original intent gets harder to reconstruct, and the documentation, if it ever existed, rots. The codebase becomes the kind of place where nobody can answer why something works this way without archaeology.

These are not flaws in the model. They are flaws in the process. The agent did exactly what was asked. The problem is what was asked, and how little of it was written down.

The Inversion: Code Becomes a Build Artifact

The central idea of spec-driven development is a reversal of what we normally treat as primary.

In traditional development, code is the source of truth and the specification, if one exists, is a document that gets written, approved, and then abandoned. The code drifts away from the spec almost immediately, and within weeks the spec describes a system that no longer exists.

Spec-driven development flips this. The specification becomes the durable, version-controlled source of truth, and the code becomes something closer to a build output: derived from the spec, regenerable from the spec, and validated against the spec. When requirements change, you do not patch the code first. You change the spec and regenerate the parts of the implementation it governs. This is much closer to how a compiler treats source code than to how teams have traditionally treated design documents. The spec is the source; the code is the compilation.

This reframing is what makes whole-system AI development tractable. An agent cannot hold an entire large codebase in its head, and neither can you. But an agent can hold a well-structured specification in context, and it can use that spec to generate any individual part correctly because the spec defines how that part relates to everything else. The spec becomes the shared memory that keeps every piece of generated code consistent with every other piece, even across many separate agent sessions and many different agents.

A Workflow for Systems, Not Snippets

The thing that distinguishes serious spec-driven development from writing a good prompt is that it decomposes the act of building into distinct phases, each producing an artifact the next phase consumes. The most widely adopted version of this, popularized by GitHub's open-source Spec Kit and mirrored in tools like AWS Kiro and others, runs through roughly five stages. You can run this entire flow inside a coding agent, and at system scale you run it deliberately.

1. The constitution: rules the whole system obeys

Before any specific feature exists, you establish the ground rules the entire project must follow. This is a persistent document, sometimes literally called a constitution, that every later phase references. It is where you encode the non-negotiables: the architectural style, security requirements such as mandatory input validation, testing standards, documentation expectations, and any conventions that must hold everywhere.

This matters enormously at system scale, because it is the mechanism that solves the inconsistency problem. Every endpoint, every module, every agent session inherits the same constitution. The tenth thing the agent builds obeys the same rules as the first.

2. Specify: the what and the why

Next you describe what the system should do and why, deliberately staying out of implementation detail. Who are the users? What are the core capabilities? What are the success criteria? What is explicitly out of scope? At this stage you are writing a contract for behavior, not choosing a database.

For an entire system, this is not one document but a structured set of them: a high-level product specification, and then a specification per major capability or user story. The discipline is to resolve ambiguity here, in prose, where it is cheap to fix, rather than discovering it later in generated code where it is expensive.

3. Plan: architecture, data models, and contracts

With the what settled, you direct the agent to produce the how. This is where the system takes architectural shape, and it produces the artifacts that whole-system development actually depends on: the technology decisions and the reasoning behind them, the data model with its entities and relationships, and the interface contracts, such as OpenAPI, GraphQL, and message schemas, that define how components talk to each other.

These contract artifacts are the load-bearing structure of a multi-component system. When the data model and the service contracts are defined in the spec, every part the agent generates downstream is built to fit them. This is how you get twenty services that actually interoperate instead of twenty services that each made reasonable but incompatible assumptions.

4. Tasks: decompose into an ordered plan of work

The plan is then broken into concrete, individually verifiable tasks. Good decomposition here is not a flat list. Tasks are grouped by user story or capability, ordered to respect dependencies, and marked for which can proceed in parallel.

This phase is what turns a system-sized ambition into something an agent can execute without losing the plot. Instead of asking an agent to build the platform in one overwhelming request, you have a dependency-ordered backlog of bounded units, each small enough to implement and review with confidence.

5. Implement: phased, incremental, verified

Only now does code get written, and the single most important rule at system scale is to resist generating everything at once. Implement in phases. Build the core, validate that it works, then add capabilities incrementally. Practitioners who have built real systems this way are consistent on this point: generating an entire project in one shot produces a pile of code that is impossible to review and therefore impossible to trust. Small, validated increments are what keep quality high as scope grows.

Throughout implementation, your role is not to type. It is to verify at each checkpoint before letting the agent move to the next. The spec gives you something concrete to verify against, which turns review from a vague question into a precise one: does this match the contract?

Why This Is What Makes Whole-System AI Development Work

It is worth being explicit about why the structure matters more as systems get bigger, rather than less.

An agent's context is finite. It cannot see your entire codebase at once, so when it generates a new component, it needs a compact, authoritative description of everything that component must be consistent with. The spec is exactly that description. The constitution supplies the universal rules, the plan supplies the architecture and contracts, and the relevant capability spec supplies the local behavior. Together they let an agent generate a correct piece of a large system without having read the whole thing.

Consistency across sessions is the second reason. Real systems are not built in one sitting or by one agent. Work spans days, multiple people, and increasingly multiple different agents used side by side. Without a shared source of truth, each session drifts in its own direction. With one, every session is anchored to the same specifications and conventions, so the output stays coherent even when the contributors, human and machine, keep changing.

And reproducibility is the third. Because the implementation is derived from the spec, a misstep is recoverable. If the agent builds a component wrong, you do not necessarily debug the generated code line by line. You check whether the spec was ambiguous, tighten it, and regenerate. The spec being primary means mistakes are corrected at the level of intent, where the fix actually prevents recurrence, rather than at the level of symptoms.

Specs as Living Infrastructure

The payoff that traditional documentation never delivered is that a spec in this model does not rot, because it cannot. It is not a description of the system written alongside the system. It is the thing the system is generated from. Letting it fall out of date would be like letting your source code fall out of date with your binary; the relationship runs the other way.

This changes how change works. A new requirement starts as an edit to the specification. That edit is reviewable on its own, in plain language, before any code moves. It makes the consequences of a change visible at the level of intent, and then the implementation follows. Over the life of a system, this is the difference between a codebase whose reasoning is always recoverable and one whose reasoning has to be excavated from commit history and the memories of whoever happens to still work there.

It also produces an audit trail almost for free, which is why this approach has been adopted quickly in regulated domains. When the specification is version-controlled and the code is derived from it, you have a traceable record connecting every behavior of the system to the documented intent that produced it. For finance, healthcare, and any context where you may have to prove why a system does what it does, that traceability is not a nicety. It is the requirement.

The Human Role Shifts, It Does Not Disappear

A reasonable worry is that this turns engineers into prompt-pushers who rubber-stamp whatever the agent produces. In practice the good version is the opposite. The work moves up a level, from writing implementation to defining intent and verifying outcomes.

Your steering happens at the checkpoints between phases. You shape the constitution. You resolve the genuinely hard product questions in the spec. You make the architectural calls in the plan, because deciding the data model and the service boundaries is still a human judgment with long-lived consequences. And you verify at every phase boundary, because an unreviewed phase is how a small misunderstanding becomes the foundation for everything after it.

This is more demanding than vibe coding, not less. It asks you to think clearly about what you are building before the agent starts, and to actually check the work as it lands. What it removes is the part that was never the valuable part: typing out the implementation by hand once the intent was clear.

The Tooling Landscape, Briefly

You do not need a specific tool to work this way, but the ecosystem has matured fast and the tools encode the workflow so you do not have to assemble it yourself.

GitHub's Spec Kit is the most prominent open-source starting point. It installs a CLI and a set of commands that structure an agent into the specify, plan, tasks, and implement flow, with a constitution underneath, and it works across a long list of agents including Claude Code, Copilot, Cursor, Gemini CLI, and others, so you are not locked into one. AWS Kiro brings a similar philosophy with a tighter IDE integration and has published customer cases of large features delivered in a fraction of the usual human time.

Beyond those, OpenSpec, BMAD, Tessl, and others each take a slightly different position on how living the spec should be and how much governance to layer on. Even a plain conventions file that every agent reads, such as an AGENTS.md at the repository root, captures part of the same value with almost no setup.

The common thread across all of them is the same inversion: define what to build, in durable artifacts, before building it, and let those artifacts drive the generation.

The Honest Tradeoffs

This approach is not free and it is not always the right call.

It has real overhead at the start. Writing a constitution, specifying capabilities, and planning architecture before any code appears feels slow, especially compared to the instant gratification of vibe coding. For a weekend prototype or a script you will run once and delete, that overhead is not worth it. Vibe coding exists for a reason, and throwing it out entirely would be a mistake.

It also front-loads the thinking, which is uncomfortable. The hard product and architecture questions cannot be deferred; the workflow forces you to confront them before implementation. That is precisely the benefit, but it does not feel like a benefit in the moment when you would rather just see something run.

And the specs themselves require discipline to keep authoritative. A spec that the team stops treating as the source of truth quietly reverts to being stale documentation, and you are back where you started. The model only works if the spec genuinely remains primary.

The way to read these tradeoffs is by scale and lifespan. The larger the system, the longer it will live, and the more people and agents will touch it, the more the upfront structure pays for itself. A prototype does not need a constitution. A platform that a team will maintain for years and build with agents desperately does.

Where to Start

You do not have to adopt the whole apparatus on day one. The fastest way to feel the difference is to take one real, non-trivial piece of a system, something with a few moving parts and some genuine ambiguity, and build it the spec-driven way. Write down the rules it must obey, specify what it should do and explicitly what it should not, let the agent produce a plan with the data model and contracts, review that plan, and only then let it implement in small, verified steps.

Then compare that experience to the last thing you vibe-coded into a corner. The contrast tends to be persuasive on its own.

The agents are going to keep getting better at writing code. That was never the bottleneck. The bottleneck is making sure the code they write is the system you actually meant to build, and at any real scale, the only thing that reliably carries that intent is a specification good enough to build from. Write that, and the system follows.

The Spec Is the Source: Building Entire Systems with AI Coding Agents