DevSpark: Months Later, Lessons Learned
After months of using DevSpark across real projects, the theory met reality. This article is a practitioner's check-up — what survived contact with production, what surprised me, and the lessons I didn't expect about AI confidence, adversarial review, and the economics of doing it right.
DevSpark Series — 24 articles
- DevSpark: Constitution-Driven AI for Software Development
- Getting Started with DevSpark: Requirements Quality Matters
- DevSpark: Constitution-Based Pull Request Reviews
- Why I Built DevSpark
- Taking DevSpark to the Next Level
- From Oracle CASE to Spec-Driven AI Development
- Fork Management: Automating Upstream Integration
- DevSpark: The Evolution of AI-Assisted Software Development
- DevSpark: Months Later, Lessons Learned
- DevSpark in Practice: A NuGet Package Case Study
- DevSpark: From Fork to Framework — What the Commits Reveal
- DevSpark v0.1.0: Agent-Agnostic, Multi-User, and Built for Teams
- DevSpark Monorepo Support: Governing Multiple Apps in One Repository
- The DevSpark Tiered Prompt Model: Resolving Context at Scale
- A Governed Contribution Model for DevSpark Prompts
- Prompt Metadata: Enforcing the DevSpark Constitution
- Bring Your Own AI: DevSpark Unlocks Multi-Agent Collaboration
- Workflows as First-Class Artifacts: Defining Operations for AI
- Observability in AI Workflows: Exposing the Black Box
- Autonomy Guardrails: Bounding Agent Action Safely
- Dogfooding DevSpark: Building the Plane While Flying It
- Closing the Loop: Automating Feedback with Suggest-Improvement
- Designing the DevSpark CLI UX: Commands vs Prompts
- The Alias Layer: Masking Complexity in Agent Invocations
It's Been Months
When I published the comprehensive overview in late February, I had a framework, a philosophy, and a handful of real experiments to point to. What I didn't have was months of daily usage across projects that actually mattered — production code, client-facing features, the kind of work where a subtle bug doesn't just fail a test but costs real time and trust.
Now I have that. And the picture is more nuanced than I expected.
Some parts of DevSpark have proven themselves beyond what I hoped. The Critic — which I initially thought of as a nice-to-have — has become the single most valuable element in my workflow. Other parts have shown their seams. The constitution gets stale faster than I anticipated. Multi-model workflows introduced friction I hadn't fully accounted for. And the economics of spec-driven development are real — in both directions.
This isn't a victory lap. It's a check-up. Here's what I've actually learned.
The Confidence Trap I Nearly Fell Into
There's a moment — and if you've been working with AI coding agents for any length of time, you've probably felt it — where the generated code looks so clean, so complete, so right that you start taking it at face value. It compiles. It passes tests. It follows the patterns in the codebase. The pull request looks better than what most human developers would produce on a Tuesday afternoon.
I caught myself doing this more than once. The code was correct — technically. But it lacked the kind of awareness that comes from understanding how a system fails, not just how it works. Exception handling that covered the obvious cases but missed the subtle ones. API integrations that assumed the happy path was the only path. Configuration approaches that were clean in development but fragile in production.
This is exactly the garbage-in, garbage-out problem I wrote about in Part 1 — but experiencing it firsthand gave it a different weight. When I was theorizing about requirements quality, it felt like an academic insight. When I was staring at production code that had sailed through my own review because it looked right, it felt personal.
AI didn't introduce bad code into my workflow. It introduced confidently incomplete code — systems that looked production-ready on the surface while hiding gaps in resilience, failure awareness, and operational readiness. The danger isn't wrong code. It's code that's right enough to pass every check but wrong in ways you only discover at 2 AM.
The spec-driven approach doesn't eliminate this problem. But it forces the conversation about failure modes, edge cases, and operational concerns to happen before the code exists — when it's cheapest to address.
What the Critic Actually Taught Me
When I introduced the critic prompt in Part 3, I described it as adversarial risk analysis — a pre-mortem for AI-generated plans. The first few times I ran /devspark.critic, I thought it was being paranoid. Flagging risks in code that already had tests. Raising concerns about failure modes that seemed unlikely. Classifying things as "showstopper" that I would have waved through.
A month later, I started running it on everything.
The shift wasn't dramatic. It was gradual. The critic didn't catch some catastrophic bug that saved the day — nothing that clean happened. What happened instead was subtler and, I think, more important: my thinking changed. I started mentally running the critic before I invoked it. When reviewing AI-generated code, I found myself asking "how does this fail?" instead of "does this work?" — which is a fundamentally different question, and one that produces fundamentally different code.
| What I Expected | What Actually Happened |
|---|---|
| The critic catches bugs before merge | The critic shifted how I think about code before I write it |
| A one-time check at the end of the pipeline | Integrated into every stage of the workflow |
| Overhead that would feel like a tax | The fastest path to genuine confidence in AI-generated output |
| Occasional useful catches | A consistent reframing from validation to pre-mortem |
The critic's real contribution isn't the specific risks it flags — though those matter. It's the adversarial mindset it cultivates. I've started thinking of it less as a tool and more as the voice of the senior developers early in my career — the ones who always pushed back before you got too far down the wrong path. Modern teams, optimizing for velocity, have often lost that voice. The critic brings it back.
The question the critic answers isn't "did we build it right?" It's "how will this fail in production?" Those are two entirely different conversations, and only one of them prevents incidents.
Once you fully internalize that adversarial mindset, a structural limit becomes visible: a single AI model cannot genuinely play both creator and adversary. The same blind spots that shaped its generated code shape its review of that code. The critic prompt changes how you think, but it doesn't change the model's fundamental frame of reference. A developer proofreading their own code faces the same problem — they know what they meant to write, so they tend to read what they meant rather than what's there. The adversarial mindset is the recognition that you need a different perspective. The multi-model workflow is the systemic answer to that recognition.
The Single-Model Trap
For the first several weeks, I was using Claude for everything in the DevSpark pipeline — spec, plan, tasks, implementation, critic, review. It felt efficient. The outputs were consistent, the context carried through, and the friction was minimal.
Too minimal, it turned out.
When the same model writes the specification and then validates the implementation against that specification, you have a closed loop of agreement. The model evaluates its own assumptions. It checks its own work. And it will always pass its own test — not because it's dishonest, but because the blind spots that shaped the spec are the same blind spots that shape the review.
I started noticing this pattern when critic reviews were consistently positive. Not "nothing to worry about" positive — just never surprising. The feedback confirmed what I already thought. Which is exactly the problem.
The shift happened when I started using a different model for critic passes — bringing in GPT-4 for the adversarial review while using Claude for the implementation work. The disagreements were immediate and illuminating. Different models have genuinely different analytical patterns. What one treats as obvious, another questions. What one considers standard practice, another flags as a risk.
This mirrors something I've seen throughout my career: the best reviews come from people who don't share your assumptions. The same principle applies to AI models. Cognitive diversity matters — whether the cognition is human or artificial.
| Role | What Works | Why |
|---|---|---|
| Specification and planning | One model with deep context | Continuity matters for coherent design |
| Implementation | Builder model with codebase access | Familiarity with existing patterns |
| Critic and adversarial review | A different model entirely | Fresh perspective breaks agreement loops |
I'll be honest about the friction: switching between models means losing context. You have to re-explain decisions, re-share architecture, re-establish constraints. It's real overhead. But for anything that matters — features touching production, architectural changes, security-sensitive code — the diversity of perspective is worth the cost of context transfer.
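To make the role split in the table above concrete, here is a small sketch of how I think about the assignment. The stage names and model identifiers are illustrative only; this is not DevSpark's configuration format, just the principle expressed as code:

```python
# model_roles.py - a minimal sketch of the role split, not DevSpark's actual
# configuration format. Stage names and model identifiers are illustrative.
PIPELINE_MODELS = {
    "spec": "claude",        # specification and planning: one model, deep context
    "plan": "claude",
    "implement": "claude",   # builder model with codebase access
    "critic": "gpt-4",       # adversarial review: deliberately a different model
}

def model_for(stage: str) -> str:
    """Return the model assigned to a pipeline stage."""
    return PIPELINE_MODELS[stage]

# Guard against the closed loop of agreement: the critic must never be the builder.
assert model_for("critic") != model_for("implement"), \
    "Critic and implementation should use different models to break the agreement loop"
```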
Is this the future of AI-assisted development — assembling a team of models the way you'd assemble a team of people, each bringing different strengths and blind spots? I'm starting to think it might be.
The Economics: What It Actually Cost
The honest answer is that spec-driven development with the full DevSpark pipeline adds 30-40% to the initial specification and planning phase of a feature. That's not trivial. When stakeholders are asking "when will it be done?" and peers are shipping features with a prompt and a prayer, the temptation to skip the governance and just build is real.
But here's what I found when I tracked the full cycle:
| Phase | Without DevSpark | With DevSpark |
|---|---|---|
| Specification | Quick or skipped entirely | 30-40% longer |
| Implementation | Fast, but rework-heavy | Steady, fewer surprises |
| Debugging and fixing | Where most of the time actually went | Dramatically reduced |
| Production issues | Discovered by users | Caught before merge |
| Total cycle | Felt fast, finished slow | Felt slow, finished fast |
The overhead isn't evenly distributed. It's front-loaded. And front-loaded cost is the cheapest kind — a problem caught during specification costs a fraction of what it costs in production. We've known this since the 1970s. We just keep learning it again with every new technology cycle.
The caveat — and this matters — is that not every piece of work deserves this level of rigor. I still generate throwaway scripts without a constitution. I still vibe-code prototypes when I'm exploring an idea. The right-sizing principle from Part 4 turned out to be more important than I expected: apply the full pipeline to work that matters, and don't let the process become the obstacle.
What I'd Change
Months of real usage have exposed genuine friction points, and I think honesty about these matters more than advocacy.
The constitution gets stale. I wrote a constitution for one project in November and didn't update it until February. By then, the codebase had evolved past several of the original principles. A constitution that doesn't track reality becomes a source of false positives in critic reviews — and false positives erode trust faster than missing the constitution entirely. The simplest fix I've found: treat the constitution like a dependency. If no updates have occurred in 60 days, flag it for review. A cron job or a calendar reminder works. The point is to make staleness visible before it causes problems.
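Here is a minimal sketch of that staleness check, runnable from CI or cron. The constitution path (.devspark/constitution.md) and the 60-day threshold are assumptions about your layout, not anything DevSpark enforces; the mechanics are plain git plumbing:

```python
# staleness_check.py - flag the constitution for review if it hasn't changed in 60 days.
# Assumes the constitution lives at .devspark/constitution.md (hypothetical path)
# and that the script runs inside the project's git repository.
import subprocess
from datetime import datetime, timezone, timedelta

CONSTITUTION = ".devspark/constitution.md"  # adjust to your project's layout
MAX_AGE = timedelta(days=60)

# Ask git for the committer date (strict ISO 8601) of the last commit touching the file.
out = subprocess.run(
    ["git", "log", "-1", "--format=%cI", "--", CONSTITUTION],
    capture_output=True, text=True, check=True,
).stdout.strip()

if not out:
    raise SystemExit(f"{CONSTITUTION} has no commit history - is the path right?")

last_touched = datetime.fromisoformat(out)
age = datetime.now(timezone.utc) - last_touched

if age > MAX_AGE:
    print(f"STALE: {CONSTITUTION} last updated {age.days} days ago - schedule a review.")
    raise SystemExit(1)  # non-zero exit so a CI job or cron wrapper can surface it
print(f"OK: {CONSTITUTION} updated {age.days} days ago.")
```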
Critic false alarms are real. Especially in the early stages, the critic flagged risks that weren't risks — theoretical failure modes that couldn't occur given the actual architecture. Each false alarm chips away at the credibility of the review process. Calibrating the critic to the project's actual risk surface takes time and iteration. The practical response is to maintain a short calibration note alongside each constitution documenting which risk categories are not applicable to this project's architecture. It takes five minutes and dramatically reduces noise.
Teams will bypass the process. When deadlines press, the full pipeline is the first thing sacrificed. I've done it myself. The question isn't whether this happens but whether the framework degrades gracefully. A formalized "light mode" helps: for throwaway scripts and internal tooling, skip the multi-model adversarial review but retain the base spec and task generation. For hotfixes under production pressure, skip the critic but document the exception inline. The goal is to keep teams inside the framework even when they can't use all of it — rather than losing them entirely when perfect becomes the enemy of good.
Human-in-the-loop adds real latency. Every gate that requires human review is a gate where work waits. For small teams and solo developers, this is manageable. For larger teams with async workflows, the gates can become bottlenecks if not designed carefully.
These aren't theoretical concerns. I've hit every one of them.
The Governance Question
After months of working this way, I've come to think of governance not as bureaucracy but as clarity. The constitution doesn't slow you down. Not knowing what you're building slows you down. Not knowing your constraints slows you down. Discovering architectural misalignment in production — that slows you down.
The division of labor is what makes this sustainable. AI surfaces risks, identifies edge cases, enforces patterns, and runs the pre-mortem. Humans make the actual decisions — accepting risks, prioritizing tradeoffs, and owning the outcomes. AI can identify risk. Only humans can accept it.
The system also has to learn. The constitution and critic rules should evolve based on what actually fails in production — not just what the models predict might fail. That feedback loop, from production incidents back into governance artifacts, is what separates a living framework from a write-once document.
The minimal version of this loop is straightforward enough to start today:
- Identify the edge case — when a production incident or missed edge case surfaces, write one sentence describing what the system got wrong.
- Write the constraint — translate that into a single, concrete rule: "Always validate X before Y" or "Never assume Z in the context of W."
- Commit it to the constitution — add it to the project's DevSpark constitution with a note linking to the incident.
That's it. No automation required to begin. The discipline of closing that loop manually — turning incidents into rules — is itself the practice worth building. As the project grows, that manual process reveals exactly which parts are worth automating.
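Even the "commit it" step can stay that simple. Here is a sketch of a helper that appends a rule and its incident link to the constitution file; the path and entry format are assumptions for illustration, not a DevSpark convention:

```python
# add_rule.py - a minimal sketch of step 3, "commit it to the constitution".
# The constitution path and the entry format are assumptions about your layout.
from datetime import date
from pathlib import Path
import sys

CONSTITUTION = Path(".devspark/constitution.md")  # hypothetical location

def add_rule(rule: str, incident_link: str) -> None:
    """Append a production-derived rule to the constitution, linked to its incident."""
    entry = (
        f"\n- {rule}  \n"
        f"  (added {date.today().isoformat()}, source: {incident_link})\n"
    )
    with CONSTITUTION.open("a", encoding="utf-8") as f:
        f.write(entry)

if __name__ == "__main__":
    # Example: python add_rule.py "Always validate the upload size before streaming" \
    #          "https://example.com/incidents/142"
    add_rule(sys.argv[1], sys.argv[2])
```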
The Honest Balance Sheet
Months into this journey, I'm not sure I'd call DevSpark a revolution. It's more like finally having the discipline I always knew I needed — backed by tooling that makes the discipline sustainable. The pattern that persists across four decades of my career is the same: iterate on the model, not the code. Whether the model is an Oracle CASE repository, an ASP.NET Maker configuration, or a DevSpark constitution, the principle holds.
What surprised me most wasn't any single lesson but the cumulative effect. The confidence trap becomes visible. The critic shifts your thinking. Multi-model diversity catches what single-model consistency misses. And the economics — felt slow, finished fast — compound over time.
The question I keep coming back to isn't whether this approach works. It does, for the work that warrants it. The question is whether I'd go back to working without it. And the honest answer — for anything that matters — is no.
The teams that will navigate AI-assisted development successfully won't be the ones using better models. They'll be the ones building better systems around those models. Start with the Critic. Define roles. Let the process learn from production. Build governance — not just code.
Explore More
- DevSpark: Constitution-Driven AI for Software Development -- DevSpark aligns AI coding agents with project architecture and governance
- DevSpark in Practice: A NuGet Package Case Study -- Four consecutive specifications on a production .NET package
- Why I Built DevSpark -- Building the tool I needed to survive the reality of brownfield development
- Taking DevSpark to the Next Level -- A practitioner's guide to bridging three decades of enterprise development
- DevSpark: The Evolution of AI-Assisted Software Development -- From requirements discipline to continuous governance — a complete framework
