DevSpark: Months Later, Lessons Learned
After months of using DevSpark across real projects, the theory met reality. This article is a practitioner's check-up — what survived contact with production, what surprised me, and the lessons I didn't expect about AI confidence, adversarial review, and the economics of doing it right.
DevSpark Series — 24 articles
- DevSpark: Constitution-Driven AI for Software Development
- Getting Started with DevSpark: Requirements Quality Matters
- DevSpark: Constitution-Based Pull Request Reviews
- Why I Built DevSpark
- Taking DevSpark to the Next Level
- From Oracle CASE to Spec-Driven AI Development
- Fork Management: Automating Upstream Integration
- DevSpark: The Evolution of AI-Assisted Software Development
- DevSpark: Months Later, Lessons Learned
- DevSpark in Practice: A NuGet Package Case Study
- DevSpark: From Fork to Framework — What the Commits Reveal
- DevSpark v0.1.0: Agent-Agnostic, Multi-User, and Built for Teams
- DevSpark Monorepo Support: Governing Multiple Apps in One Repository
- The DevSpark Tiered Prompt Model: Resolving Context at Scale
- A Governed Contribution Model for DevSpark Prompts
- Prompt Metadata: Enforcing the DevSpark Constitution
- Bring Your Own AI: DevSpark Unlocks Multi-Agent Collaboration
- Workflows as First-Class Artifacts: Defining Operations for AI
- Observability in AI Workflows: Exposing the Black Box
- Autonomy Guardrails: Bounding Agent Action Safely
- Dogfooding DevSpark: Building the Plane While Flying It
- Closing the Loop: Automating Feedback with Suggest-Improvement
- Designing the DevSpark CLI UX: Commands vs Prompts
- The Alias Layer: Masking Complexity in Agent Invocations
It's Been Months
When I published the comprehensive overview in late February, I had a framework, a philosophy, and a handful of real experiments to point to. What I didn't have was months of daily usage across projects that actually mattered — production code, client-facing features, the kind of work where a subtle bug doesn't just fail a test but costs real time and trust.
Now I have that. And the picture is more nuanced than I expected.
Some parts of DevSpark have proven themselves beyond what I hoped. The Critic — which I initially thought of as a nice-to-have — has become the single most valuable element in my workflow. Other parts have shown their seams. The constitution gets stale faster than I anticipated. Multi-model workflows introduced friction I hadn't fully accounted for. And the economics of spec-driven development are real — in both directions.
This isn't a victory lap. It's a check-up. Here's what I've actually learned.
The Confidence Trap I Nearly Fell Into
There's a moment — and if you've been working with AI coding agents for any length of time, you've probably felt it — where the generated code looks so clean, so complete, so right that you start taking it at face value. It compiles. It passes tests. It follows the patterns in the codebase. The pull request looks better than what most human developers would produce on a Tuesday afternoon.
I caught myself doing this more than once. The code was correct — technically. But it lacked the kind of awareness that comes from understanding how a system fails, not just how it works. Exception handling that covered the obvious cases but missed the subtle ones. API integrations that assumed the happy path was the only path. Configuration approaches that were clean in development but fragile in production.
This is exactly the garbage-in, garbage-out problem I wrote about in Part 1 — but experiencing it firsthand gave it a different weight. When I was theorizing about requirements quality, it felt like an academic insight. When I was staring at production code that had sailed through my own review because it looked right, it felt personal.
AI didn't introduce bad code into my workflow. It introduced confidently incomplete code — systems that looked production-ready on the surface while hiding gaps in resilience, failure awareness, and operational readiness. The danger isn't wrong code. It's code that's right enough to pass every check but wrong in ways you only discover at 2 AM.
The spec-driven approach doesn't eliminate this problem. But it forces the conversation about failure modes, edge cases, and operational concerns to happen before the code exists — when it's cheapest to address.
What the Critic Actually Taught Me
When I introduced the critic prompt in Part 3, I described it as adversarial risk analysis — a pre-mortem for AI-generated plans. The first few times I ran /devspark.critic, I thought it was being paranoid. Flagging risks in code that already had tests. Raising concerns about failure modes that seemed unlikely. Classifying things as "showstopper" that I would have waved through.
A month later, I started running it on everything.
The shift wasn't dramatic. It was gradual. The critic didn't catch some catastrophic bug that saved the day — nothing that clean happened. What happened instead was subtler and, I think, more important: my thinking changed. I started mentally running the critic before I invoked it. When reviewing AI-generated code, I found myself asking "how does this fail?" instead of "does this work?" — which is a fundamentally different question, and one that produces fundamentally different code.
| What I Expected | What Actually Happened |
|---|---|
| The critic catches bugs before merge | The critic shifted how I think about code before I write it |
| A one-time check at the end of the pipeline | Integrated into every stage of the workflow |
| Overhead that would feel like a tax | The fastest path to genuine confidence in AI-generated output |
| Occasional useful catches | A consistent reframing from validation to pre-mortem |
The critic's real contribution isn't the specific risks it flags — though those matter. It's the adversarial mindset it cultivates. I've started thinking of it less as a tool and more as the voice of the senior developers early in my career — the ones who always pushed back before you got too far down the wrong path. Modern teams, optimizing for velocity, have often lost that voice. The critic brings it back.
The question the critic answers isn't "did we build it right?" It's "how will this fail in production?" Those are two entirely different conversations, and only one of them prevents incidents.
Once you fully internalize that adversarial mindset, a structural limit becomes visible: a single AI model cannot genuinely play both creator and adversary. The same blind spots that shaped its generated code shape its review of that code. The critic prompt changes how you think, but it doesn't change the model's fundamental frame of reference. A developer proofreading their own code faces the same problem — they know what they meant to write, so they tend to read what they meant rather than what's there. The adversarial mindset is the recognition that you need a different perspective. The multi-model workflow is the systemic answer to that recognition.
The Single-Model Trap
For the first several weeks, I was using Claude for everything in the DevSpark pipeline — spec, plan, tasks, implementation, critic, review. It felt efficient. The outputs were consistent, the context carried through, and the friction was minimal.
Too minimal, it turned out.
When the same model writes the specification and then validates the implementation against that specification, you have a closed loop of agreement. The model evaluates its own assumptions. It checks its own work. And it will always pass its own test — not because it's dishonest, but because the blind spots that shaped the spec are the same blind spots that shape the review.
I started noticing this pattern when critic reviews were consistently positive. Not "nothing to worry about" positive — just never surprising. The feedback confirmed what I already thought. Which is exactly the problem.
The shift happened when I started using a different model for critic passes — bringing in GPT-4 for the adversarial review while using Claude for the implementation work. The disagreements were immediate and illuminating. Different models have genuinely different analytical patterns. What one treats as obvious, another questions. What one considers standard practice, another flags as a risk.
This mirrors something I've seen throughout my career: the best reviews come from people who don't share your assumptions. The same principle applies to AI models. Cognitive diversity matters — whether the cognition is human or artificial.
| Role | What Works | Why |
|---|---|---|
| Specification and planning | One model with deep context | Continuity matters for coherent design |
| Implementation | Builder model with codebase access | Familiarity with existing patterns |
| Critic and adversarial review | A different model entirely | Fresh perspective breaks agreement loops |
I'll be honest about the friction: switching between models means losing context. You have to re-explain decisions, re-share architecture, re-establish constraints. It's real overhead. But for anything that matters — features touching production, architectural changes, security-sensitive code — the diversity of perspective is worth the cost of context transfer.
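To make the role split in the table above concrete, here is a small sketch of how I think about the assignment. The stage names and model identifiers are illustrative only; this is not DevSpark's configuration format, just the principle expressed as code:

```python
# model_roles.py - a minimal sketch of the role split, not DevSpark's actual
# configuration format. Stage names and model identifiers are illustrative.
PIPELINE_MODELS = {
    "spec": "claude",        # specification and planning: one model, deep context
    "plan": "claude",
    "implement": "claude",   # builder model with codebase access
    "critic": "gpt-4",       # adversarial review: deliberately a different model
}

def model_for(stage: str) -> str:
    """Return the model assigned to a pipeline stage."""
    return PIPELINE_MODELS[stage]

# Guard against the closed loop of agreement: the critic must never be the builder.
assert model_for("critic") != model_for("implement"), \
    "Critic and implementation should use different models to break the agreement loop"
```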
Is this the future of AI-assisted development — assembling a team of models the way you'd assemble a team of people, each bringing different strengths and blind spots? I'm starting to think it might be.
The Economics: What It Actually Cost
The honest answer is that spec-driven development with the full DevSpark pipeline adds 30-40% to the initial specification and planning phase of a feature. That's not trivial. When stakeholders are asking "when will it be done?" and peers are shipping features with a prompt and a prayer, the temptation to skip the governance and just build is real.
But here's what I found when I tracked the full cycle:
| Phase | Without DevSpark | With DevSpark |
|---|---|---|
| Specification | Quick or skipped entirely | 30-40% longer |
| Implementation | Fast, but rework-heavy | Steady, fewer surprises |
| Debugging and fixing | Where most of the time actually went | Dramatically reduced |
| Production issues | Discovered by users | Caught before merge |
| Total cycle | Felt fast, finished slow | Felt slow, finished fast |
The overhead isn't evenly distributed. It's front-loaded. And front-loaded cost is the cheapest kind — a problem caught during specification costs a fraction of what it costs in production. We've known this since the 1970s. We just keep learning it again with every new technology cycle.
The caveat — and this matters — is that not every piece of work deserves this level of rigor. I still generate throwaway scripts without a constitution. I still vibe-code prototypes when I'm exploring an idea. The right-sizing principle from Part 4 turned out to be more important than I expected: apply the full pipeline to work that matters, and don't let the process become the obstacle.
What I'd Change
Months of real usage have exposed genuine friction points, and I think honesty about these matters more than advocacy.
The constitution gets stale. I wrote a constitution for one project in November and didn't update it until February. By then, the codebase had evolved past several of the original principles. A constitution that doesn't track reality becomes a source of false positives in critic reviews — and false positives erode trust faster than missing the constitution entirely. The simplest fix I've found: treat the constitution like a dependency. If no updates have occurred in 60 days, flag it for review. A cron job or a calendar reminder works. The point is to make staleness visible before it causes problems.
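Here is a minimal sketch of that staleness check, runnable from CI or cron. The constitution path (.devspark/constitution.md) and the 60-day threshold are assumptions about your layout, not anything DevSpark enforces; the mechanics are plain git plumbing:

```python
# staleness_check.py - flag the constitution for review if it hasn't changed in 60 days.
# Assumes the constitution lives at .devspark/constitution.md (hypothetical path)
# and that the script runs inside the project's git repository.
import subprocess
from datetime import datetime, timezone, timedelta

CONSTITUTION = ".devspark/constitution.md"  # adjust to your project's layout
MAX_AGE = timedelta(days=60)

# Ask git for the committer date (strict ISO 8601) of the last commit touching the file.
out = subprocess.run(
    ["git", "log", "-1", "--format=%cI", "--", CONSTITUTION],
    capture_output=True, text=True, check=True,
).stdout.strip()

if not out:
    raise SystemExit(f"{CONSTITUTION} has no commit history - is the path right?")

last_touched = datetime.fromisoformat(out)
age = datetime.now(timezone.utc) - last_touched

if age > MAX_AGE:
    print(f"STALE: {CONSTITUTION} last updated {age.days} days ago - schedule a review.")
    raise SystemExit(1)  # non-zero exit so a CI job or cron wrapper can surface it
print(f"OK: {CONSTITUTION} updated {age.days} days ago.")
```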
Critic false alarms are real. Especially in the early stages, the critic flagged risks that weren't risks — theoretical failure modes that couldn't occur given the actual architecture. Each false alarm chips away at the credibility of the review process. Calibrating the critic to the project's actual risk surface takes time and iteration. The practical response is to maintain a short calibration note alongside each constitution documenting which risk categories are not applicable to this project's architecture. It takes five minutes and dramatically reduces noise.
Teams will bypass the process. When deadlines press, the full pipeline is the first thing sacrificed. I've done it myself. The question isn't whether this happens but whether the framework degrades gracefully. A formalized "light mode" helps: for throwaway scripts and internal tooling, skip the multi-model adversarial review but retain the base spec and task generation. For hotfixes under production pressure, skip the critic but document the exception inline. The goal is to keep teams inside the framework even when they can't use all of it — rather than losing them entirely when perfect becomes the enemy of good.
Human-in-the-loop adds real latency. Every gate that requires human review is a gate where work waits. For small teams and solo developers, this is manageable. For larger teams with async workflows, the gates can become bottlenecks if not designed carefully.
These aren't theoretical concerns. I've hit every one of them.
The Governance Question
After months of working this way, I've come to think of governance not as bureaucracy but as clarity. The constitution doesn't slow you down. Not knowing what you're building slows you down. Not knowing your constraints slows you down. Discovering architectural misalignment in production — that slows you down.
The division of labor is what makes this sustainable. AI surfaces risks, identifies edge cases, enforces patterns, and runs the pre-mortem. Humans make the actual decisions — accepting risks, prioritizing tradeoffs, and owning the outcomes. AI can identify risk. Only humans can accept it.
The system also has to learn. The constitution and critic rules should evolve based on what actually fails in production — not just what the models predict might fail. That feedback loop, from production incidents back into governance artifacts, is what separates a living framework from a write-once document.
The minimal version of this loop is straightforward enough to start today:
- Identify the edge case — when a production incident or missed edge case surfaces, write one sentence describing what the system got wrong.
- Write the constraint — translate that into a single, concrete rule: "Always validate X before Y" or "Never assume Z in the context of W."
- Commit it to the constitution — add it to the project's DevSpark constitution with a note linking to the incident.
That's it. No automation required to begin. The discipline of closing that loop manually — turning incidents into rules — is itself the practice worth building. As the project grows, that manual process reveals exactly which parts are worth automating.
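Even the "commit it" step can stay that simple. Here is a sketch of a helper that appends a rule and its incident link to the constitution file; the path and entry format are assumptions for illustration, not a DevSpark convention:

```python
# add_rule.py - a minimal sketch of step 3, "commit it to the constitution".
# The constitution path and the entry format are assumptions about your layout.
from datetime import date
from pathlib import Path
import sys

CONSTITUTION = Path(".devspark/constitution.md")  # hypothetical location

def add_rule(rule: str, incident_link: str) -> None:
    """Append a production-derived rule to the constitution, linked to its incident."""
    entry = (
        f"\n- {rule}  \n"
        f"  (added {date.today().isoformat()}, source: {incident_link})\n"
    )
    with CONSTITUTION.open("a", encoding="utf-8") as f:
        f.write(entry)

if __name__ == "__main__":
    # Example: python add_rule.py "Always validate the upload size before streaming" \
    #          "https://example.com/incidents/142"
    add_rule(sys.argv[1], sys.argv[2])
```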
The Honest Balance Sheet
Months into this journey, I'm not sure I'd call DevSpark a revolution. It's more like finally having the discipline I always knew I needed — backed by tooling that makes the discipline sustainable. The pattern that persists across four decades of my career is the same: iterate on the model, not the code. Whether the model is an Oracle CASE repository, an ASP.NET Maker configuration, or a DevSpark constitution, the principle holds.
What surprised me most wasn't any single lesson but the cumulative effect. The confidence trap becomes visible. The critic shifts your thinking. Multi-model diversity catches what single-model consistency misses. And the economics — felt slow, finished fast — compound over time.
The question I keep coming back to isn't whether this approach works. It does, for the work that warrants it. The question is whether I'd go back to working without it. And the honest answer — for anything that matters — is no.
The teams that will navigate AI-assisted development successfully won't be the ones using better models. They'll be the ones building better systems around those models. Start with the Critic. Define roles. Let the process learn from production. Build governance — not just code.
Explore More
- DevSpark: Constitution-Driven AI for Software Development -- DevSpark aligns AI coding agents with project architecture and governance
- DevSpark in Practice: A NuGet Package Case Study -- Four consecutive specifications on a production .NET package
- Why I Built DevSpark -- Building the tool I needed to survive the reality of brownfield development
- Taking DevSpark to the Next Level -- A practitioner's guide to bridging three decades of enterprise development
- DevSpark: The Evolution of AI-Assisted Software Development -- From requirements discipline to continuous governance — a complete framework
