Workflows as First-Class Artifacts: Defining Operations for AI
How DevSpark's Harness Runtime turns ad-hoc AI interactions into version-controlled, validated, reproducible workflow specs — and what changed.
DevSpark Series — 24 articles
- DevSpark: Constitution-Driven AI for Software Development
- Getting Started with DevSpark: Requirements Quality Matters
- DevSpark: Constitution-Based Pull Request Reviews
- Why I Built DevSpark
- Taking DevSpark to the Next Level
- From Oracle CASE to Spec-Driven AI Development
- Fork Management: Automating Upstream Integration
- DevSpark: The Evolution of AI-Assisted Software Development
- DevSpark: Months Later, Lessons Learned
- DevSpark in Practice: A NuGet Package Case Study
- DevSpark: From Fork to Framework — What the Commits Reveal
- DevSpark v0.1.0: Agent-Agnostic, Multi-User, and Built for Teams
- DevSpark Monorepo Support: Governing Multiple Apps in One Repository
- The DevSpark Tiered Prompt Model: Resolving Context at Scale
- A Governed Contribution Model for DevSpark Prompts
- Prompt Metadata: Enforcing the DevSpark Constitution
- Bring Your Own AI: DevSpark Unlocks Multi-Agent Collaboration
- Workflows as First-Class Artifacts: Defining Operations for AI
- Observability in AI Workflows: Exposing the Black Box
- Autonomy Guardrails: Bounding Agent Action Safely
- Dogfooding DevSpark: Building the Plane While Flying It
- Closing the Loop: Automating Feedback with Suggest-Improvement
- Designing the DevSpark CLI UX: Commands vs Prompts
- The Alias Layer: Masking Complexity in Agent Invocations
Imagine managing a production database by manually typing SQL statements into a terminal every time a schema change is needed. We abandoned that practice in favor of migrations, version control, and CI/CD pipelines. The discipline holds: if it matters, it gets a file, a diff, and a review.
I kept noticing that I wasn't applying the same discipline to my AI interactions. Every specification session, every critic run, every PR review was a transient conversation — typed once, lost when the window closed. When I wanted to re-run a workflow that had worked well, I was reconstructing it from memory. When a colleague asked how I'd approached a complex refactoring, I couldn't share the interaction. The approach existed only in my chat history.
The work to fix this became DevSpark's Harness Runtime, shipped in v2.0.0.
What a Workflow Artifact Actually Is
A DevSpark workflow is a YAML file with a defined schema (apiVersion: devspark.ai/v1). It declares a sequence of operations — what context to gather, which agent to invoke, what to validate, how to handle failures — in a format that the devspark harness run command can execute end-to-end.
The sample.harness.yaml in the DevSpark repository shows the structure at its simplest:
apiVersion: devspark.ai/v1
kind: HarnessSpec
metadata:
  name: scaffold-api-endpoint
  description: Generates a new API endpoint with validation and tests
steps:
  - id: gather-models
    action: read_files
    target: src/domain/models/**/*.cs
  - id: generate-controller
    adapter: claude_code
    prompt_ref: .devspark/templates/prompts/api-controller.md
    depends_on: [gather-models]
    validate:
      - rule: file.exists
        path: src/api/controllers/
  - id: generate-tests
    adapter: claude_code
    prompt_ref: .devspark/templates/prompts/api-tests.md
    depends_on: [generate-controller]
    validate:
      - rule: command.exit_code
        command: dotnet test
        expected: 0

Every step has a purpose, explicit dependencies, and validation rules. If the controller step fails, the test step doesn't run. If validation fails, the run exits with a structured error, not silently wrong output.
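The dependency-and-validation behavior described above can be sketched in a few lines. This is not DevSpark's actual runtime code (which isn't shown here) — just a minimal illustration of how a harness could order steps by their depends_on edges and stop a chain when a step fails, so generate-tests never runs if generate-controller fails validation:

```python
# Hypothetical sketch of dependency-ordered step execution with fail-fast
# skipping; step/adapter behavior is passed in as callables for illustration.

def run_steps(steps, execute, validate):
    """steps: list of dicts with 'id' and an optional 'depends_on' list."""
    done, failed = set(), set()
    remaining = list(steps)
    while remaining:
        progressed = False
        for step in list(remaining):
            deps = step.get("depends_on", [])
            if any(d in failed for d in deps):
                failed.add(step["id"])  # an upstream step failed: skip this one
                remaining.remove(step)
                progressed = True
            elif all(d in done for d in deps):
                ok = execute(step) and validate(step)
                (done if ok else failed).add(step["id"])
                remaining.remove(step)
                progressed = True
        if not progressed:
            raise ValueError("cyclic or unsatisfiable depends_on graph")
    return done, failed
```

With the three steps from the sample spec, failing generate-controller leaves generate-tests in the failed set without ever executing it — the behavior the spec guarantees.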
The Two Execution Modes
One of the decisions I made early in the Harness design: there had to be a way to preview a workflow without executing it. --mode plan runs the workflow in read-only mode. Steps that would write files or invoke terminal commands are skipped, but the prompts are still assembled and the agent receives a prefixed instruction indicating it's operating in plan mode. The output shows what would happen without the side effects.
--mode act is the default — full execution, all steps, all writes, all commands.
The distinction matters for workflows touching infrastructure or making irreversible changes. Running --mode plan first to review the assembled context and the proposed steps adds ten seconds, and it has stopped me more than once from running the wrong spec against the wrong project.
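The plan/act split can be pictured as a gate in front of each step. The flags (--mode plan, --mode act) are real; everything inside this sketch — the action names, the prefix wording — is an assumption for illustration, not DevSpark's implementation:

```python
# Hypothetical sketch of plan-mode gating: side-effecting actions are skipped,
# while prompts still go out carrying a plan-mode prefix so the agent knows
# it's a dry run. Action names and prefix text are invented for this example.

PLAN_PREFIX = "[PLAN MODE] Describe what you would do; make no changes.\n\n"
SIDE_EFFECTS = {"write_file", "run_command"}

def dispatch(step, mode, send_prompt):
    action = step.get("action", "invoke_agent")
    if mode == "plan" and action in SIDE_EFFECTS:
        # Writes and commands never happen in plan mode.
        return {"id": step["id"], "status": "skipped (plan mode)"}
    prompt = step.get("prompt", "")
    if mode == "plan":
        prompt = PLAN_PREFIX + prompt  # agent is told it's operating read-only
    return {"id": step["id"], "status": "ran", "response": send_prompt(prompt)}
```

In act mode the same dispatch runs every step unmodified, which is why act can safely be the default: plan is strictly a restriction, never a different workflow.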
Adapters and Portability
The Harness Runtime supports five built-in adapters: copilot, claude_code, cursor, manual, and noop. The adapter determines how a step sends its prompt to an AI agent. The manual adapter pauses execution and waits for a human to complete the step — useful for steps that require judgment the AI shouldn't be trusted to make autonomously. The noop adapter skips the LLM call entirely and returns a placeholder, useful for testing the harness spec structure without consuming API tokens.
What this means practically: a workflow I wrote for Claude Code works for a colleague using Copilot by changing the adapter field. The context gathering, the validation rules, the artifact tracking — all of that is adapter-agnostic. The spec is the portable artifact. The adapter is an implementation detail.
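Adapter dispatch is conceptually just a lookup on the step's adapter field. The five adapter names come from the runtime; the callables below are illustrative stand-ins, not DevSpark's actual adapter implementations:

```python
# Hypothetical sketch of adapter dispatch. Only the adapter names are from the
# article; these function bodies are invented placeholders.

def noop_adapter(prompt):
    return "[noop] placeholder response"  # no LLM call, no tokens spent

def manual_adapter(prompt):
    # Pauses the run and hands the step to a human.
    print(prompt)
    return input("Complete this step manually, then paste the result: ")

ADAPTERS = {
    "noop": noop_adapter,
    "manual": manual_adapter,
    # "copilot", "claude_code", "cursor" would wrap the respective agent tools
}

def run_step(step, prompt):
    adapter = ADAPTERS[step["adapter"]]  # the spec is unchanged; only this key varies
    return adapter(prompt)
```

This is why swapping claude_code for copilot is a one-line diff in the spec: the step's context, dependencies, and validation rules never mention the adapter at all.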
Run Artifacts and Telemetry
Every Harness run produces artifacts under .documentation/devspark/runs/. Each run gets a run.json with the structured execution record and a JSONL event log with a timestamped entry for every action the runtime took. Artifact delta tracking records which files were created, modified, or deleted in each step.
The practical value: when a workflow produces unexpected output, I don't have to reconstruct what happened from memory. The run record shows exactly what context was gathered, what prompt was sent, what validation ran, and what files changed. Debugging a non-deterministic AI process becomes close to debugging a deterministic one — there's a trail.
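Because the event log is JSONL, reconstructing a run is a matter of grouping timestamped lines by step. The directory layout and file kinds (run.json, JSONL events) are from the article; the field names used here (step, event, ts) are assumptions for illustration:

```python
# Hypothetical sketch: parse a Harness JSONL event log and group events by
# step to see what the run actually did. Field names are assumed, not
# DevSpark's documented schema.
import json

def summarize_events(jsonl_text):
    """Return {step_id: [event, ...]} in log order."""
    by_step = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the log
        event = json.loads(line)
        by_step.setdefault(event["step"], []).append(event["event"])
    return by_step
```

A few lines like this are usually enough to answer the debugging questions above — which step sent which prompt, and where validation stopped the run.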
What This Changed
The most concrete change is that useful workflows are now shareable. If I develop a workflow for scaffolding a specific pattern in a .NET codebase — with the right context gathering, the right validation, the right retry logic — I commit the YAML file. A colleague clones the repo, runs devspark harness run scaffold-api-endpoint.yaml, and gets the same result I did. Not approximately the same result. The same result, because the same spec controls what context the agent sees and what validation the output must pass.
The more subtle change is in how I think about AI-assisted development. When I'm refining a prompt that almost works but not quite, I'm now editing a file in version control, not re-typing a chat message. The diff shows exactly what changed. The next run shows whether it helped. That discipline — iterate on the artifact, track the changes — is the same discipline that makes software maintainable over time. It turns out it works just as well on AI workflows.
The repo history knows what I built. That's not a small thing.
