Observability in AI Workflows: Exposing the Black Box
How DevSpark run artifacts, JSONL event logs, and telemetry make AI workflow debugging tractable — turning non-deterministic failures into diagnosable events.
DevSpark Series — 24 articles
- DevSpark: Constitution-Driven AI for Software Development
- Getting Started with DevSpark: Requirements Quality Matters
- DevSpark: Constitution-Based Pull Request Reviews
- Why I Built DevSpark
- Taking DevSpark to the Next Level
- From Oracle CASE to Spec-Driven AI Development
- Fork Management: Automating Upstream Integration
- DevSpark: The Evolution of AI-Assisted Software Development
- DevSpark: Months Later, Lessons Learned
- DevSpark in Practice: A NuGet Package Case Study
- DevSpark: From Fork to Framework — What the Commits Reveal
- DevSpark v0.1.0: Agent-Agnostic, Multi-User, and Built for Teams
- DevSpark Monorepo Support: Governing Multiple Apps in One Repository
- The DevSpark Tiered Prompt Model: Resolving Context at Scale
- A Governed Contribution Model for DevSpark Prompts
- Prompt Metadata: Enforcing the DevSpark Constitution
- Bring Your Own AI: DevSpark Unlocks Multi-Agent Collaboration
- Workflows as First-Class Artifacts: Defining Operations for AI
- Observability in AI Workflows: Exposing the Black Box
- Autonomy Guardrails: Bounding Agent Action Safely
- Dogfooding DevSpark: Building the Plane While Flying It
- Closing the Loop: Automating Feedback with Suggest-Improvement
- Designing the DevSpark CLI UX: Commands vs Prompts
- The Alias Layer: Masking Complexity in Agent Invocations
Traditional debugging works because systems are deterministic. Given input A, a system produces output B. When B is wrong, you trace backward through the execution — what state existed at each step, which branch was taken, which function returned the unexpected value. The determinism is what makes tracing meaningful.
AI agents break that assumption. A prompt I ran Monday might produce different output Thursday, not because anything changed in the code, but because sampling temperature introduced randomness, the context window assembled differently, or a subtle phrasing variation pulled the response in a different direction. When the output is wrong, "trace backward through the execution" means little when the execution is non-deterministic inference over a 100-billion-parameter model.
The answer isn't to accept opacity. It's to build observability that accounts for the non-determinism — capturing enough state around each AI interaction to reconstruct what actually happened, even when the AI's reasoning itself isn't inspectable.
What DevSpark Captures
Every Harness run in DevSpark v2.0.0+ produces a structured record under .documentation/devspark/runs/. Two artifacts per run:
run.json — The structured execution record. It captures the run ID, the spec file that was executed, the adapter configuration, each step's status (success, failure, skipped), the validation results for each step, and the artifact delta — which files were created, modified, or deleted. When a run fails, run.json records where in the step sequence it stopped and what validation rule it failed against.
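A run record along these lines might look like the following. The field names and layout here are illustrative assumptions for the sake of the example, not DevSpark's documented schema:

```json
{
  "run_id": "2024-06-01T10-00-00Z-a1b2",
  "spec": "workflows/release.spec.md",
  "adapter": {"name": "claude", "model": "example-model"},
  "steps": [
    {
      "name": "gather-context",
      "status": "success",
      "validation": {"rule": "outputs-exist", "passed": true},
      "artifact_delta": {"created": [], "modified": ["CHANGELOG.md"], "deleted": []}
    }
  ]
}
```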
JSONL event log — A timestamped entry for every action the runtime took during execution. Context gathering steps, prompt assembly, agent invocations, tool calls, validation checks — each one is a JSON line with the timestamp, the action type, the relevant metadata, and the outcome. The JSONL format makes it easy to grep, pipe into analysis tools, or aggregate across multiple runs.
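Because each entry is a self-contained JSON object on its own line, filtering the log programmatically takes only a few lines. A minimal sketch, assuming hypothetical event fields (`ts`, `type`, `outcome`) rather than DevSpark's actual schema:

```python
import json

# Hypothetical event lines -- field names are illustrative assumptions,
# not DevSpark's documented log schema.
events = [
    '{"ts": "2024-06-01T10:00:00Z", "type": "context_gather", "outcome": "ok"}',
    '{"ts": "2024-06-01T10:00:02Z", "type": "agent_invoke", "outcome": "ok"}',
    '{"ts": "2024-06-01T10:00:09Z", "type": "validation", "outcome": "failed"}',
]

def failures(lines):
    """Yield parsed events whose outcome is 'failed' -- one JSON object per line."""
    for line in lines:
        event = json.loads(line)
        if event.get("outcome") == "failed":
            yield event

print([e["type"] for e in failures(events)])  # -> ['validation']
```

The same pattern extends to aggregation across runs: concatenate the per-run logs and feed them through the same loop.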
The artifact delta tracking deserves particular attention. At the start of each step, the runtime takes a snapshot of the declared output files. At the end, it compares. The diff — created, modified, deleted — is recorded in run.json alongside the step result. This means I can look at a completed run and see exactly what each step produced, not just whether the step succeeded.
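The snapshot-and-compare approach can be sketched in a few lines. This is a hedged illustration of the technique, not DevSpark's implementation; the helper names (`snapshot`, `delta`) are my own:

```python
import hashlib
from pathlib import Path

def snapshot(paths):
    """Map each declared output path to its content hash, or None if absent."""
    result = {}
    for p in paths:
        p = Path(p)
        result[str(p)] = hashlib.sha256(p.read_bytes()).hexdigest() if p.exists() else None
    return result

def delta(before, after):
    """Classify each declared path as created, modified, or deleted.

    Both snapshots cover the same declared paths, so iterating `before`
    visits every path exactly once.
    """
    changes = {"created": [], "modified": [], "deleted": []}
    for path in before:
        b, a = before[path], after[path]
        if b is None and a is not None:
            changes["created"].append(path)
        elif b is not None and a is None:
            changes["deleted"].append(path)
        elif b != a:
            changes["modified"].append(path)
    return changes
```

Taking `snapshot` before the step and `delta` after it yields exactly the created/modified/deleted record described above, ready to store alongside the step result.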
From "The AI Did Something Wrong" to "Here's What Happened"
The practical shift is significant. Before this observability infrastructure existed, debugging a failed DevSpark run meant: re-running the command, paying close attention to the output, and trying to remember what the prompt said. This produced vague descriptions of failures that were difficult to act on.
With run artifacts, the debugging workflow changes:
```shell
devspark harness runs --last 5
```

Lists the five most recent runs with their status. Pick the failed one:
```shell
devspark harness runs show <run-id>
```

Shows the step-by-step execution record: which step failed, which validation rule it tripped, and the artifact delta up to the point of failure. If the failure is in the AI output itself — the agent generated code that doesn't match the project's patterns — the JSONL log includes what context was assembled into the prompt, which adapter was used, and what model version responded.
The question "where did it go wrong?" becomes answerable. Not always perfectly — the model's internal reasoning is still opaque — but the scaffolding around the AI call is fully visible.
Token and Cost Visibility
One concern I had before building this: AI API costs can spiral when workflows run frequently across a team. Without per-run token tracking, the first indication of a cost overrun is the billing statement.
The event log captures token counts per LLM invocation. Aggregated across a workflow, this gives per-run cost visibility. The retention configuration lets me set how many runs to keep before the archive truncates — a balance between diagnostic utility and storage growth. For workflows I run repeatedly, I can look at the token trend over time and catch if a prompt change inadvertently made the context much larger.
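Per-run aggregation is a small fold over the event log. Again a sketch under assumed field names (`run_id`, `tokens_in`, `tokens_out`), not DevSpark's actual schema:

```python
import json
from collections import defaultdict

# Assumed event fields -- illustrative only, not DevSpark's documented schema.
log_lines = [
    '{"run_id": "r1", "type": "agent_invoke", "tokens_in": 1200, "tokens_out": 300}',
    '{"run_id": "r1", "type": "agent_invoke", "tokens_in": 1450, "tokens_out": 280}',
    '{"run_id": "r2", "type": "agent_invoke", "tokens_in": 900, "tokens_out": 210}',
]

def tokens_per_run(lines):
    """Sum input and output tokens across all LLM invocations, keyed by run."""
    totals = defaultdict(int)
    for line in lines:
        event = json.loads(line)
        if event.get("type") == "agent_invoke":
            totals[event["run_id"]] += event.get("tokens_in", 0) + event.get("tokens_out", 0)
    return dict(totals)

print(tokens_per_run(log_lines))  # -> {'r1': 3230, 'r2': 1110}
```

Plotting those totals per run over time is what surfaces the "prompt change silently doubled the context" regression before the billing statement does.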
The security dimension matters too. The event log creates an audit trail of which files were read by which steps and which commands were executed. For workflows that touch sensitive paths or invoke deployment scripts, that audit trail is the difference between "we think the workflow only touched what it was supposed to" and "the run record shows exactly what was accessed."
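An audit pass over the same log is equally mechanical. This sketch assumes hypothetical event types (`file_read`, `command_exec`) and a hypothetical sensitive-path convention; none of these names come from DevSpark itself:

```python
import json

# Assumed event types and fields -- illustrative schema, not DevSpark's.
def audit(lines, sensitive_prefix="secrets/"):
    """Flag file reads under a sensitive path and collect executed commands."""
    flagged, commands = [], []
    for line in lines:
        event = json.loads(line)
        if event.get("type") == "file_read" and event.get("path", "").startswith(sensitive_prefix):
            flagged.append(event["path"])
        elif event.get("type") == "command_exec":
            commands.append(event.get("command"))
    return flagged, commands

log_lines = [
    '{"type": "file_read", "path": "src/main.py"}',
    '{"type": "file_read", "path": "secrets/api.key"}',
    '{"type": "command_exec", "command": "make deploy"}',
]
print(audit(log_lines))  # -> (['secrets/api.key'], ['make deploy'])
```

The point is not this particular report, but that the raw material for any such report already exists in the run record.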
Connecting Observability to Improvement
The suggest-improvement workflow described in the previous article draws directly on run artifacts. When I invoke it after a failure, it reads the most recent run record and pre-populates the improvement proposal with the execution context. The telemetry infrastructure is what makes that pre-population possible — without structured run artifacts, the proposal would have to reconstruct state from memory, which is how feedback loses its precision.
Observability is the foundation for the feedback loop. You can't improve what you can't see. And in AI systems, where the failure modes are subtle and context-sensitive, "seeing" means having structured, queryable records of what the system actually did — not just what you remember it doing.
The goal isn't full transparency into the model's reasoning. That's not achievable with current tools, and it may not be necessary. What is achievable — and what DevSpark's run artifacts provide — is full transparency into everything that surrounds the model: the context assembled, the validation applied, the files touched, the commands executed. That's enough to make most failures diagnosable and most improvements discoverable.
