Observability in AI Workflows: Exposing the Black Box
How DevSpark run artifacts, JSONL event logs, and telemetry make AI workflow debugging tractable — turning non-deterministic failures into diagnosable events.
DevSpark Series — 24 articles
- DevSpark: Constitution-Driven AI for Software Development
- Getting Started with DevSpark: Requirements Quality Matters
- DevSpark: Constitution-Based Pull Request Reviews
- Why I Built DevSpark
- Taking DevSpark to the Next Level
- From Oracle CASE to Spec-Driven AI Development
- Fork Management: Automating Upstream Integration
- DevSpark: The Evolution of AI-Assisted Software Development
- DevSpark: Months Later, Lessons Learned
- DevSpark in Practice: A NuGet Package Case Study
- DevSpark: From Fork to Framework — What the Commits Reveal
- DevSpark v0.1.0: Agent-Agnostic, Multi-User, and Built for Teams
- DevSpark Monorepo Support: Governing Multiple Apps in One Repository
- The DevSpark Tiered Prompt Model: Resolving Context at Scale
- A Governed Contribution Model for DevSpark Prompts
- Prompt Metadata: Enforcing the DevSpark Constitution
- Bring Your Own AI: DevSpark Unlocks Multi-Agent Collaboration
- Workflows as First-Class Artifacts: Defining Operations for AI
- Observability in AI Workflows: Exposing the Black Box
- Autonomy Guardrails: Bounding Agent Action Safely
- Dogfooding DevSpark: Building the Plane While Flying It
- Closing the Loop: Automating Feedback with Suggest-Improvement
- Designing the DevSpark CLI UX: Commands vs Prompts
- The Alias Layer: Masking Complexity in Agent Invocations
Traditional debugging works because systems are deterministic. Given input A, a system produces output B. When B is wrong, you trace backward through the execution — what state existed at each step, which branch was taken, which function returned the unexpected value. The determinism is what makes tracing meaningful.
AI agents break that assumption. A prompt I ran Monday might produce different output Thursday, not because anything changed in the code, but because sampling temperature introduced randomness, the context window assembled differently, or a subtle phrasing variation pulled the response in a different direction. When the output is wrong, "trace backward through the execution" means little when the execution is non-deterministic inference over a 100-billion-parameter model.
The answer isn't to accept opacity. It's to build observability that accounts for the non-determinism — capturing enough state around each AI interaction to reconstruct what actually happened, even when the AI's reasoning itself isn't inspectable.
What DevSpark Captures
Every Harness run in DevSpark v2.0.0+ produces a structured record under .documentation/devspark/runs/. Two artifacts per run:
run.json — The structured execution record. It captures the run ID, the spec file that was executed, the adapter configuration, each step's status (success, failure, skipped), the validation results for each step, and the artifact delta — which files were created, modified, or deleted. When a run fails, run.json records where in the step sequence it stopped and what validation rule it failed against.
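A run record along these lines might look like the following. The field names and layout here are illustrative assumptions for the sake of the example, not DevSpark's documented schema:

```json
{
  "run_id": "2024-06-01T10-00-00Z-a1b2",
  "spec": "workflows/release.spec.md",
  "adapter": {"name": "claude", "model": "example-model"},
  "steps": [
    {
      "name": "gather-context",
      "status": "success",
      "validation": {"rule": "outputs-exist", "passed": true},
      "artifact_delta": {"created": [], "modified": ["CHANGELOG.md"], "deleted": []}
    }
  ]
}
```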
JSONL event log — A timestamped entry for every action the runtime took during execution. Context gathering steps, prompt assembly, agent invocations, tool calls, validation checks — each one is a JSON line with the timestamp, the action type, the relevant metadata, and the outcome. The JSONL format makes it easy to grep, pipe into analysis tools, or aggregate across multiple runs.
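Because each entry is a self-contained JSON object on its own line, filtering the log programmatically takes only a few lines. A minimal sketch, assuming hypothetical event fields (`ts`, `type`, `outcome`) rather than DevSpark's actual schema:

```python
import json

# Hypothetical event lines -- field names are illustrative assumptions,
# not DevSpark's documented log schema.
events = [
    '{"ts": "2024-06-01T10:00:00Z", "type": "context_gather", "outcome": "ok"}',
    '{"ts": "2024-06-01T10:00:02Z", "type": "agent_invoke", "outcome": "ok"}',
    '{"ts": "2024-06-01T10:00:09Z", "type": "validation", "outcome": "failed"}',
]

def failures(lines):
    """Yield parsed events whose outcome is 'failed' -- one JSON object per line."""
    for line in lines:
        event = json.loads(line)
        if event.get("outcome") == "failed":
            yield event

print([e["type"] for e in failures(events)])  # -> ['validation']
```

The same pattern extends to aggregation across runs: concatenate the per-run logs and feed them through the same loop.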
The artifact delta tracking deserves particular attention. At the start of each step, the runtime takes a snapshot of the declared output files. At the end, it compares. The diff — created, modified, deleted — is recorded in run.json alongside the step result. This means I can look at a completed run and see exactly what each step produced, not just whether the step succeeded.
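The snapshot-and-compare approach can be sketched in a few lines. This is a hedged illustration of the technique, not DevSpark's implementation; the helper names (`snapshot`, `delta`) are my own:

```python
import hashlib
from pathlib import Path

def snapshot(paths):
    """Map each declared output path to its content hash, or None if absent."""
    result = {}
    for p in paths:
        p = Path(p)
        result[str(p)] = hashlib.sha256(p.read_bytes()).hexdigest() if p.exists() else None
    return result

def delta(before, after):
    """Classify each declared path as created, modified, or deleted.

    Both snapshots cover the same declared paths, so iterating `before`
    visits every path exactly once.
    """
    changes = {"created": [], "modified": [], "deleted": []}
    for path in before:
        b, a = before[path], after[path]
        if b is None and a is not None:
            changes["created"].append(path)
        elif b is not None and a is None:
            changes["deleted"].append(path)
        elif b != a:
            changes["modified"].append(path)
    return changes
```

Taking `snapshot` before the step and `delta` after it yields exactly the created/modified/deleted record described above, ready to store alongside the step result.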
From "The AI Did Something Wrong" to "Here's What Happened"
The practical shift is significant. Before this observability infrastructure existed, debugging a failed DevSpark run meant: re-running the command, paying close attention to the output, and trying to remember what the prompt said. This produced vague descriptions of failures that were difficult to act on.
With run artifacts, the debugging workflow changes:
```shell
devspark harness runs --last 5
```

Lists the five most recent runs with their status. Pick the failed one:
```shell
devspark harness runs show <run-id>
```

Shows the step-by-step execution record: which step failed, which validation rule it tripped, and the artifact delta up to the point of failure. If the failure is in the AI output itself — the agent generated code that doesn't match the project's patterns — the JSONL log includes what context was assembled into the prompt, which adapter was used, and what model version responded.
The question "where did it go wrong?" becomes answerable. Not always perfectly — the model's internal reasoning is still opaque — but the scaffolding around the AI call is fully visible.
Token and Cost Visibility
One concern I had before building this: AI API costs can spiral when workflows run frequently across a team. Without per-run token tracking, the first indication of a cost overrun is the billing statement.
The event log captures token counts per LLM invocation. Aggregated across a workflow, this gives per-run cost visibility. The retention configuration lets me set how many runs to keep before the archive truncates — a balance between diagnostic utility and storage growth. For workflows I run repeatedly, I can look at the token trend over time and catch if a prompt change inadvertently made the context much larger.
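Per-run aggregation is a small fold over the event log. Again a sketch under assumed field names (`run_id`, `tokens_in`, `tokens_out`), not DevSpark's actual schema:

```python
import json
from collections import defaultdict

# Assumed event fields -- illustrative only, not DevSpark's documented schema.
log_lines = [
    '{"run_id": "r1", "type": "agent_invoke", "tokens_in": 1200, "tokens_out": 300}',
    '{"run_id": "r1", "type": "agent_invoke", "tokens_in": 1450, "tokens_out": 280}',
    '{"run_id": "r2", "type": "agent_invoke", "tokens_in": 900, "tokens_out": 210}',
]

def tokens_per_run(lines):
    """Sum input and output tokens across all LLM invocations, keyed by run."""
    totals = defaultdict(int)
    for line in lines:
        event = json.loads(line)
        if event.get("type") == "agent_invoke":
            totals[event["run_id"]] += event.get("tokens_in", 0) + event.get("tokens_out", 0)
    return dict(totals)

print(tokens_per_run(log_lines))  # -> {'r1': 3230, 'r2': 1110}
```

Plotting those totals per run over time is what surfaces the "prompt change silently doubled the context" regression before the billing statement does.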
The security dimension matters too. The event log creates an audit trail of which files were read by which steps and which commands were executed. For workflows that touch sensitive paths or invoke deployment scripts, that audit trail is the difference between "we think the workflow only touched what it was supposed to" and "the run record shows exactly what was accessed."
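An audit pass over the same log is equally mechanical. This sketch assumes hypothetical event types (`file_read`, `command_exec`) and a hypothetical sensitive-path convention; none of these names come from DevSpark itself:

```python
import json

# Assumed event types and fields -- illustrative schema, not DevSpark's.
def audit(lines, sensitive_prefix="secrets/"):
    """Flag file reads under a sensitive path and collect executed commands."""
    flagged, commands = [], []
    for line in lines:
        event = json.loads(line)
        if event.get("type") == "file_read" and event.get("path", "").startswith(sensitive_prefix):
            flagged.append(event["path"])
        elif event.get("type") == "command_exec":
            commands.append(event.get("command"))
    return flagged, commands

log_lines = [
    '{"type": "file_read", "path": "src/main.py"}',
    '{"type": "file_read", "path": "secrets/api.key"}',
    '{"type": "command_exec", "command": "make deploy"}',
]
print(audit(log_lines))  # -> (['secrets/api.key'], ['make deploy'])
```

The point is not this particular report, but that the raw material for any such report already exists in the run record.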
Connecting Observability to Improvement
The suggest-improvement workflow described in the previous article draws directly on run artifacts. When I invoke it after a failure, it reads the most recent run record and pre-populates the improvement proposal with the execution context. The telemetry infrastructure is what makes that pre-population possible — without structured run artifacts, the proposal would have to reconstruct state from memory, which is how feedback loses its precision.
Observability is the foundation for the feedback loop. You can't improve what you can't see. And in AI systems, where the failure modes are subtle and context-sensitive, "seeing" means having structured, queryable records of what the system actually did — not just what you remember it doing.
The goal isn't full transparency into the model's reasoning. That's not achievable with current tools, and it may not be necessary. What is achievable — and what DevSpark's run artifacts provide — is full transparency into everything that surrounds the model: the context assembled, the validation applied, the files touched, the commands executed. That's enough to make most failures diagnosable and most improvements discoverable.
