Product Note: Runtime Explainability
Main use case
Incident explanation and execution replay for the data runtime itself.
The system should answer:
- what happened during a run,
- why a command succeeded or failed,
- how state changed,
- where retries, drift, or bottlenecks appeared.
Value proposition
Make the runtime explainable, not just runnable.
Instead of raw traces and scattered outputs, give one structured explanation of execution, failure cause, and state transition path.
Data and signals to store
- canonical event type: command.start, command.success, command.error
- run_id
- correlation_id or parent event link
- command name, adapter name, or source
- idempotency key
- payload hash
- scope, root_dir, and table
- attempt number
- retry classification
- latency_ms
- dry-run predicted effects
- actual effects
- decision reason or rejection reason
- before_state_hash and after_state_hash
- changed rows or changed entities summary
- error class and error text
- dependency node or upstream cause
- invariant check results: idempotency, single write path, event coverage
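The signals above could be carried in a single event record. The following is a minimal sketch, not the repository's actual schema: the class name RuntimeEvent, the helper hash_payload, and field names such as retry_class, changed_summary, and upstream_cause are illustrative assumptions; only the identifiers that appear in the list (run_id, correlation_id, root_dir, latency_ms, before_state_hash, after_state_hash, and the three event types) come from this note.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Optional

# Canonical event types named in the note.
EVENT_TYPES = ("command.start", "command.success", "command.error")


def hash_payload(payload: dict[str, Any]) -> str:
    """Hypothetical helper: hash a payload deterministically.

    Keys are sorted so that logically equal payloads hash the same.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RuntimeEvent:
    """Hypothetical record holding the signals listed above."""

    event_type: str
    run_id: str
    command: str                      # command name, adapter name, or source
    attempt: int = 1
    correlation_id: Optional[str] = None
    parent_event_id: Optional[str] = None
    idempotency_key: Optional[str] = None
    payload_hash: Optional[str] = None
    scope: Optional[str] = None
    root_dir: Optional[str] = None
    table: Optional[str] = None
    retry_class: Optional[str] = None
    latency_ms: Optional[float] = None
    predicted_effects: Optional[dict[str, Any]] = None   # from a dry run
    actual_effects: Optional[dict[str, Any]] = None
    decision_reason: Optional[str] = None
    before_state_hash: Optional[str] = None
    after_state_hash: Optional[str] = None
    changed_summary: Optional[str] = None
    error_class: Optional[str] = None
    error_text: Optional[str] = None
    upstream_cause: Optional[str] = None
    # Invariant name -> passed, e.g. {"idempotency": True}.
    invariants: dict[str, bool] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject anything outside the canonical event types.
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type!r}")
```

Hashing a canonical JSON form is one possible choice for the payload hash; it makes the hash independent of key order, which helps when comparing retried attempts of the same command.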
MVP feature for this repository
explain_run
An MCP tool that takes a run_id or a trace slice and returns:
- ordered execution timeline
- which operations ran
- where retries or errors happened
- what state changed
- why the final outcome happened
- whether runtime behavior matched expected design invariants
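A rough sketch of the logic behind explain_run, under stated assumptions: events arrive as plain dictionaries, each carries a monotonically increasing seq field for ordering (an assumption, not something this note specifies), and the return shape is illustrative. The MCP tool registration itself is left out, since this note does not describe it.

```python
from __future__ import annotations

from typing import Any


def explain_run(events: list[dict[str, Any]], run_id: str) -> dict[str, Any]:
    """Build a structured explanation for one run from its stored events."""
    # Keep only this run's events, ordered by the assumed "seq" field.
    run = sorted(
        (e for e in events if e.get("run_id") == run_id),
        key=lambda e: e["seq"],
    )

    timeline = [(e["seq"], e["event_type"], e["command"]) for e in run]

    # Distinct operations in first-seen order.
    operations: list[str] = []
    for e in run:
        if e["event_type"] == "command.start" and e["command"] not in operations:
            operations.append(e["command"])

    # A retry is a start event whose attempt number is above 1.
    retries = [
        e for e in run
        if e["event_type"] == "command.start" and e.get("attempt", 1) > 1
    ]
    errors = [e for e in run if e["event_type"] == "command.error"]

    # A state change is a success whose before/after hashes differ.
    state_changes = [
        {
            "command": e["command"],
            "before": e.get("before_state_hash"),
            "after": e.get("after_state_hash"),
        }
        for e in run
        if e["event_type"] == "command.success"
        and e.get("before_state_hash") != e.get("after_state_hash")
    ]

    # The final outcome is read from the last event of the run.
    if not run:
        outcome, reason = "unknown", "no events recorded for this run"
    elif run[-1]["event_type"] == "command.error":
        outcome, reason = "failed", run[-1].get("error_text")
    elif run[-1]["event_type"] == "command.success":
        outcome, reason = "succeeded", run[-1].get("decision_reason")
    else:
        outcome, reason = "incomplete", "last command never finished"

    # An invariant holds for the run only if every recorded check passed.
    invariants: dict[str, bool] = {}
    for e in run:
        for name, ok in e.get("invariants", {}).items():
            invariants[name] = invariants.get(name, True) and ok

    return {
        "run_id": run_id,
        "timeline": timeline,
        "operations": operations,
        "retries": retries,
        "errors": errors,
        "state_changes": state_changes,
        "outcome": outcome,
        "reason": reason,
        "invariants": invariants,
    }
```

Each key in the returned dictionary corresponds to one item in the list above, so a caller can render the explanation as text or inspect individual parts.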
Example outputs
- “Run X initialized engine, inserted row, rebuilt index, then failed in e2e because DuckDB execution path was unavailable.”
- “Dry run predicted one write and no retries; actual run produced three retries and one terminal error.”
- “Single write path preserved, event coverage complete, idempotency not violated.”
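The dry-run comparison in the second example could be produced by diffing predicted effects against actual effects. This is a small sketch under the assumption that effects are stored as counters keyed by name; the function names and the keys writes, retries, and terminal_errors are hypothetical.

```python
from __future__ import annotations


def diff_effects(
    predicted: dict[str, int], actual: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Return {effect: (predicted, actual)} for every effect that differs.

    An effect missing from one side is treated as a count of 0.
    """
    keys = sorted(set(predicted) | set(actual))
    return {
        k: (predicted.get(k, 0), actual.get(k, 0))
        for k in keys
        if predicted.get(k, 0) != actual.get(k, 0)
    }


def summarize_divergence(predicted: dict[str, int], actual: dict[str, int]) -> str:
    """Render the diff as a one-line explanation."""
    diff = diff_effects(predicted, actual)
    if not diff:
        return "Dry run matched the actual run."
    parts = [f"{name}: predicted {p}, actual {a}" for name, (p, a) in diff.items()]
    return "Dry run diverged from the actual run (" + "; ".join(parts) + ")."


print(summarize_divergence(
    {"writes": 1, "retries": 0},
    {"writes": 1, "retries": 3, "terminal_errors": 1},
))
```

Effects that match on both sides, such as writes here, are left out of the summary so the explanation only names what diverged.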