Diagnose, bisect, narrate
AI-native debugging on top of scroll-native data. The same substrate that makes replay deterministic makes diagnosis grounded — every hypothesis can be checked against the scroll, every fix can be verified by fork and replay. The reasoning loop closes.
Why AI for maintenance
Classical debugging tools answer "what did the code do?" Event-sourced systems answer that without help — the scroll is the answer. The harder question is why: why did the reconciler reject this quest proposal, why does this prompt produce this edge-case output, which ordering invariant is the new reactor silently violating.
Those questions are underspecified by design: they involve reasoning over causality, intent, and semantic content. That's LLM-shaped work. Scry is the service that packages that reasoning as a set of composable agents, each of which reads scrolls, produces verdicts, and writes its own session scroll so that its reasoning is itself inspectable.
The agents
Narrator
Reads a scroll and returns a plain-English summary plus 3–7 highlights. Used anywhere 'explain this' is needed — incident reports, session summaries, onboarding new engineers to an unfamiliar scroll. The first Scry agent to ship.
Diagnoser
Reads a failing scroll and its surrounding context. Produces a causal explanation: 'reactor X fired before gate Y because its cursor lagged by one tick; the race is reproducible by re-running with tighter timing.' Not just 'output diverged' — which event, which reactor, which invariant.
Bisector
Given a failing case and a target prompt, finds the minimum edit that makes the case pass without regressing others in the corpus. Search is LLM-guided, not brute-force: the bisector hypothesizes which parts of the prompt matter and tests the smallest changes first.
Conformance reviewer
Covered separately in the Conformance review page. Mentioned here for completeness — it's a peer of the diagnoser, sharing the same substrate and model tier.
Test generator
Covered in Tests from scrolls. Another peer. Reads a scroll and a target, and produces a Go test that reproduces it deterministically.
Each agent is a regular weave.Agent node. They run through the same runner every user agent runs through. Their session scrolls (scry:session:…) are addressable, forkable, and diffable. If a diagnoser produces a bad explanation, fork its session scroll, substitute its prompt, replay, see what changes. The maintenance service is its own first customer.
Narrate: the shipped primitive
The one agent that exists today. Given any scroll, the narrator returns a summary and a small set of highlights. Used as a context-loading shortcut: instead of asking Claude Code to read 300 events, ask Scry to narrate the scroll and hand Claude Code the 150-token summary.
$ scry narrate quest_signals:tavern-42 --json
{
  "summary": "The innkeeper turned a wandering bard's three tavern tales into five quest proposals, four of which were accepted. The fifth ('rescue the mayor's daughter from the old keep') was deferred pending clarification on whether the keep is the one north of town or the one at the cliffs.",
  "highlights": [
    "accepted: 'Slay the wyvern on the high road' backed by 2 sightings",
    "rejected: duplicate bandit-camp quest on turn 4",
    "deferred: 'rescue at the keep', pending disambiguation of which keep"
  ],
  "sessionId": "scry:session:0f2-..."
}
The narrator writes a session scroll recording the narrate_started / ai.request / ai.response / narrate_completed events. That session is itself narratable, diffable, and forkable, which is useful for debugging the narrator, benchmarking model changes, or replaying narrations deterministically.
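A consumer of the narrator's JSON output might decode it along the following lines. This is a minimal sketch: the JSON keys come from the sample output above, while the NarrateResult type and the ParseNarration and ContextBlock helpers are hypothetical names chosen for illustration, not part of Scry.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// NarrateResult mirrors the keys in the sample `scry narrate --json` output.
// The type name is an assumption; only the JSON keys come from the example.
type NarrateResult struct {
	Summary    string   `json:"summary"`
	Highlights []string `json:"highlights"`
	SessionID  string   `json:"sessionId"`
}

// ParseNarration decodes narrator output and rejects results with no summary.
func ParseNarration(raw []byte) (NarrateResult, error) {
	var r NarrateResult
	if err := json.Unmarshal(raw, &r); err != nil {
		return NarrateResult{}, fmt.Errorf("decode narration: %w", err)
	}
	if strings.TrimSpace(r.Summary) == "" {
		return NarrateResult{}, errors.New("narration has no summary")
	}
	return r, nil
}

// ContextBlock renders the summary followed by one "- " line per highlight,
// a compact form to hand to another tool in place of the raw events.
func ContextBlock(r NarrateResult) string {
	var b strings.Builder
	b.WriteString(r.Summary)
	for _, h := range r.Highlights {
		b.WriteString("\n- ")
		b.WriteString(h)
	}
	return b.String()
}

func main() {
	// Illustrative payload in the same shape as the sample output.
	raw := []byte(`{"summary":"Four of five quest proposals were accepted.","highlights":["deferred: rescue at the keep"],"sessionId":"scry:session:example"}`)
	r, err := ParseNarration(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(ContextBlock(r))
	fmt.Println("session:", r.SessionID)
}
```

Keeping the sessionId alongside the summary lets the caller go back to the narrator's own session scroll when a summary looks wrong.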
Diagnose: causal explanation of divergence
Replay produces a diff. Diff alone answers "what changed." Diagnose answers "why, and what do I do about it." Given a baseline scroll and a replayed scroll, the diagnoser:
- Aligns the two scrolls on their causal chain.
- Identifies the earliest divergence point — the first event where the two streams disagree.
- Reads the code paths that produced that event on each side.
- Proposes causal explanations, ordered by likelihood and grounded in specific evidence from the scrolls.
- Suggests a fix and estimates the blast radius.
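The second step, finding the earliest divergence point, can be sketched in a few lines. This is an illustration under simplifying assumptions: the Event type and its fields are hypothetical, the signal_received event type is invented for the example, and the two streams are aligned by position rather than by causal chain, which a real diagnoser would need to do first.

```go
package main

import "fmt"

// Event is a hypothetical, simplified scroll event. Real scroll events
// carry more fields; these three are enough to show the comparison.
type Event struct {
	Seq      int
	Type     string
	SignalID string
}

// EarliestDivergence returns the index of the first position at which the
// two streams disagree, or -1 if they are identical. If one stream is a
// strict prefix of the other, the divergence is where the shorter one ends.
func EarliestDivergence(baseline, candidate []Event) int {
	n := len(baseline)
	if len(candidate) < n {
		n = len(candidate)
	}
	for i := 0; i < n; i++ {
		if baseline[i] != candidate[i] {
			return i
		}
	}
	if len(baseline) != len(candidate) {
		return n
	}
	return -1
}

func main() {
	baseline := []Event{
		{Seq: 42, Type: "signal_received", SignalID: "s_9"},
		{Seq: 43, Type: "signal_accepted", SignalID: "s_9"},
	}
	candidate := []Event{
		{Seq: 42, Type: "signal_received", SignalID: "s_9"},
		{Seq: 43, Type: "validator_rejected", SignalID: "s_9"},
	}
	i := EarliestDivergence(baseline, candidate)
	fmt.Printf("diverged at index %d (sequence %d)\n", i, baseline[i].Seq)
}
```

Everything after the first disagreement is downstream of it, which is why the explanation starts from that event rather than from the final output.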
Divergence at sequence 43 (candidate vs baseline):
  baseline   signal_accepted     { signalId: s_9, artifact: quest_xyz }
  candidate  validator_rejected  { signalId: s_9, reason: "duplicate" }
Hypothesis (confidence 0.82):
  The new reconcile-quest prompt introduced a more
  aggressive dedup threshold. s_9 was marginally similar
  to s_4 (accepted at sequence 38); the new threshold now
  treats it as a duplicate.
Evidence:
  - Both quests mention a raiding party on the north road
  - Embedding similarity 0.86; old threshold 0.90,
    new threshold 0.83 (inferred from prompt diff)
  - s_4 was accepted in both runs
Suggested fix:
  Two options:
  (A) Narrow the prompt's definition of a duplicate to
      require literal text overlap, not just semantic
      similarity.
  (B) Leave as-is: s_9 is arguably a duplicate and
      this is the desired behavior. Confirm with the
      game master.
Blast radius:
  Corpus replay shows 3 more scrolls with similar pattern.
  2 of 3 changed the same way; likely the right call.
Bisect: find the minimum prompt edit
Prompt engineering today is guess-and-check. Change the wording, spin up a test, hope for the best. Bisect replaces the guesswork with a search.
Given a failing scroll, a target prompt, and an acceptance predicate, the bisector iteratively proposes edits and verifies them against a sandboxed replay. Its search is adaptive — it hypothesizes which parts of the prompt matter, tests the smallest relevant change first, and reports the minimum delta.
$ scry bisect \
    --failing quest_signals:tavern-42 \
    --prompt prompts/reconcile_quest.tmpl \
    --acceptance "signal_accepted for s_9 on scroll tavern-42" \
    --corpus ./corpus \
    --out bisect/
# Output:
#   bisect/minimum-edit.diff        the smallest change that fixes it
#   bisect/regression-impact.json   which corpus cases change
#   bisect/narration.md             narrator's explanation of the edit
#   bisect/session-scroll.jsonl     the scry session for this run
The output is a proposal, not a commit. Claude Code (or a human) reviews the diff, checks the regression impact, decides whether to apply. The decision is easier because the impact is measured, not estimated.
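The verify-smallest-first part of the search described above can be sketched as follows. This is a deliberately reduced illustration, not the bisector's algorithm: the candidate edits are supplied up front instead of being proposed adaptively by a model, the replay function is a stub standing in for a sandboxed replay, and the Edit, Verdict, and MinimumEdit names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Edit is a hypothetical candidate prompt change. Size measures how much of
// the prompt it touches (for example, the number of changed characters).
type Edit struct {
	Name string
	Size int
}

// Verdict is what a replay reports for one candidate edit.
type Verdict struct {
	Accepted    bool // the acceptance predicate holds on the failing scroll
	Regressions int  // corpus cases that got worse under the edit
}

// MinimumEdit tries candidates from smallest to largest and returns the
// first one that satisfies the predicate without regressing the corpus.
func MinimumEdit(candidates []Edit, replay func(Edit) Verdict) (Edit, bool) {
	// Copy so the caller's slice is not reordered.
	ordered := append([]Edit(nil), candidates...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].Size < ordered[j].Size
	})
	for _, e := range ordered {
		v := replay(e)
		if v.Accepted && v.Regressions == 0 {
			return e, true
		}
	}
	return Edit{}, false
}

func main() {
	candidates := []Edit{
		{Name: "rewrite dedup section", Size: 40},
		{Name: "reword one sentence", Size: 5},
		{Name: "adjust similarity wording", Size: 12},
	}
	// Stub replay: edits below size 12 do not fix the case, and the
	// largest edit fixes it but regresses one corpus case.
	replay := func(e Edit) Verdict {
		v := Verdict{Accepted: e.Size >= 12}
		if e.Size >= 40 {
			v.Regressions = 1
		}
		return v
	}
	if e, ok := MinimumEdit(candidates, replay); ok {
		fmt.Println("minimum edit:", e.Name)
	} else {
		fmt.Println("no candidate passed")
	}
}
```

Because each candidate costs a replay, ordering by size means the search can stop at the first passing edit instead of evaluating every proposal.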
Provider independence
Scry agents default to Gemini. That's a considered choice, not an accident.
- Context window. Scroll analysis often needs the whole scroll plus the reactor code plus the corpus. Gemini's long context window (1M to 2M tokens, depending on the model) fits this shape.
- Independence from production. If the production model and the analyst model both come from the same vendor (OpenAI, say), they share blind spots. Using a different model family for the analytical layer keeps the diagnostic lens epistemically independent.
- Thinking mode. Bisection and counterfactual analysis are deep-reasoning workloads; Gemini 2.5 Pro with thinking enabled is well-shaped for them.
- Economics. Conformance sweeps across a repo, nightly corpus regressions, and speculative eval runs all need to be cheap to run often; Gemini Flash's price/performance makes these cadences affordable.
Provider choice is configurable via the router (see internal/ai/router.go). Teams with existing vendor contracts can swap by registering providers and renaming models. This is the dogfood test for weave's abstraction: if swapping providers is hard, the abstraction has gaps, and Scry is the first thing to feel them.
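To make the swap concrete, here is one way a router of this kind could be shaped. It is a sketch under assumptions and does not reflect the contents of internal/ai/router.go: the Provider interface, the Router type, its Register and Complete methods, and the echoProvider stub are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Provider is a hypothetical minimal interface for a model backend.
type Provider interface {
	Complete(model, prompt string) (string, error)
}

// route pairs a provider with that provider's own model name.
type route struct {
	provider Provider
	model    string
}

// Router maps logical model names (what agents ask for) to routes.
type Router struct {
	routes map[string]route
}

// ErrUnknownModel is returned when no route is registered for a name.
var ErrUnknownModel = errors.New("router: unknown logical model")

func NewRouter() *Router {
	return &Router{routes: make(map[string]route)}
}

// Register binds a logical name to a provider and model, replacing any
// earlier binding for the same name.
func (r *Router) Register(logical string, p Provider, model string) {
	r.routes[logical] = route{provider: p, model: model}
}

// Complete resolves the logical name and forwards the prompt.
func (r *Router) Complete(logical, prompt string) (string, error) {
	rt, ok := r.routes[logical]
	if !ok {
		return "", fmt.Errorf("%w: %q", ErrUnknownModel, logical)
	}
	return rt.provider.Complete(rt.model, prompt)
}

// echoProvider is a stub backend that reports who handled the prompt.
type echoProvider struct{ name string }

func (e echoProvider) Complete(model, prompt string) (string, error) {
	return e.name + "/" + model + ": " + prompt, nil
}

func main() {
	r := NewRouter()
	r.Register("analyst", echoProvider{name: "vendor-a"}, "model-a")
	out, _ := r.Complete("analyst", "narrate tavern-42")
	fmt.Println(out)

	// Swapping vendors is a re-registration; callers keep using "analyst".
	r.Register("analyst", echoProvider{name: "vendor-b"}, "model-b")
	out, _ = r.Complete("analyst", "narrate tavern-42")
	fmt.Println(out)
}
```

In this shape, agents only ever name a logical model, so a provider swap touches registration code and nothing else.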
Why this closes the loop
Reasoning about an LLM system without an event log is fundamentally speculative: you don't know what the model saw, what it emitted, or which tools fired in what order. Adding an event log gives you the "what"; adding a replay engine gives you the "could have"; adding an analytical agent on top gives you the "why".
Those three layers together close the maintenance loop: the system records what happened, replay produces counterfactuals, and the analytical agent explains the gap. No single layer is sufficient alone. Together, they are what makes maintenance of agentic workflows tractable at production scale.
Status
Narrator ships today. Diagnose and bisect land as Scry v1+. The infrastructure for all three — router, scroll-first replay, session scrolls — is already in place.