Why AI for maintenance

Classical debugging tools answer "what did the code do?" Event-sourced systems answer that without help — the scroll is the answer. The harder question is why: why did the reconciler reject this quest proposal, why does this prompt produce this edge-case output, which ordering invariant is the new reactor silently violating.

Those questions are underspecified by design: they involve reasoning over causality, intent, and semantic content. That's LLM-shaped work. Scry is the service that packages that reasoning as a set of composable agents, each of which reads scrolls, produces verdicts, and writes its own session scroll so that its reasoning is itself inspectable.

The agents

Narrator

nano · Gemini 2.5 Flash · implemented

Reads a scroll and returns a plain-English summary plus 3–7 highlights. Used anywhere 'explain this' is needed — incident reports, session summaries, onboarding new engineers to an unfamiliar scroll. The first Scry agent to ship.

Diagnoser

mini · Gemini 2.5 Pro · designed

Reads a failing scroll and its surrounding context. Produces a causal explanation: 'reactor X fired before gate Y because its cursor lagged by one tick; the race is reproducible by re-running with tighter timing.' Not just 'output diverged' — which event, which reactor, which invariant.

Bisector

full · Gemini 2.5 Pro + thinking · designed

Given a failing case and the prompt to edit, finds the minimum edit that makes the case pass without regressing others in the corpus. Search is LLM-guided, not brute-force: the bisector hypothesizes which parts of the prompt matter and tests the smallest ones first.

Conformance reviewer

mini · Gemini 2.5 Pro · designed

Covered separately in the Conformance review page. Mentioned here for completeness — it's a peer of the diagnoser, sharing the same substrate and model tier.

Test generator

mini · Gemini 2.5 Pro · designed

Covered in Tests from scrolls. Another peer. Reads a scroll and a target, and produces a Go test that reproduces it deterministically.

Each agent is a regular weave.Agent node. They run through the same runner every user agent runs through. Their session scrolls (scry:session:…) are addressable, forkable, and diffable. If a diagnoser produces a bad explanation, fork its session scroll, substitute its prompt, replay, see what changes. The maintenance service is its own first customer.

Narrate: the shipped primitive

The one agent that exists today. Given any scroll, the narrator returns a summary and a small set of highlights. Used as a context-loading shortcut: instead of asking Claude Code to read 300 events, ask Scry to narrate the scroll and hand Claude Code the 150-token summary.

$ scry narrate quest_signals:tavern-42 --json

{
  "summary": "The innkeeper turned a wandering bard's three tavern tales into five quest proposals, four of which were accepted. The fifth ('rescue the mayor's daughter from the old keep') was deferred pending clarification on whether the keep is the one north of town or the one at the cliffs.",
  "highlights": [
    "accepted: 'Slay the wyvern on the high road' backed by 2 sightings",
    "rejected: duplicate bandit-camp quest on turn 4",
    "deferred: 'rescue at the keep' — which keep awaiting disambiguation"
  ],
  "sessionId": "scry:session:0f2-..."
}

The narrator writes a session scroll recording the narrate_started / ai.request / ai.response / narrate_completed events. That session is itself narratable, diffable, forkable — useful for debugging the narrator, benchmarking model changes, or replaying narrations deterministically.
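A caller that consumes the --json output might decode it along these lines. This is a minimal sketch, not the narrator's own code: the Narration type and parseNarration helper are hypothetical names, the field names are taken from the sample output above, and the session ID in the example is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Narration mirrors the three fields in the sample `scry narrate --json`
// output. The type name is hypothetical, not Scry's actual Go type.
type Narration struct {
	Summary    string   `json:"summary"`
	Highlights []string `json:"highlights"`
	SessionID  string   `json:"sessionId"`
}

// parseNarration decodes one narrate result and rejects an empty summary,
// since the summary is the part callers hand on as context.
func parseNarration(raw []byte) (Narration, error) {
	var n Narration
	if err := json.Unmarshal(raw, &n); err != nil {
		return Narration{}, fmt.Errorf("decode narration: %w", err)
	}
	if n.Summary == "" {
		return Narration{}, fmt.Errorf("narration has no summary")
	}
	return n, nil
}

func main() {
	// Shortened stand-in for real output; the session ID is a placeholder.
	raw := []byte(`{
	  "summary": "Three tavern tales became five quest proposals; four were accepted.",
	  "highlights": ["accepted: wyvern quest", "deferred: rescue at the keep"],
	  "sessionId": "scry:session:example"
	}`)

	n, err := parseNarration(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(n.Summary)
	for _, h := range n.Highlights {
		fmt.Println(" -", h)
	}
	fmt.Println("session:", n.SessionID)
}
```

Keeping the session ID alongside the summary lets the caller go back to the session scroll later if the narration itself needs debugging.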

Diagnose: causal explanation of divergence

Replay produces a diff. Diff alone answers "what changed." Diagnose answers "why, and what do I do about it." Given a baseline scroll and a replayed scroll, the diagnoser:

1. Aligns the two scrolls on their causal chain.
2. Identifies the earliest divergence point: the first event where the two streams disagree.
3. Reads the code paths that produced that event on each side.
4. Proposes causal explanations, ordered by likelihood and grounded in specific evidence from the scrolls.
5. Suggests a fix and estimates the blast radius.

Divergence at sequence 43 (candidate vs baseline):

  baseline  signal_accepted    { signalId: s_9, artifact: quest_xyz }
  candidate validator_rejected { signalId: s_9, reason: "duplicate" }

Hypothesis (confidence 0.82):
  The new reconcile-quest prompt introduced a more
  aggressive dedup threshold. s_9 was marginally similar
  to s_4 (accepted at sequence 38); the lower threshold
  now treats it as a duplicate.

Evidence:
  - Both quests mention a raiding party on the north road
  - Embedding similarity 0.86; old threshold 0.90,
    new threshold 0.83 (inferred from prompt diff)
  - s_4 was accepted in both runs

Suggested fix:
  Two options:
    (A) Narrow the dedup rule in the prompt to require
        literal text overlap, not just semantic similarity.
    (B) Leave as-is — s_9 is arguably a duplicate and
        this is the desired behavior. Confirm with the
        game master.

Blast radius:
  Corpus replay shows 3 more scrolls with a similar pattern.
  2 of 3 changed the same way; likely the right call.
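The "earliest divergence point" step can be illustrated with a small sketch. It makes a simplifying assumption: the two scrolls are already aligned position by position, whereas the design above aligns them on their causal chain. The Event type, its fields, and the signal_received event are hypothetical stand-ins for illustration.

```go
package main

import "fmt"

// Event is a hypothetical, simplified scroll event.
type Event struct {
	Seq     int
	Type    string
	Payload map[string]string
}

// samePayload compares two payloads key by key; nil and empty maps are equal.
func samePayload(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// firstDivergence returns the index of the first position where the two
// streams disagree on type or payload. If one stream is a strict prefix of
// the other, it returns the length of the shorter one. It returns -1 when
// the streams are identical.
func firstDivergence(baseline, candidate []Event) int {
	n := len(baseline)
	if len(candidate) < n {
		n = len(candidate)
	}
	for i := 0; i < n; i++ {
		if baseline[i].Type != candidate[i].Type ||
			!samePayload(baseline[i].Payload, candidate[i].Payload) {
			return i
		}
	}
	if len(baseline) != len(candidate) {
		return n
	}
	return -1
}

func main() {
	// Sequence 43 mirrors the sample report; sequence 42 is invented.
	baseline := []Event{
		{Seq: 42, Type: "signal_received", Payload: map[string]string{"signalId": "s_9"}},
		{Seq: 43, Type: "signal_accepted", Payload: map[string]string{"signalId": "s_9", "artifact": "quest_xyz"}},
	}
	candidate := []Event{
		{Seq: 42, Type: "signal_received", Payload: map[string]string{"signalId": "s_9"}},
		{Seq: 43, Type: "validator_rejected", Payload: map[string]string{"signalId": "s_9", "reason": "duplicate"}},
	}

	if i := firstDivergence(baseline, candidate); i >= 0 && i < len(baseline) {
		fmt.Printf("divergence at sequence %d: baseline %s, candidate %s\n",
			baseline[i].Seq, baseline[i].Type, candidate[i].Type)
	}
}
```

Everything after this point (reading code paths, ranking hypotheses, estimating blast radius) is the model-driven part and is not covered by the sketch.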

Bisect: find the minimum prompt edit

Prompt engineering today is guess-and-check. Change the wording, spin up a test, hope for the best. Bisect replaces the guesswork with a search.

Given a failing scroll, a target prompt, and an acceptance predicate, the bisector iteratively proposes edits and verifies them against a sandboxed replay. Its search is adaptive — it hypothesizes which parts of the prompt matter, tests the smallest relevant change first, and reports the minimum delta.

$ scry bisect \
    --failing quest_signals:tavern-42 \
    --prompt prompts/reconcile_quest.tmpl \
    --acceptance "signal_accepted for s_9 on scroll tavern-42" \
    --corpus ./corpus \
    --out bisect/

# Output:
#   bisect/minimum-edit.diff       — the smallest change that fixes it
#   bisect/regression-impact.json  — which corpus cases change
#   bisect/narration.md            — narrator's explanation of the edit
#   bisect/session-scroll.jsonl    — the scry session for this run

The output is a proposal, not a commit. Claude Code (or a human) reviews the diff, checks the regression impact, decides whether to apply. The decision is easier because the impact is measured, not estimated.
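The shape of the search can be sketched as a smallest-first loop over candidate edits. This is an illustration under stated assumptions, not the bisector's design: in Scry the candidates would be proposed by the model, while here they are a fixed list, and Edit, Verdict, minimumEdit, and the replay callback are all hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Edit is a hypothetical candidate prompt change. Size is a stand-in for
// how much of the prompt it touches (for example, characters changed).
type Edit struct {
	Name string
	Size int
}

// Verdict is the hypothetical result of replaying the failing scroll and
// the corpus with one edit applied.
type Verdict struct {
	FixesFailing bool // the acceptance predicate holds on the failing scroll
	Regressions  int  // corpus cases that changed for the worse
}

// minimumEdit tries candidates smallest-first and returns the first one
// that fixes the failing case without regressing the corpus.
func minimumEdit(candidates []Edit, replay func(Edit) Verdict) (Edit, bool) {
	// Copy before sorting so the caller's slice is left untouched.
	ordered := append([]Edit(nil), candidates...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].Size < ordered[j].Size
	})
	for _, e := range ordered {
		v := replay(e)
		if v.FixesFailing && v.Regressions == 0 {
			return e, true
		}
	}
	return Edit{}, false
}

func main() {
	candidates := []Edit{
		{Name: "rewrite dedup section", Size: 240},
		{Name: "reword one sentence", Size: 35},
		{Name: "change threshold wording", Size: 12},
	}

	// Stub replay: in this made-up run the smallest edit does not fix the
	// case, and the largest one fixes it but regresses the corpus.
	replay := func(e Edit) Verdict {
		switch e.Name {
		case "reword one sentence":
			return Verdict{FixesFailing: true}
		case "rewrite dedup section":
			return Verdict{FixesFailing: true, Regressions: 2}
		default:
			return Verdict{}
		}
	}

	if e, ok := minimumEdit(candidates, replay); ok {
		fmt.Printf("minimum edit: %s (size %d)\n", e.Name, e.Size)
	} else {
		fmt.Println("no candidate fixes the case without regressions")
	}
}
```

Because every candidate goes through the same replay callback, the regression count attached to the winning edit is a measurement rather than an estimate.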

Provider independence

Scry agents default to Gemini. That's a considered choice, not an accident.

Context window. Scroll analysis often needs the whole scroll plus the reactor code plus the corpus. The Gemini 2.5 models accept roughly 1M input tokens, which fits this shape.
Independence from production. If the production model and the analyst both come from OpenAI, they are likely to share blind spots. Using a different family for the analytical layer keeps the diagnostic lens epistemically independent.
Thinking mode. Bisection and counterfactual analysis are deep-reasoning workloads; Gemini 2.5 Pro with thinking enabled is well-shaped for them.
Economics. Conformance sweeps across a repo, nightly corpus regressions, speculative eval runs — Flash's price/performance makes these cadences affordable.

Provider choice is configurable via the router (see internal/ai/router.go). Teams with existing vendor contracts can swap by registering providers and renaming models. This is the dogfood test for weave's abstraction: if swapping providers is hard, the abstraction has gaps, and Scry is the first thing to feel them.
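Routing by model prefix can be pictured with a small sketch. It is not the API of internal/ai/router.go, whose contents are not shown here: the Router type, its Register and Route methods, and the prefixes used in the example are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Provider names a backend. The values below are illustrative.
type Provider string

// Router maps model-name prefixes to providers. Hypothetical sketch only.
type Router struct {
	prefixes map[string]Provider
}

// NewRouter returns an empty router.
func NewRouter() *Router {
	return &Router{prefixes: make(map[string]Provider)}
}

// Register associates a model-name prefix with a provider.
func (r *Router) Register(prefix string, p Provider) {
	r.prefixes[prefix] = p
}

// Route returns the provider whose registered prefix is the longest match
// for the model name. An empty prefix never matches.
func (r *Router) Route(model string) (Provider, error) {
	best := ""
	for prefix := range r.prefixes {
		if strings.HasPrefix(model, prefix) && len(prefix) > len(best) {
			best = prefix
		}
	}
	if best == "" {
		return "", fmt.Errorf("no provider registered for model %q", model)
	}
	return r.prefixes[best], nil
}

func main() {
	r := NewRouter()
	// Assumed prefixes, chosen to match common model naming.
	r.Register("gemini-", "gemini")
	r.Register("gpt-", "openai")
	r.Register("claude-", "anthropic")

	for _, model := range []string{"gemini-2.5-flash", "gpt-4o", "mystery-1"} {
		p, err := r.Route(model)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", model, p)
	}
}
```

Under this shape, swapping the analytical layer to another vendor is a matter of registering a prefix and changing the model names the agents request.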

Why this closes the loop

Reasoning about an LLM system without an event log is fundamentally speculative: you don't know what the model saw, what it emitted, or which tools fired in what order. Adding an event log gives you "what happened"; adding a replay engine gives you "what could have happened"; adding an analytical agent on top gives you "why".

Those three layers together close the maintenance loop: the system records what happened, replay reproduces counterfactuals, and the analytical agent explains the gap. No single layer is sufficient alone. Together, they are what makes maintenance of agentic workflows tractable at production scale.

Status

Narrator ships today. Diagnose and bisect land as Scry v1+. The infrastructure for all three — router, scroll-first replay, session scrolls — is already in place.

Narrator (scry narrate)

implemented

internal/scry/narrator.go. Writes a scry:session scroll per run (narrate_started, ai.request, ai.response, narrate_completed). Session scroll is itself narratable, diffable, forkable.

Diagnose on replay divergence

designed

Inputs: baseline scroll, replayed scroll, code. Output: structured divergence record + causal explanation + suggested fix + confidence score.

Prompt bisection

designed

Inputs: failing scroll, prompt file, acceptance predicate. Output: minimum edit that makes the case pass without regressing the corpus. Search strategy is adaptive.

Multi-provider independence

implemented

internal/ai/router.go routes by model prefix across OpenAI, Gemini, Anthropic. Scry agents default to Gemini so analysis doesn't share blind spots with the production model.

Interactive diagnose session

designed

Conversational mode: ask follow-up questions about a diagnosis, drill into specific events, request alternative hypotheses. Each session is its own scroll — replayable later.

Cross-scroll pattern detection

designed

'Which production scrolls show symptoms of the bug we just fixed?' Runs a diagnostic query across a corpus, returns matching scrolls ranked by similarity.