Sandboxed replay
Fork/substitute/replay handles the logical isolation — the origin scroll is never mutated. But a workflow can call real tools that write to real databases, send real emails, charge real cards. Sandboxed replay handles the side-effect isolation too: spawn an isolated scroll-server and runner, stub the externals, run the workflow, tear down.
Why logical isolation isn't enough
A scroll is immutable. A fork is a new scroll. Substituting an event on a fork produces a new derived event stream without touching the origin. All of that is clean.
What isn't clean: the runner, while replaying the fork, might call a tool handler. That handler is a live http.Handler: it reads from a database, writes to a database, hits a payment provider, emits a webhook. Replaying the fork with that handler pointed at production is not replay; it's re-executing with mutated inputs against the same production systems. The fork is isolated; the side effects are not.
Sandboxing closes that gap. The sandboxed runner has its own scroll-server, its own tool-handler wiring, and its own rate-limit ceilings. It runs the workflow end-to-end. Production is untouched.
Spawn model
Spawning a sandbox is itself a weave workflow inside Scry. No special infrastructure: the scroll-server and runner already support multi-tenant deployment; Scry just orchestrates "start these processes, wire them to this backend, hand me a handle" using idiomatic weave code.
$ scry sandbox spawn \
--corpus prod-tavern-100 \
--tool-mode capture-and-replay \
--ttl 30m
Sandbox ready: sbx-a7f2
scroll-server: 127.0.0.1:40001 (ephemeral)
runner: 127.0.0.1:40002 (ephemeral)
corpus: 100 scrolls mirrored
tool mode: capture-and-replay (defaults per tool)
ttl: 30m
$ scry replay-corpus ./prod-tavern-100 \
--sandbox sbx-a7f2 \
--against HEAD \
--out results/
$ scry sandbox teardown sbx-a7f2
aggregated 100 results to scry:analysis:...
torn down.
Everything about the sandbox (its scroll-server, its runner, its tool handlers) is a weave service surface. Spawning, mirroring, configuring, running, aggregating, tearing down are all scroll events on Scry's own session scroll. Replay the Scry session and you replay the orchestration itself.
Tool stubbing modes
The critical isolation boundary is at the tool handler. The sandbox offers four modes, chosen per tool at spawn time.
Capture-and-replay stubs
The default. The first sandbox run captures the tool handler's response for a given input; subsequent runs replay the captured value. Deterministic across repeats.
Custom stub handlers
Register an http.Handler that produces fixture responses. Used when the baseline capture is wrong (a known bug in the original handler) or when you want to test specific edge cases.
Pass-through with quarantine
Let the tool call go through to a staging version of the external system. Useful for integration testing. Side effects land on a quarantined namespace that can be cleaned up after the sandbox tears down.
Hard fail
Any tool call raises an error — the sandbox refuses to run a workflow that would have touched production. Useful as a safety check for code paths you believe shouldn't call externals at all.
Most sandboxed replays use capture-and-replay for most tools, hard-fail for dangerous ones (payment, email), and custom stubs for the tools under active development. The combination gives deterministic behavior under safe defaults, with controlled escape hatches where needed.
Use cases
Shadow eval at scale
Replay a prompt change against 500 production scrolls. Each replay runs inside a sandbox so it doesn't compete with production runner capacity or touch production tool handlers. Aggregate results back into Scry's analysis scroll, tear down.
Incident replay with a candidate fix
A production incident scroll references tool calls that mutate external systems (write a database row, send an email, charge a card). Replaying as-is would re-fire the side effects. The sandbox replaces the tool handlers with deterministic stubs so the workflow runs end-to-end, safely.
Architecture migration validator
The port from in-process weave to the service-based ecosystem is the canonical use. Same input scroll, two pipelines — old (in-process) and new (service-based) — run side by side in a sandbox. Diff the outputs. Divergences surface before cutover traffic. The scenario below walks through the tavern quest-board variant.
Periodic regression sweep
Nightly background workflow. Pull recent production scrolls, replay each against HEAD inside a sandbox, diff against recorded outcomes, file issues for divergences. Runs on its own isolated infrastructure so it competes with nothing.
Experimental reactor prototyping
Claude Code sketches a new reactor and wants to see how it behaves on real-shaped traffic. Spawn a sandbox, mirror a corpus into it, register the new reactor, replay, observe. Nothing leaks into the production substrate.
The canonical worked example: the in-process to service port
The highest-value use of the sandbox is the architecture migration validator. The shape:
- The tavern's quest-board pipeline runs in-process — scroll library embedded in the service binary, reactors as goroutines, tool handlers as ambient subscribers.
- A candidate refactor moves each piece to its service-based form — networked scroll-server, reactors as consumer-group subscribers, runner as its own service.
- Both pipelines must produce identical outputs on identical inputs for every case in the production corpus. Divergences either mean bugs in the new code or surface invariants the old code was silently violating.
- Scry spawns a sandbox with both pipelines wired up. For each scroll in the corpus: replay through the in-process pipeline, replay through the service pipeline, diff the outputs.
- The report: per-scroll verdict (match / diverge / error), root cause for each divergence (diagnosed by Scry's diagnoser agent), and an aggregate confidence score for the refactor.
Without the sandbox, this validation is impractical: running both pipelines against production corpora requires either mutating production or building a bespoke test harness. The sandbox makes it a one-command operation, isolated by construction, resumable, and inspectable after the fact.
What sandboxing preserves and what it doesn't
Preserved. Every scroll operation (append, read, subscribe, fold, project) behaves identically in the sandbox. The runner's scroll-first replay works. Reactors register normally. Tool dispatch works via the sandbox's scroll-server. The workflow doesn't know it's sandboxed.
Not preserved. External latencies: if a production tool handler takes 400ms, the sandboxed stub may return in microseconds. Absolute clock-time coupling: anything that depends on wall-clock time or per-request entropy. Cross-tenant noise: sandboxed runs see only their own corpus. For fidelity-sensitive experiments, the stubs can be configured to introduce jitter and latency; for most uses, the speed-up is a feature, not a bug.
Status
Sandboxed replay is a v2 Scry capability. It depends on the core scroll-server fork/substitute surface, plus operational polish around spinning up scroll-server and runner pairs. The pieces exist; the orchestration does not.