Why logical isolation isn't enough

A scroll is immutable. A fork is a new scroll. Substituting an event on a fork produces a new derived event stream without touching the origin. All of that is clean.

What isn't clean: the runner, while replaying the fork, might call a tool handler. That handler is a live http.Handler — it reads from a database, writes to a database, hits a payment provider, emits a webhook. Replaying the fork with that handler pointed at production is not replay; it's re-executing with mutated inputs against the same production systems. The fork is isolated; the side effects are not.

Sandboxing closes that gap. The sandboxed runner has its own scroll-server, its own tool-handler wiring, and its own rate-limit ceilings. It runs the workflow end-to-end. Production is untouched.

Spawn model

Spawning a sandbox is itself a weave workflow inside Scry. No special infrastructure: the scroll-server and runner already support multi-tenant deployment; Scry just orchestrates "start these processes, wire them to this backend, hand me a handle" using idiomatic weave code.

$ scry sandbox spawn \
    --corpus prod-tavern-100 \
    --tool-mode capture-and-replay \
    --ttl 30m

Sandbox ready: sbx-a7f2
  scroll-server:  127.0.0.1:40001  (ephemeral)
  runner:         127.0.0.1:40002  (ephemeral)
  corpus:         100 scrolls mirrored
  tool mode:      capture-and-replay (defaults per tool)
  ttl:            30m

$ scry replay-corpus ./prod-tavern-100 \
    --sandbox sbx-a7f2 \
    --against HEAD \
    --out results/

$ scry sandbox teardown sbx-a7f2
  aggregated 100 results to scry:analysis:...
  torn down.

Everything about the sandbox (its scroll-server, its runner, its tool handlers) is a weave service surface. Spawning, mirroring, configuring, running, aggregating, and tearing down are all scroll events on Scry's own session scroll. Replay the Scry session and you replay the orchestration itself.

Tool stubbing modes

The critical isolation boundary is at the tool handler. The sandbox offers four modes, chosen per tool at spawn time.

Capture-and-replay stubs

The default. The first sandbox run captures the tool handler's response for a given input; subsequent runs replay the captured value. Deterministic across repeats.

Custom stub handlers

Register an http.Handler that produces fixture responses. Used when the baseline capture is wrong (a known bug in the original handler) or when you want to test specific edge cases.

Pass-through with quarantine

Let the tool call go through to a staging version of the external system. Useful for integration testing. Side effects land in a quarantined namespace that can be cleaned up after the sandbox tears down.

Hard fail

Any tool call raises an error — the sandbox refuses to run a workflow that would have touched production. Useful as a safety check for code paths you believe shouldn't call externals at all.

Most sandboxed replays use capture-and-replay for most tools, hard-fail for dangerous ones (payment, email), and custom stubs for the tools under active development. The combination gives deterministic behavior under safe defaults, with controlled escape hatches where needed.
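
The combination described above amounts to a default mode plus per-tool overrides. A small sketch of that lookup follows; the Policy type and the tool names are hypothetical, while the mode names are the four listed in this document.

```go
package main

import "fmt"

// Mode is one of the four tool stubbing modes.
type Mode string

const (
	CaptureAndReplay Mode = "capture-and-replay"
	Custom           Mode = "custom"
	PassThrough      Mode = "pass-through-with-quarantine"
	HardFail         Mode = "hard-fail"
)

// Policy maps tool names to modes. Tools that are not listed get Default.
// Hypothetical type, shown only to illustrate per-tool selection.
type Policy struct {
	Default Mode
	PerTool map[string]Mode
}

// ModeFor returns the override for a tool if one exists, else the default.
func (p Policy) ModeFor(tool string) Mode {
	if m, ok := p.PerTool[tool]; ok {
		return m
	}
	return p.Default
}

func main() {
	policy := Policy{
		Default: CaptureAndReplay,
		PerTool: map[string]Mode{
			"payment.charge": HardFail, // dangerous: never allowed to run
			"email.send":     HardFail,
			"quest.rank":     Custom, // tool under active development
		},
	}
	for _, tool := range []string{"payment.charge", "email.send", "quest.rank", "inventory.lookup"} {
		fmt.Printf("%-18s %s\n", tool, policy.ModeFor(tool))
	}
}
```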

Use cases

Shadow eval at scale

Replay a prompt change against 500 production scrolls. Each replay runs inside a sandbox so it doesn't compete with production runner capacity or touch production tool handlers. Aggregate the results back into Scry's analysis scroll, then tear down.

Incident replay with a candidate fix

A production incident scroll references tool calls that mutate external systems (write a database row, send an email, charge a card). Replaying as-is would re-fire the side effects. The sandbox replaces the tool handlers with deterministic stubs so the workflow runs end-to-end, safely.

Architecture migration validator

The port from in-process weave to the service-based ecosystem is the canonical use case. The same input scroll runs through two pipelines, old (in-process) and new (service-based), side by side in a sandbox. Diff the outputs. Divergences surface before any traffic is cut over. The scenario below walks through the tavern quest-board variant.

Periodic regression sweep

Nightly background workflow. Pull recent production scrolls, replay each against HEAD inside a sandbox, diff against recorded outcomes, file issues for divergences. Runs on its own isolated infrastructure so it competes with nothing.

Experimental reactor prototyping

Claude Code sketches a new reactor and wants to see how it behaves on real-shaped traffic. Spawn a sandbox, mirror a corpus into it, register the new reactor, replay, observe. Nothing leaks into the production substrate.

The canonical worked example: the in-process to service port

The highest-value use of the sandbox is the architecture migration validator. The shape:

1. The tavern's quest-board pipeline runs in-process: scroll library embedded in the service binary, reactors as goroutines, tool handlers as ambient subscribers.
2. A candidate refactor moves each piece to its service-based form: networked scroll-server, reactors as consumer-group subscribers, runner as its own service.
3. Both pipelines must produce identical outputs on identical inputs for every case in the production corpus. Divergences either mean bugs in the new code or surface invariants the old code was silently violating.
4. Scry spawns a sandbox with both pipelines wired up. For each scroll in the corpus: replay through the in-process pipeline, replay through the service pipeline, diff the outputs.
5. The report: per-scroll verdict (match / diverge / error), root cause for each divergence (diagnosed by Scry's diagnoser agent), and an aggregate confidence score for the refactor.

Without the sandbox, this validation is impractical: running both pipelines against production corpora requires either mutating production or building a bespoke test harness. The sandbox makes it a one-command operation, isolated by construction, resumable, and inspectable after the fact.

What sandboxing preserves and what it doesn't

Preserved. Every scroll operation (append, read, subscribe, fold, project) behaves identically in the sandbox. The runner's scroll-first replay works. Reactors register normally. Tool dispatch works via the sandbox's scroll-server. The workflow doesn't know it's sandboxed.

Not preserved. External latencies: if a production tool handler takes 400ms, the sandboxed stub may return in microseconds. Absolute clock-time coupling: anything that depends on wall-clock time or per-request entropy. Cross-tenant noise: sandboxed runs see only their own corpus. For fidelity-sensitive experiments, the stubs can be configured to introduce latency and jitter; for most uses, the speed-up is a feature, not a bug.

Status

Sandboxed replay is a v2 Scry capability. It depends on the core scroll-server fork/substitute surface plus enough operational polish around spinning up scroll-server + runner pairs. The pieces exist; the orchestration does not.

scry sandbox spawn (isolated scroll + runner)

designed

Boots a new scroll-server instance and a runner service wired to it. Returns a handle. Torn down when the handle closes. Lifecycle is itself a weave workflow inside Scry.

Corpus mirroring

designed

Replicate scrolls from the production scroll-server into the sandbox's scroll-server. Lineage metadata carries forward so Scry knows which sandbox scroll corresponds to which production scroll.

Tool handler stubbing

designed

Four modes: capture-and-replay, custom, pass-through-with-quarantine, hard-fail. Chosen per-tool at spawn time.

Resource limits & backpressure

designed

Sandbox runs with explicit CPU, memory, and LLM-quota ceilings. Never competes with production for provider rate limits.

Teardown + result aggregation

designed

On teardown, relevant results (diffs, metrics, narrated summaries) are written to Scry's analysis scroll on the main substrate. The sandbox's scrolls can be discarded or archived.

Multi-tenant sandboxing

designed

Ten analysts can each hold their own sandbox without collision. Sandboxes are addressable; handoff is a matter of sharing the sandbox ID.