
Your Agent Doesn't Need to See Every Step

Batching browser actions without flying blind

Part II of II • Part I: The Economics of "Chatty" Agents

In Part I, we talked about how agents can burn money and time by treating every click as a separate interaction with your browser infrastructure.

The fix was to change the unit of work:

From: "one action = one request"

To: "one plan (many actions) = one request"

We used Riddle's /v1/run endpoint with steps or script mode to show how a single job can run multiple actions and screenshots inside one session, instead of paying for a separate job per action.

This post is about the scary question that comes next:

When is it safe to batch? How do you keep your agent from flying blind, and what happens when a batched script fails halfway through?

The Default Pattern: The "Vision Loop"

Right now, the orthodoxy looks like this:

  1. Screenshot
  2. LLM thinks
  3. Click
  4. Repeat

Every tiny step gets visual verification. It's the safest and simplest loop—but also the slowest and most expensive.

The contrarian take is: your agent doesn't need to see every step.

Most web flows have long deterministic stretches. If you click "Submit" on a form, you usually don't need five screenshots to know if it worked. The URL, DOM, and network responses already tell you a lot.

So instead of:

Screenshot → think → click → screenshot → think → click…

You want:

  1. Screenshot once at the start
  2. Run a deterministic sequence "blind"
  3. Screenshot once at the end to verify

The trick is knowing where you can safely do that.

What Makes a Step "Batch-Safe"?

A sequence of actions is "batch-safe" when the world is predictable enough that you don't need constant LLM supervision.

Some rules of thumb:

1. The Path Is Deterministic

Good candidates:

  • Filling a known form with fixed fields
  • Clicking stable buttons/links with strong selectors
  • Opening a menu and choosing a specific item

Bad candidates:

  • Choosing options based on arbitrary page text
  • Navigating content that's heavily A/B-tested or personalized
  • Handling CAPTCHAs or login challenges

If the next step depends on free-form text the LLM has never seen, don't batch it. If it's "fill three fields and click Submit," you probably can.
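As a rough sketch, a batch-safe segment like that could be expressed as a single steps block (the selectors and values here are invented for illustration):

// Hypothetical batch-safe segment: fixed fields, strong selectors,
// and no LLM round-trips between actions.
const submitProfile = [
  { fill: { selector: "[data-test='first-name']", value: "Ada" } },
  { fill: { selector: "[data-test='last-name']", value: "Lovelace" } },
  { fill: { selector: "[data-test='email']", value: "ada@example.com" } },
  { click: "[data-test='submit-profile']" },
  { waitFor: "[data-test='profile-saved']", timeout: 10000 }
];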

2. Selectors Are Robust, Not Brittle

Batching assumes you know what you're interacting with.

Good signs:

  • Data attributes: data-test="login-button"
  • Stable IDs/classes agreed on with the product team
  • Explicit test hooks already used in QA

Red flags:

  • Matching on raw button text that changes ("Continue" → "Next")
  • Very long CSS selectors that encode layout details
  • "Nth child of nth child" vibes

The more fragile the selector, the more you'll want intermediate checkpoints.
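In Playwright terms (the selectors below are made up), the difference looks like this:

// Brittle: breaks when copy changes ("Continue" → "Next") or layout shifts
page.getByText("Continue");
page.locator("div.main > div:nth-child(3) > button");

// Robust: an explicit test hook agreed on with the product team
page.locator("[data-test='continue-button']");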

3. Success Is Observable Without Screenshots

If you can confirm success from state, you don't need constant vision.

Examples of good signals:

  • URL: e.g. /dashboard vs /login
  • DOM: presence of a [data-test='dashboard-root'] element
  • Network: a 200 OK from /api/dashboard with a known JSON shape

When you can say "this step worked if X is true," you can safely bake that into the script.
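Here's a minimal sketch of what that looks like with raw Playwright, assuming you have a page object; the URL, selector, and endpoint are placeholders:

import type { Page } from "playwright";

// Confirm success from state, not pixels (all targets are placeholders).
async function dashboardLoaded(page: Page): Promise<boolean> {
  await page.waitForURL("**/dashboard", { timeout: 15000 });      // URL signal
  await page.waitForSelector("[data-test='dashboard-root']");     // DOM signal
  const res = await page.waitForResponse(r =>
    r.url().includes("/api/dashboard") && r.status() === 200);    // network signal
  const data = await res.json();                                  // known JSON shape
  return data != null && typeof data === "object";
}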

4. The Blast Radius of Failure Is Small

Even with good heuristics, stuff breaks.

Batch more aggressively when:

  • You're in test or staging environments
  • Actions are idempotent or easily reversible
  • The worst case is "we have to try again," not "we bought 10,000 widgets"

In high-risk flows (checkout, irreversible mutations), you can still batch—but you'll want more frequent checkpoints and tighter checks.

Heuristics for "I Need to Look Again"

You don't want a screenshot after every click, but you do want the script to know when it's time to wake the LLM up.

Here are practical "wake-up" heuristics:

Unexpected Navigation

  • URL changes to /login, /error, /blocked, or another "danger" pattern
  • Domain or origin changes (unusual redirects)

DOM Divergence

  • A required selector doesn't appear within a timeout
  • An error banner shows up ([role='alert'], .error-message)

Network Anomalies

  • A key request returns 4xx/5xx
  • The API returns an error shape instead of the expected payload

Timing Weirdness

  • Spinner never disappears
  • Page transition takes significantly longer than normal

When any of those fire, the right move is:

  1. Take a fresh screenshot
  2. Capture context (URL, key network responses, relevant DOM snippets)
  3. Stop the script and return control to the agent

The LLM gets rich evidence right when the world stops being predictable.
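A minimal sketch of that hand-off, again assuming raw Playwright access (the helper name and return shape are illustrative, not any documented API):

import type { Page } from "playwright";

// When a wake-up heuristic fires: one fresh screenshot plus cheap context,
// then stop and let the agent decide what to do next.
async function gatherWakeUpEvidence(page: Page, reason: string) {
  return {
    reason,                                                    // which heuristic fired
    url: page.url(),                                           // where we actually ended up
    errors: await page.locator("[role='alert'], .error-message").allTextContents(),
    screenshot: await page.screenshot(),                       // fresh visual evidence
  };
}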

A Concrete Pattern: Checks and Early Aborts

Let's turn this into something you can actually use.

We'll extend the basic step types with checks—small blocks that assert "the world looks how we expect" and decide whether to keep going.

// Step types for structured scripts
type Step =
  | { goto: string }
  | { screenshot: string }
  | { fill: { selector: string; value: string } }
  | { click: string }
  | { waitFor: string; timeout?: number }
  | { check: Check[]; onFail?: ("screenshot" | "abort")[] }
  | { eval: string };  // escape hatch for raw Playwright

type Check =
  | { urlIncludes: string }
  | { urlExcludes: string }
  | { selectorExists: string; timeout?: number }
  | { selectorMissing: string }
  | { noHttpErrors: string };  // URL pattern to watch

Example: "Log in and open a report"

Here's a script that only screenshots at the boundaries, but has a rich mid-batch check:

const steps = [
  // 1. Go to login & capture initial state
  { goto: "https://app.example.com/login" },
  { screenshot: "login-page" },

  // 2. Fill login form and submit (no extra screenshots)
  { fill: { selector: "input[name='email']", value: "demo@example.com" } },
  { fill: { selector: "input[name='password']", value: "secret" } },
  { click: "button[type='submit']" },

  // 3. Wait for dashboard
  { waitFor: "[data-test='dashboard-root']", timeout: 15000 },

  // 4. CHECK: did we actually land on the dashboard?
  {
    check: [
      { urlIncludes: "/dashboard" },
      { selectorExists: "[data-test='dashboard-root']" },
      { noHttpErrors: "/api/dashboard" }
    ],
    onFail: ["screenshot", "abort"]
  },

  // 5. If checks passed, continue "blind"
  { click: "[data-test='open-reports']" },
  { waitFor: "[data-test='report-row']", timeout: 10000 },

  // 6. Final screenshot
  { screenshot: "dashboard-reports" }
];

const response = await fetch("https://api.riddledc.com/v1/run", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({ steps })
});

const result = await response.json();

if (result.status === "failed") {
  console.error("Failed at:", result.failedStep);
  console.error("Reason:", result.reason);
  // result.screenshots includes the failure screenshot
} else {
  console.log("Success:", result.screenshots);
}

What This Buys You

  • Only two screenshots: login-page at the start, dashboard-reports at the end
  • Middle is "blind but deterministic": fill, click, wait, click, wait—all inside the browser, not round-tripping through the LLM
  • Checks keep you honest: if you hit /login instead of /dashboard, or the dashboard root never appears, or /api/dashboard returns 500, the script takes a screenshot, aborts, and returns a structured failure

From an agent's point of view, this is gold: a rich chunk of evidence at the exact moment things went sideways, not 20 tiny screenshots that all look the same.

How the Agent Should Think About Batched Scripts

A simple mental model for your LLM:

  1. Plan in segments, not microscopic steps
    • "Segment 1: log in"
    • "Segment 2: open report"
    • "Segment 3: export CSV"
  2. For each segment, ask:
    • Is this path deterministic?
    • Do I have strong selectors and simple success checks?
    • What are reasonable failure heuristics?
  3. Emit a script for the whole segment, with:
    • One screenshot at the start
    • One screenshot at the end
    • Optional checks in the middle that can abort and screenshot on failure
  4. On failure, use the returned context to:
    • Diagnose why the checks failed
    • Adjust selectors, expectations, or the overall strategy
    • Try again, or fall back to a more cautious, vision-heavy approach

You're using expensive LLM tokens where the world is uncertain and cheap browser time where it isn't.
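Sketched as code, that segment loop might look like this. Here runSteps (a wrapper around the /v1/run call shown earlier) and diagnoseAndRetry (your vision-heavy fallback) are hypothetical names, and the result shape is assumed from the example above:

// Hypothetical glue code around one batched segment.
type RunResult = {
  status: string;          // "failed" when a check aborts (other values assumed)
  failedStep?: unknown;
  reason?: string;
  screenshots?: unknown;   // includes the failure screenshot on abort
};

declare function runSteps(steps: Step[]): Promise<RunResult>;               // wraps POST /v1/run
declare function diagnoseAndRetry(evidence: RunResult): Promise<RunResult>; // cautious fallback

async function runSegment(steps: Step[]): Promise<RunResult> {
  const result = await runSteps(steps);            // one batched call, blind in the middle
  if (result.status !== "failed") return result;   // checks passed: cheap and done

  // Checks failed: hand the evidence to the LLM and fall back to a more
  // cautious, screenshot-per-action approach for just this segment.
  return diagnoseAndRetry(result);
}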

Putting It Together

Combined, the two posts give you a simple playbook:

  • Part I (Economics): Stop paying per click. Think in sessions. Pack as much work as you can into a single API call. Time is cheap after the first 30 seconds.
  • Part II (Safety & Design): Your agent doesn't need to see every step. Batch deterministic segments, add lightweight checks, and wake the LLM only when something looks off.

If you're building browser agents and you're tired of babysitting every mouse move, try structuring your next workflow as:

One screenshot → one plan → one API call → one rich result.

Let the browser handle the boring, deterministic work—so your agent can save its intelligence for the parts where the world really is uncertain.

Ready to Build Smarter Agents?