How to Give Your AI Agent Eyes
Your AI agent is brilliant at reasoning. It can write code, analyze data, and plan complex tasks. But ask it to check if a button is visible on a webpage? It's blind.
This is the vision gap in AI agents—and it's easier to close than you think.
The Problem: Agents Can't See
Most AI agent frameworks (LangChain, CrewAI, AutoGPT, Browser-Use) hit the same wall: the agent needs to see a webpage to interact with it.
The typical solution? Spin up a headless browser. But then you're dealing with:
- Chrome eating 500MB+ of RAM per instance
- Cold starts killing your Lambda functions
- Puppeteer version conflicts breaking your CI
- Memory leaks crashing your agent mid-task
You didn't sign up to become a browser infrastructure engineer. You just wanted your agent to see a webpage.
The Solution: Screenshots as a Service
What if your agent could just... ask for a screenshot?
import requests
import base64
response = requests.post(
    "https://api.riddledc.com/v1/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"url": "https://example.com"}
)

# PNG bytes returned directly - base64 encode for vision LLM
screenshot_b64 = base64.b64encode(response.content).decode()
# Feed to GPT-4V, Claude, etc.

That's it. No browser to manage. No Chrome to update. No memory to debug. Sync by default—send a request, get a screenshot back.
The screenshot comes back in 2-3 seconds. Your agent feeds it to a vision model. The vision model tells the agent what it sees. The agent decides what to do next.
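To hand those PNG bytes to a vision model, most chat APIs want the image wrapped in a base64 data URL inside the message. A minimal sketch, assuming an OpenAI-style content-parts message shape (Claude's shape differs slightly; check your provider's exact schema):

```python
import base64

def png_to_data_url(png_bytes: bytes) -> str:
    """Wrap raw PNG bytes as a base64 data URL a vision model can consume."""
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode()

def build_vision_message(question: str, png_bytes: bytes) -> dict:
    """One user message pairing a text prompt with the screenshot."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": png_to_data_url(png_bytes)}},
        ],
    }
```

The same data URL works for multiple questions about the same page, so capture once and ask as many times as you need.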
Cost: from $0.004 per job (30s minimum). Batch multiple screenshots into one job to get below $0.001 each.
The Vision Loop
Here's the pattern that works: capture a screenshot, pass it to a vision model, act on what the model reports, then capture again to verify.
This is how humans browse the web. Look, think, act, look again. Your agent can do the same thing.
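The loop itself is a few lines of glue. A sketch, where `see`, `think`, and `act` are placeholders for your screenshot call, your vision model, and your action executor:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Thought:
    done: bool                    # has the goal been reached?
    action: Optional[str] = None  # what to do next, if not done

def vision_loop(
    see: Callable[[str], bytes],        # e.g. the screenshot API
    think: Callable[[bytes], Thought],  # e.g. a vision LLM call
    act: Callable[[str], str],          # performs the action, returns the next URL
    url: str,
    max_steps: int = 10,
) -> bool:
    """Look, think, act, look again, until the model says the goal is met."""
    for _ in range(max_steps):
        image = see(url)           # look
        thought = think(image)     # think
        if thought.done:
            return True
        url = act(thought.action)  # act, possibly navigating somewhere new
    return False                   # gave up after max_steps
```

The `max_steps` cap matters: a confused model can loop forever, and each iteration costs a screenshot plus a vision call.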
Example: Finding a "Sign Up" Button
# Step 1: See the page
screenshot = take_screenshot("https://example.com")
# Step 2: Ask vision model what's there
response = vision_model.analyze(
image=screenshot,
prompt="Find the Sign Up button. What does it say and where is it located?"
)
# Step 3: Vision model responds
# "There's a blue 'Sign Up Free' button in the top right corner of the page"
# Step 4: Agent acts on this information
click_element(".signup-btn")
# Step 5: Verify it worked
screenshot = take_screenshot("https://example.com/signup")
# "Now showing a registration form with email and password fields"The agent never parses HTML. Never fights with selectors. It just looks at the page like a human would.
When Screenshots Aren't Enough
Sometimes you need to do more than look. You need to:
- Fill out a form
- Click through a multi-step wizard
- Log in before capturing a dashboard
- Scroll to load dynamic content
For these, you need browser automation. But you still don't need to run Chrome yourself.
Option 1: Steps Mode (JSON, great for LLMs)
response = requests.post(
    "https://api.riddledc.com/v1/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "steps": [
            {"goto": "https://app.example.com/login"},
            {"fill": {"selector": "#email", "value": "user@example.com"}},
            {"fill": {"selector": "#password", "value": "secret"}},
            {"click": "button[type=submit]"},
            {"waitFor": ".dashboard"},
            {"screenshot": "logged-in-dashboard"}
        ]
    }
)

Steps mode is ideal for AI agents—structured JSON that's easy to generate and validate.
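Because the payload is plain JSON, an agent can sanity-check an LLM-generated plan before spending a job on it. A minimal validator sketch; the action names are just the ones from the example above, not an exhaustive list of what the API accepts:

```python
# Step names seen in the example above (the real API may accept more).
ALLOWED = {"goto", "fill", "click", "waitFor", "screenshot"}

def validate_steps(steps):
    """Return a list of problems; an empty list means the plan looks well-formed."""
    problems = []
    for i, step in enumerate(steps):
        if not isinstance(step, dict) or len(step) != 1:
            problems.append(f"step {i}: expected a single-key object")
            continue
        (name, arg), = step.items()
        if name not in ALLOWED:
            problems.append(f"step {i}: unknown action {name!r}")
        if name == "fill" and not (isinstance(arg, dict) and {"selector", "value"} <= arg.keys()):
            problems.append(f"step {i}: fill needs selector and value")
    return problems
```

Rejecting a malformed plan locally is free; submitting one burns a billable job.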
Option 2: Script Mode (full Playwright control)
script = """
await page.goto("https://app.example.com/login");
await page.fill("#email", "user@example.com");
await page.fill("#password", "secret");
await page.click("button[type=submit]");
// Wait for dashboard to load
await page.waitForSelector(".dashboard");
await saveScreenshot("logged-in-dashboard");
"""
response = requests.post(
    "https://api.riddledc.com/v1/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"script": script}
)

Script mode gives you the full Playwright API when you need loops, conditionals, or complex logic.
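Here's the kind of flow steps mode can't express: looping through paginated results and screenshotting each page. A sketch; `saveScreenshot` is the helper used above, and the pagination selectors are made up for illustration:

```python
# Loop through up to three result pages, capturing each one. All the
# screenshots share a single job, so they share a single bill.
script = """
await page.goto("https://example.com/results?page=1");
for (let i = 1; i <= 3; i++) {
  await page.waitForSelector(".results");
  await saveScreenshot(`results-page-${i}`);
  const next = await page.$("a.next");
  if (!next) break;  // no next link: last page reached early
  await next.click();
}
"""

# POST this body to /v1/run exactly as in the example above.
payload = {"script": script}
```

The early `break` is the point of script mode: the loop adapts to however many pages actually exist, which a fixed steps array cannot do.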
The Debugging Superpower: console.json
Here's something I discovered while building agents with this API: logging is everything.
When your automation fails, you need to know where it failed. Did the page load? Did the button exist? Did the click work?
Every job captures all browser console output by default. Every console.log(), console.error(), and JavaScript error—automatically saved to console.json.
Even simple screenshot jobs capture this. The job ID comes back in the X-Job-Id response header, so you can always fetch the logs later:
# Your request returns PNG bytes with X-Job-Id header
curl -D - "https://api.riddledc.com/v1/run" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{"url": "https://example.com"}' -o screenshot.png
# X-Job-Id: job_abc123
# Something look wrong? Check the logs:
curl "https://api.riddledc.com/v1/jobs/job_abc123/artifacts" \
-H "Authorization: Bearer YOUR_API_KEY"
# console.json shows all browser console output, JS errors, network failures

For script mode, add your own logging to trace execution:
console.log("Step 1: Navigating to login page");
await page.goto(url);
console.log("Step 2: Looking for email field");
const emailField = await page.$("#email");
console.log("Email field found:", emailField ? "yes" : "no");
console.log("Step 3: Filling form");
await page.fill("#email", email);
console.log("Done!");

When something breaks, console.json tells you exactly what happened:
{
  "entries": {
    "log": [
      {"message": "Step 1: Navigating to login page"},
      {"message": "Step 2: Looking for email field"},
      {"message": "Email field found: no"}
    ],
    "error": [
      {"message": "Uncaught exception: Element not found: #email"}
    ]
  }
}

The email field wasn't there. Maybe the selector changed. Maybe you're on the wrong page. Either way, you're not guessing—you have the trail.
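An agent can make this diagnosis itself. A small helper sketch that turns the artifact into a one-line status, assuming the console.json structure shown above:

```python
def summarize_console(console: dict) -> str:
    """Condense a console.json payload into a status line an agent can reason over."""
    entries = console.get("entries", {})
    errors = entries.get("error", [])
    if errors:
        logs = entries.get("log", [])
        trail = logs[-1]["message"] if logs else "(no logs)"
        # Report the last log line reached plus the first error after it.
        return f"FAILED after '{trail}': {errors[0]['message']}"
    return "OK: no console errors"
```

Feeding that one line back into the agent's next prompt is usually enough for it to pick a fix, like trying a different selector.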
Zooming In: When Full-Page Screenshots Aren't Enough
Full-page screenshots are great for understanding layout. But sometimes you need to read small text or analyze a specific component.
The solution: element-focused screenshots.
await page.goto("https://example.com");
// Find the element you care about
const form = await page.$("form.checkout");
// Option 1: Direct element screenshot (simplest)
await form.screenshot({path: "checkout-form.png"});
// Option 2: With padding for context
const box = await form.boundingBox();
await page.screenshot({
  path: "form-with-context.png",
  clip: {
    x: box.x - 20,
    y: box.y - 20,
    width: box.width + 40,
    height: box.height + 40
  }
});

This is like zooming in with a camera. Instead of a 1920x3000 pixel full-page image where the form is tiny, you get a focused 400x300 image where every field is clearly readable.
Your vision model will thank you.
Authentication Without the Pain
The most common agent task? Accessing authenticated pages.
Dashboards, admin panels, user accounts—they're all behind login. And login flows are slow, fragile, and expensive (burning 10-30 seconds of compute each time).
The better approach: inject authentication directly.
Session Cookies
{
  "url": "https://app.example.com/dashboard",
  "options": {
    "cookies": [
      {"name": "session", "value": "abc123", "domain": "app.example.com"}
    ]
  }
}

Bearer Tokens
{
  "url": "https://api.example.com/admin",
  "options": {
    "headers": {
      "Authorization": "Bearer eyJhbG..."
    }
  }
}

localStorage (for SPAs)
{
  "url": "https://spa.example.com",
  "options": {
    "localStorage": {
      "authToken": "eyJhbG..."
    }
  }
}

Your agent captures the auth tokens once (via a login flow), then reuses them for every subsequent request. No more waiting for login pages to load.
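One way to wire that up is a tiny cache that remembers the cookie and builds request bodies for you. A sketch; how you first obtain the cookie value (a real login flow, a secrets store) is up to you:

```python
import time

class AuthCache:
    """Remember a session cookie and inject it into screenshot request bodies."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._cookie = None
        self._fetched_at = 0.0

    def set_cookie(self, name: str, value: str, domain: str):
        """Store a cookie captured from a one-time login flow."""
        self._cookie = {"name": name, "value": value, "domain": domain}
        self._fetched_at = time.time()

    def payload(self, url: str) -> dict:
        """Build a /v1/run body that reuses the cached cookie."""
        if self._cookie is None or time.time() - self._fetched_at > self.ttl:
            raise RuntimeError("cookie missing or expired; run the login flow once")
        return {"url": url, "options": {"cookies": [self._cookie]}}
```

The TTL check forces a re-login before the session actually expires server-side, so your agent never wastes a job on a screenshot of a login page.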
Cost Math That Actually Works
Let's talk money.
Single job: from $0.004 (30-second minimum billing at $0.50/hour)
But here's the trick: That 30-second minimum covers multiple screenshots in the same job.
1 screenshot/job = ~$0.004 → $0.004 each
5 screenshots/job = ~$0.004 → $0.0008 each
10 screenshots/job = ~$0.004 → $0.0004 each

Batch mode: Send multiple URLs, get multiple screenshots, pay once.
response = requests.post(
    "https://api.riddledc.com/v1/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"urls": [url1, url2, url3, url4, url5]}
)
# 5 screenshots for one job (~$0.004 total)

Steps/script with multiple captures: Navigate through a flow, screenshot at each step.
// Steps mode (JSON)
{"steps": [
  {"goto": "https://example.com"},
  {"screenshot": "step1"},
  {"click": ".next"},
  {"screenshot": "step2"},
  {"click": ".next"},
  {"screenshot": "step3"}
]}

// Or script mode (Playwright)
{"script": "await page.goto(url); await saveScreenshot('step1'); ..."}

At scale:
- 100 screenshots/day = ~$4/month (with batching)
- 1,000 screenshots/day = ~$30/month
Your agent can look at a lot of pages for very little money.
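The batching table above is simple arithmetic. A sketch using the $0.50/hour rate and 30-second minimum quoted earlier:

```python
RATE_PER_HOUR = 0.50       # dollars per hour, from the pricing above
MIN_BILLED_SECONDS = 30    # minimum billing per job

def cost_per_screenshot(shots_per_job: int, job_seconds: float = 30) -> float:
    """Per-screenshot cost when `shots_per_job` captures share one job."""
    billed = max(job_seconds, MIN_BILLED_SECONDS)  # short jobs still bill 30s
    job_cost = billed / 3600 * RATE_PER_HOUR
    return job_cost / shots_per_job
```

Note that a 5-second job and a 30-second job cost the same, which is exactly why packing more screenshots into each job pays off.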
What I Learned Building Vision Agents
After running 50+ API jobs while building and testing vision-enabled agents, here's what actually matters:
1. Always screenshot before acting
Don't assume you know what's on the page. Look first. The page might have changed, an error might have appeared, or the element might not be where you expected.
2. Log everything in scripts
console.log() is free and invaluable. When something fails at 3 AM, you'll have a trail to follow.
3. fullPage=true is the default
Your 667px viewport will produce a 2000px screenshot. Set fullPage: false if you want viewport-only captures.
4. Zoom in for detail work
Full-page screenshots are overview maps. Element-focused screenshots are street view. Use both.
5. Inject auth, don't log in
Login flows are slow and fragile. Cookie injection is fast and reliable. Capture tokens once, reuse forever.
6. Batch when possible
Five separate jobs = ~$0.02. One job with five screenshots = ~$0.004. Same screenshots, 5x cheaper.
Getting Started
1. Get an API key at riddledc.com
2. Take your first screenshot:
curl -X POST "https://api.riddledc.com/v1/run" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}' \
-o screenshot.png
# PNG bytes returned directly (sync by default)

3. Feed it to your vision model and start building.
That's it. No polling, no job IDs to track. The screenshot comes back in the response.
The Bottom Line
Your AI agent doesn't need to manage browsers. It doesn't need to fight with Puppeteer. It doesn't need gigabytes of RAM.
It just needs eyes.
One API call. Results returned directly. From $0.004 per job.
Now your agent can see.