Screenshot + Geometry (Legacy)

Capture high-fidelity screenshots alongside precise element coordinates. Perfect for multimodal AI agents, visual debugging, and screenshot-based workflows.

Compute Cost: 10 credits per request
Latency: ~1-2s with screenshot
Output: PNG, base64 data URI

What is Visual Mode?

Visual Mode combines the power of Map Mode's element coordinate mapping with high-fidelity screenshot capture. This enables multimodal AI workflows where agents can both "see" the webpage through pixels and understand its structure through geometry data.

Key Features

  • Screenshot Capture: High-quality PNG screenshot of the rendered viewport
  • Element Coordinates: Full geometry data with bounding boxes for all interactive elements
  • Multimodal Ready: Designed for GPT-4V, Claude 3, and other vision-language models
  • Accurate Rendering: Uses Precision Engine with full CSS/image loading for pixel-perfect screenshots
  • Data URI Format: Screenshots returned as base64 data URIs, ready for immediate use

Visual Mode vs Map Mode

Understanding when to use each mode

Map Mode

2 credits

Fast geometry extraction for navigation and automation. No screenshot included.

Response Time: <400ms
Screenshot: No
Best For: Automation

Visual Mode

10 credits

Geometry + screenshot for visual verification and multimodal AI workflows.

Response Time: ~1-2s
Screenshot: Yes
Best For: Verification

Request Format

Send a POST request to /v1/observe with Visual Mode parameters

Basic Request

curl -X POST https://api.sentienceapi.com/v1/observe \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://app.example.com/login",
    "mode": "visual",
    "options": {
      "screenshot_delivery": "base64"
    }
  }'
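
For callers working in Python rather than curl, the same request can be sketched with the standard library. This is a minimal sketch, not an official client: the helper name build_visual_request is hypothetical, and the 60-second timeout mentioned in the comment is an assumption. The endpoint, headers, and payload fields are taken from the curl example above.

```python
import json
import urllib.request

# Endpoint taken from the curl example.
API_URL = "https://api.sentienceapi.com/v1/observe"


def build_visual_request(api_key: str, page_url: str,
                         delivery: str = "base64") -> urllib.request.Request:
    """Build (but do not send) a Visual Mode POST request."""
    payload = {
        "url": page_url,
        "mode": "visual",
        "options": {"screenshot_delivery": delivery},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_visual_request("sk_live_...", "https://app.example.com/login")
# To send it (the timeout value is an assumption, not a documented limit):
# with urllib.request.urlopen(req, timeout=60) as resp:
#     result = json.load(resp)
```

Building the request separately from sending it makes the payload easy to inspect or log before any credits are spent.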

With Smart Filtering (Recommended)

curl -X POST https://api.sentienceapi.com/v1/observe \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://app.example.com/login",
    "mode": "visual",
    "options": {
      "screenshot_delivery": "base64",
      "limit": 50,
      "filter": {
        "allowed_roles": ["button", "textbox", "link"]
      }
    }
  }'

With Visual Cues (NEW)

Get visual styling hints (color names, cursor type, prominence) for icon-heavy UIs. Adds ~17 tokens per element.

curl -X POST https://api.sentienceapi.com/v1/observe \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://figma.com/design/123",
    "mode": "visual",
    "options": {
      "screenshot_delivery": "base64",
      "include_visual_cues": true,
      "limit": 100
    }
  }'

💡 Use Case: Perfect for icon-heavy UIs and design tools. Returns color names, cursor type, and prominence detection to help AI agents identify icon-only buttons and primary CTAs.
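
As an illustration of how an agent might consume these hints, the sketch below picks the first element flagged as primary. The helper name find_primary_cta is hypothetical. The field names (visual_cues, is_primary) follow the example response in this document, and the sample elements are made up for the demonstration.

```python
from typing import Optional


def find_primary_cta(elements: list[dict]) -> Optional[dict]:
    """Return the first element whose visual_cues mark it as primary, if any."""
    for el in elements:
        # visual_cues may be absent, so fall back to an empty dict.
        cues = el.get("visual_cues") or {}
        if cues.get("is_primary"):
            return el
    return None


# Illustrative data shaped like entries of interactable_elements.
elements = [
    {"id": 1, "role": "textbox", "text": ""},
    {"id": 3, "role": "button", "text": "Sign In",
     "visual_cues": {"background_color_name": "blue", "color_name": "white",
                     "cursor": "pointer", "is_primary": True}},
]
cta = find_primary_cta(elements)
print(cta["text"] if cta else "no primary CTA found")
```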

URL Delivery (Recommended for AI Agents) (NEW)

Returns a presigned URL instead of inline base64. This shrinks the JSON response from ~1MB to ~200KB, and the URL can be passed directly to OpenAI/Anthropic vision APIs. URLs expire in 24 hours.

curl -X POST https://api.sentienceapi.com/v1/observe \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://app.example.com/login",
    "mode": "visual",
    "options": {
      "screenshot_delivery": "url"
    }
  }'

💡 Benefits: smaller payloads (~200KB vs ~1MB), faster JSON parsing, native OpenAI/Anthropic support. Use screenshot_delivery: "url" for production AI agents.

Request Parameters

url (required, string)

The URL of the webpage to capture. Must be a valid HTTP or HTTPS URL.

mode (required, string)

Must be set to "visual" for Visual Mode.

options (optional, object)

Advanced options for fine-tuning Visual Mode behavior.

options.limit

Maximum number of elements to return. Reduces token costs when sending to vision models.

options.screenshot_delivery (NEW)

Screenshot delivery mode: "base64" (default, inline base64 ~1MB) or "url" (presigned URL ~200KB, recommended for AI agents). URLs expire in 24h.

options.filter

Filter elements by attributes (same as Map Mode). Supports min_area, allowed_tags, allowed_roles, min_z_index.

options.include_visual_cues

Enable visual styling hints (color names, cursor type, prominence). Adds ~17 tokens per element. Useful for icon-heavy UIs.
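
The options above can be assembled and sanity-checked on the client before a request is sent. The following is a minimal sketch: build_options is a hypothetical helper, and the rule that limit must be positive is an assumption rather than documented API behavior. Only option names listed above are used.

```python
from typing import Optional

# The two delivery modes listed in the parameter reference.
VALID_DELIVERY = {"base64", "url"}


def build_options(
    screenshot_delivery: str = "base64",
    limit: Optional[int] = None,
    allowed_roles: Optional[list[str]] = None,
    include_visual_cues: bool = False,
) -> dict:
    """Assemble the `options` object, rejecting obviously invalid values."""
    if screenshot_delivery not in VALID_DELIVERY:
        raise ValueError(f"screenshot_delivery must be one of {sorted(VALID_DELIVERY)}")
    # Assumption: a non-positive limit is never meaningful.
    if limit is not None and limit <= 0:
        raise ValueError("limit must be a positive integer")

    options: dict = {"screenshot_delivery": screenshot_delivery}
    if limit is not None:
        options["limit"] = limit
    if allowed_roles:
        options["filter"] = {"allowed_roles": list(allowed_roles)}
    if include_visual_cues:
        options["include_visual_cues"] = True
    return options


opts = build_options("url", limit=50, allowed_roles=["button", "textbox", "link"])
```

Omitting unset keys keeps the request body small and leaves server-side defaults in effect.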

Response Format

Screenshot + element coordinates in a single response

Example Response

Example response with new fields: is_occluded (always included) and visual_cues (when include_visual_cues=true)

{
  "engine": "precision",
  "source": "precision_visual",
  "status": "success",
  "url": "https://app.example.com/login",
  "layout_viewport": {
    "width": 1024,
    "height": 768,
    "device_pixel_ratio": 1.0
  },
  "interactable_elements": [
    {
      "id": 1,
      "uid": "email",
      "tag": "input",
      "role": "textbox",
      "text": "",
      "selector": "input#email",
      "bbox": {
        "x": 350,
        "y": 200,
        "w": 300,
        "h": 45
      },
      "is_visible": true,
      "z_index": 1,
      "in_viewport": true,
      "is_occluded": false,
      "attributes": {
        "type": "email",
        "placeholder": "you@example.com",
        "aria_label": "Email address"
      }
    },
    {
      "id": 2,
      "uid": "password",
      "tag": "input",
      "role": "textbox",
      "text": "",
      "selector": "input#password",
      "bbox": {
        "x": 350,
        "y": 270,
        "w": 300,
        "h": 45
      },
      "is_visible": true,
      "z_index": 1,
      "in_viewport": true,
      "is_occluded": false,
      "attributes": {
        "type": "password",
        "aria_label": "Password"
      }
    },
    {
      "id": 3,
      "uid": "login-btn",
      "tag": "button",
      "role": "button",
      "text": "Sign In",
      "selector": "button#login-btn",
      "bbox": {
        "x": 350,
        "y": 340,
        "w": 300,
        "h": 50
      },
      "is_visible": true,
      "z_index": 1,
      "in_viewport": true,
      "is_occluded": false,
      "attributes": {
        "aria_label": "Sign in to your account"
      },
      "visual_cues": {
        "background_color_name": "blue",
        "color_name": "white",
        "cursor": "pointer",
        "is_primary": true
      }
    }
  ],
  "screenshot": {
    "type": "base64",
    "data": "iVBORw0KGgoAAAANSUhEUgAAA...",
    "format": "png",
    "size_bytes": 665432
  },
  "screenshot_error": null,
  "timestamp": "2025-12-18T06:26:29.812Z",
  "total_elements_extracted": 3
}

is_occluded: Always included. Indicates whether the element is covered by another element (detected via elementFromPoint raycasting).

visual_cues: Only included when include_visual_cues: true. Contains 4 fields: background_color_name, color_name, cursor, and is_primary.
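
To show how these fields might drive an action, the sketch below filters for elements that are visible, in the viewport, and not occluded, then computes a click point at the center of the bounding box. Both helper names are hypothetical. It assumes bbox.x and bbox.y are the top-left corner in layout-viewport pixels, which the example response suggests but this page does not state explicitly.

```python
def actionable(elements: list[dict]) -> list[dict]:
    """Keep elements that are visible, in the viewport, and not occluded."""
    return [
        el for el in elements
        if el.get("is_visible") and el.get("in_viewport") and not el.get("is_occluded")
    ]


def bbox_center(bbox: dict) -> tuple[float, float]:
    """Center of a bounding box, assuming (x, y) is its top-left corner."""
    return (bbox["x"] + bbox["w"] / 2, bbox["y"] + bbox["h"] / 2)


# The "Sign In" button from the example response, plus a covered element.
elements = [
    {"id": 3, "bbox": {"x": 350, "y": 340, "w": 300, "h": 50},
     "is_visible": True, "in_viewport": True, "is_occluded": False},
    {"id": 9, "bbox": {"x": 0, "y": 0, "w": 10, "h": 10},
     "is_visible": True, "in_viewport": True, "is_occluded": True},
]
targets = actionable(elements)
click_x, click_y = bbox_center(targets[0]["bbox"])
```

If device_pixel_ratio is not 1.0, screenshot pixels and layout coordinates may differ by that factor, so check it before overlaying boxes on the image.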

Screenshot Field Formats

The screenshot field supports two formats based on options.screenshot_delivery:

Format 1: Base64 (default), screenshot_delivery: "base64"
"screenshot": {
  "type": "base64",
  "data": "iVBORw0KGgoAAAANSUhEUgAAA...",
  "format": "png",
  "size_bytes": 665432
}
Format 2: Presigned URL (recommended for AI agents), screenshot_delivery: "url"
"screenshot": {
  "type": "url",
  "url": "https://sentience-screenshots.sfo3.digitaloceanspaces.com/screenshots/76795555-80be-4d27-84f0-73e0a0ce68c4.png?x-id=GetObject&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=7UL6GTGZBN3M2LVTRGVD%2F20251219%2Fsfo3%2Fs3%2Faws4_request&X-Amz-Date=20251219T051045Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=06ac3c6d6111109329417b82e35edb7438a40b29d0c806c85c5e39b7eac8cc4c",
  "format": "png",
  "size_bytes": 158277,
  "expires_at": "2025-12-20T05:10:45.401346170+00:00"
}
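
Because presigned URLs expire, a client may want to check expires_at before handing the URL to a downstream service. The sketch below parses the timestamp format shown above, whose fractional seconds have nine digits, more than older Python versions accept in datetime.fromisoformat. The helper names and the 5-minute safety margin are assumptions.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_expires_at(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, normalizing fractional seconds to 6 digits."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    # Truncate or pad the fraction to microsecond precision.
    value = re.sub(r"\.(\d+)", lambda m: "." + m.group(1)[:6].ljust(6, "0"), value)
    return datetime.fromisoformat(value)


def is_usable(screenshot: dict, now: datetime,
              margin: timedelta = timedelta(minutes=5)) -> bool:
    """True if the screenshot can still be fetched at `now`, with a safety margin."""
    if screenshot.get("type") != "url":
        return True  # inline base64 has no expiry
    return parse_expires_at(screenshot["expires_at"]) - margin > now


shot = {
    "type": "url",
    "url": "https://example.invalid/screenshot.png",  # placeholder URL
    "expires_at": "2025-12-20T05:10:45.401346170+00:00",
}
print(is_usable(shot, datetime(2025, 12, 19, 12, 0, tzinfo=timezone.utc)))
```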

Response Fields

screenshot (UPDATED)

Polymorphic object with two formats:

  • type: "base64" - Contains data field with base64 string (default)
  • type: "url" - Contains url field with presigned URL and expires_at timestamp

Format controlled by options.screenshot_delivery. Legacy string format (base64 data URI) still supported for backward compatibility.
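
Since the field can arrive in three shapes (base64 object, URL object, or legacy data-URI string), client code may benefit from normalizing it to a single image URL. This is a minimal sketch with a hypothetical helper name. It assumes the legacy string is already a complete data URI, as described above.

```python
from typing import Optional, Union


def screenshot_to_image_url(screenshot: Union[dict, str, None]) -> Optional[str]:
    """Normalize any screenshot format to a URL usable as an image source."""
    if screenshot is None:
        return None  # capture failed; see screenshot_error
    if isinstance(screenshot, str):
        return screenshot  # legacy format: already a base64 data URI
    kind = screenshot.get("type")
    if kind == "url":
        return screenshot["url"]
    if kind == "base64":
        fmt = screenshot.get("format", "png")
        return f"data:image/{fmt};base64,{screenshot['data']}"
    raise ValueError(f"unknown screenshot type: {kind!r}")


src = screenshot_to_image_url(
    {"type": "base64", "data": "iVBORw0KGgo=", "format": "png"}
)
```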

screenshot_error

Error message if screenshot capture failed (null if successful). Geometry extraction still succeeds even if screenshot fails.
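
One way to handle this partial-success case is to always use the geometry and treat the screenshot as optional. The sketch below uses only field names from the example response. The helper name split_result and the sample response are illustrative.

```python
from typing import Optional


def split_result(response: dict) -> tuple[list[dict], Optional[dict], Optional[str]]:
    """Return (elements, screenshot, screenshot_error) from a Visual Mode response."""
    elements = response.get("interactable_elements", [])
    error = response.get("screenshot_error")
    # When capture failed, ignore any screenshot value and surface the error.
    screenshot = None if error else response.get("screenshot")
    return elements, screenshot, error


# Illustrative partial-success response: geometry present, screenshot missing.
partial = {
    "status": "success",
    "interactable_elements": [{"id": 1, "role": "textbox"}],
    "screenshot": None,
    "screenshot_error": "capture timed out",  # example message, not a documented string
}
elements, screenshot, error = split_result(partial)
if error:
    print(f"continuing with geometry only: {error}")
```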

source

Set to "precision_visual" for Visual Mode responses.

interactable_elements

Array of interactive elements with coordinates (same format as Map Mode). See Map Mode docs for full field reference.

Common Use Cases

Multimodal AI Agents

Send screenshots to GPT-4V or Claude 3 for visual verification: "Is this the correct product page?" or "Did the form submit successfully?"

CAPTCHA Detection

Detect visual CAPTCHAs that block automation by analyzing screenshots with vision models.

Visual Debugging

Understand why automation fails by seeing the actual rendered state alongside element coordinates.

Visual Regression Testing

Compare screenshots before and after deployments to detect unexpected visual changes.

Screenshot Details

Format

Screenshot is returned as a polymorphic object with two delivery modes:

Base64 (Default):

{ "type": "base64", "data": "iVBORw0KGgo...", "format": "png" }

Presigned URL:

{ "type": "url", "url": "https://...", "expires_at": "2025-12-19T..." }

Delivery Mode

Control screenshot delivery via screenshot_delivery option:

  • "base64" - Inline base64 (~1MB payload, default)
  • "url" - Presigned URL (~200KB payload, expires in 24h, recommended for AI agents)

Viewport Size

Captures the visible viewport only (1024×768 standard), not full-page screenshots.

Typical Size

Base64: ~500KB-1MB JSON payload. URL: ~200KB JSON payload (image downloaded separately).
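
The base64 figure follows from the encoding itself: every 3 raw bytes become 4 characters, so the inline payload is about one third larger than the PNG. The sketch below works this out for the size_bytes value in the example response and adds a cheap check that decoded data starts with the PNG signature. Both helper names are hypothetical.

```python
import base64

# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def base64_length(size_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of size_bytes raw bytes."""
    return 4 * ((size_bytes + 2) // 3)


def looks_like_png(b64_data: str) -> bool:
    """Check the PNG signature by decoding only the first 12 base64 characters."""
    head = base64.b64decode(b64_data[:12])  # 12 characters decode to 9 bytes
    return head.startswith(PNG_SIGNATURE)


# size_bytes from the example response: a 665,432-byte PNG.
print(base64_length(665432))  # 887244 characters, within the ~500KB-1MB range
print(looks_like_png("iVBORw0KGgoAAAANSUhEUgAA"))  # True
```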

Usage

Handle both formats in your code:

const imgSrc = screenshot.type === 'url' ? screenshot.url : `data:image/png;base64,${screenshot.data}`;

Example: Multimodal Verification Workflow

Here's how to combine Visual Mode with GPT-4V for intelligent verification:

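import openai  # pip install openai; reads OPENAI_API_KEY from the environment

# `sentience`, `find_element`, and `click` stand in for your own API client and
# automation helpers; they are not defined in this snippet.

# Step 1: Navigate to product page (fast, Map Mode)
page = sentience.observe(
    url="https://amazon.com/product/B123",
    mode="map"
)
add_to_cart = find_element(page, text="Add to Cart")

# Step 2: Visual verification before clicking (Visual Mode)
visual = sentience.observe(
    url="https://amazon.com/product/B123",
    mode="visual",
    options={"limit": 50}
)

# The screenshot is an object, so build an image URL from whichever format came back
shot = visual["screenshot"]
if shot["type"] == "url":
    image_url = shot["url"]
else:
    image_url = f"data:image/png;base64,{shot['data']}"

# Step 3: Send to GPT-4V for verification
# (swap in a current vision-capable model if this one is unavailable)
response = openai.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Is this the 'Wireless Headphones' product? Answer yes or no."},
            {"type": "image_url", "image_url": {"url": image_url}}
        ]
    }]
)
# The API returns a response object, so read the text and turn it into a boolean
answer = response.choices[0].message.content or ""
is_correct = answer.strip().lower().startswith("yes")

# Step 4: Only proceed if verification passes
if is_correct:
    click(add_to_cart["bbox"])
    print("Product verified and added to cart!")
else:
    print("Wrong product detected, aborting.")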

Error Responses

If Visual Mode encounters an error, you'll receive an error response:

{
  "error": "Failed to render page: timeout after 30s"
}

Note: If screenshot capture fails but geometry extraction succeeds, you'll receive the element data with screenshot_error containing the error message and screenshot: null.

  • 400 Bad Request: Invalid URL, mode, or options
  • 401 Unauthorized: Missing or invalid API key
  • 500 Internal Server Error: Failed to capture screenshot or extract elements
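
A client might map these status codes to actions along the following lines. This is a sketch under assumptions: retrying on 5xx with exponential backoff is a common convention, not something this page documents, and both helper names are hypothetical.

```python
def classify_status(status: int) -> str:
    """Map an HTTP status code to a suggested client action."""
    if 200 <= status < 300:
        return "ok"
    if status == 400:
        return "fix_request"      # invalid URL, mode, or options
    if status == 401:
        return "fix_credentials"  # missing or invalid API key
    if status >= 500:
        return "retry"            # assumption: server errors may be transient
    return "unexpected"


def backoff_delays(attempts: int, base: float = 1.0) -> list[float]:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ..."""
    return [base * 2 ** i for i in range(attempts)]


print(classify_status(500), backoff_delays(3))
```

Errors in the 4xx range are not retried here because resending an invalid request or key will fail the same way.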
