Comparing Browser Automation Approaches for AI Agents
An objective analysis of semantic data vs. vision-based methods, based on real-world experiments with Google Search and Amazon Shopping workflows.
As AI agents become more capable, developers need reliable methods for web automation. This page presents findings from controlled experiments comparing three approaches:
- Semantic Element Discovery (Sentience SDK)
- Vision-Based Automation (GPT-4 Vision)
- Traditional Selector-Based (Playwright)
All experiments are open-source and reproducible.
🔬 All Experiments Are Open Source
Every experiment presented on this page is fully reproducible. Clone the repository, run the demos yourself, and verify our findings independently.
→ github.com/SentienceAPI/sentience-sdk-playground
Experiment Setup
- Conducted: December 22-23, 2024
- Tasks: Google Search, Amazon Shopping, Local LLM demos
- Models: GPT-4 Turbo, GPT-4 Vision, Qwen 2.5 3B
- Repository: github.com/SentienceAPI/sentience-sdk-playground
- Methodology: Same tasks, different automation approaches
- Reproducibility: All code and data available open-source
For a technical deep dive into why vision-only approaches fail for web automation:
Why Web Agents Fail and How Semantic Geometry Helps Them Execute
Approaches Compared
We evaluated three distinct approaches to browser automation, each with different strengths and tradeoffs.
Semantic Element Discovery
Sentience SDK
How It Works:
- Navigate to webpage
- Call snapshot() to extract structured element data
- Filter elements by role, importance, or criteria
- Send filtered JSON to LLM for decision-making
- Execute actions using precise coordinates
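A minimal sketch of that loop, reusing the function names from the login example at the bottom of this page. The element attributes iterated here (snap.elements, el.role, el.text, el.id) and the ask_llm helper are illustrative assumptions, not confirmed SDK API:

```python
# Sketch of the snapshot -> filter -> LLM -> act loop for the Google Search task.
# snap.elements / el.role / el.text / el.id are assumed attribute names; ask_llm is
# a placeholder for whichever chat-completion client you use.
from sentience import SentienceBrowser, snapshot, click

browser = SentienceBrowser(api_key="sk_live_...")
browser.start()
browser.page.goto("https://www.google.com/search?q=visiting+japan")

snap = snapshot(browser)                                    # structured element data

# Pre-filter: keep only result links, drop everything else before the LLM sees it
links = [el for el in snap.elements if el.role == "link"]

prompt = "Pick the id of the first organic (non-ad) result:\n" + \
         "\n".join(f"{el.id}: {el.text}" for el in links)
chosen_id = ask_llm(prompt)                                 # hypothetical LLM helper

click(browser, chosen_id)                                   # act on a precise target
browser.close()
```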
Strengths:
- Structured data enables intelligent filtering
- Deterministic behavior (same input → same output)
- Token-efficient through pre-filtering
Limitations:
- Requires browser extension (Chrome only currently)
- Depends on proper ARIA roles
Best For:
Production automation workflows, token-sensitive applications, complex multi-step tasks
Vision-Based Automation
GPT-4 Vision + Playwright
How It Works:
- Navigate to webpage
- Capture full-page screenshot (1920×1080 PNG)
- Send screenshot to vision model (GPT-4V)
- LLM analyzes pixels to identify elements visually
- Returns estimated coordinates for target
- Attempt to click estimated coordinates
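For comparison, a sketch of that pipeline using Playwright and the OpenAI Python client. The prompt, and the assumption that the model returns parseable JSON coordinates, are illustrative; in the runs reported below the model often returned empty responses instead:

```python
# Sketch of the screenshot -> vision model -> estimated-coordinate click pipeline.
# Assumes the model answers with JSON like {"x": 640, "y": 360}.
import base64, json
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()

with sync_playwright() as p:
    page = p.chromium.launch().new_page(viewport={"width": 1920, "height": 1080})
    page.goto("https://www.google.com/search?q=visiting+japan")

    png = page.screenshot(full_page=True)                    # pixels, not structure
    data_url = "data:image/png;base64," + base64.b64encode(png).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": 'Return JSON {"x":..,"y":..} for the first non-ad result.'},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]}],
    )
    coords = json.loads(resp.choices[0].message.content)     # assumes valid JSON back
    page.mouse.click(coords["x"], coords["y"])                # estimated, not exact
```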
Strengths:
- Works with any visual interface
- Can identify elements by visual appearance
- No browser extension required
Limitations:
- High token cost (images are token-intensive)
- Non-deterministic (coordinate estimation varies)
- Content policy concerns (automation flagged)
Best For:
OCR and visual content extraction, one-off manual testing scenarios
Traditional Selectors
Playwright
How It Works:
- Navigate to webpage
- Inspect page to identify CSS selectors
- Write automation scripts using selectors
- Execute actions directly via DOM manipulation
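A minimal Playwright script for the same search task. The selectors (textarea[name='q'], #search a h3) are illustrative of Google's markup at the time of writing, and are exactly the kind of hand-maintained detail that breaks when the page changes:

```python
# Traditional selector-based automation: fast and precise, but every selector below
# is a manual, page-specific assumption that breaks when the DOM changes.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.google.com")

    page.fill("textarea[name='q']", "visiting japan")   # hand-inspected selector
    page.keyboard.press("Enter")

    page.wait_for_selector("#search a h3")              # organic results block
    page.click("#search a h3")                          # first result heading
    print(page.title())
    browser.close()
```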
Strengths:
- Fast execution (no LLM overhead)
- Precise targeting (direct DOM access)
- Mature ecosystem with extensive tooling
Limitations:
- Brittle (breaks on CSS/DOM changes)
- Requires manual maintenance
- No semantic understanding
Best For:
Static websites with stable structure, regression testing with fixed test cases
Experimental Results
Data from controlled experiments with neutral framing. All experiments are reproducible via the GitHub repository.
🔍 Experiment 1: Google Search
Task
Navigate to Google, search for "visiting japan", click first non-ad result
Methodology
- Same task executed with SDK and Vision approaches
- 2 runs each to test consistency
- Token usage measured via LLM API
- Success defined as: correct result clicked, navigation confirmed
Results
| Metric | SDK + Semantic Data | Vision + GPT-4o | Notes |
|---|---|---|---|
| Success Rate | 100% (2/2 runs) | 0% (0/2 runs) | SDK achieved task completion in both runs |
| Token Usage | 2,636 tokens/run | N/A (task failed) | SDK optimized via element filtering |
| Optimization | 73% reduction | Not applicable | Filtering reduced 9,800 → 2,636 tokens |
| Execution Time | ~5-8 seconds | N/A (task failed) | Including LLM inference time |
| Failure Mode | None observed | Empty LLM responses | Vision model returned no usable output |
SDK Approach - Detailed Breakdown
| Step | Elements After Filter | Tokens |
|---|---|---|
| Find search box | 1 element (combobox) | ~800 |
| Select result | 7-8 links (ads filtered) | ~1,800 |
| Total | ~9 elements | 2,636 |
Filtering Strategy: Scene 1 excluded non-input roles (img, svg, button, link) and kept only inputs. Scene 3 excluded inputs and buttons, kept only links, and filtered out ads by text patterns.
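The two filter passes, sketched as plain predicates. The element dicts (with "role" and "text" keys) and the AD_PATTERNS list are assumptions about the snapshot payload, not the SDK's actual filter API:

```python
# Sketch of the per-scene filtering described above; elements are assumed to be
# dicts with "role" and "text" keys, and AD_PATTERNS is illustrative.
AD_PATTERNS = ("sponsored", "ad ·", "advertisement")

def filter_scene1(elements):
    # Scene 1: keep only input-like roles, drop img/svg/button/link entirely.
    return [el for el in elements
            if el["role"] in ("combobox", "textbox", "searchbox")]

def filter_scene3(elements):
    # Scene 3: keep only links, then drop anything matching an ad text pattern.
    links = (el for el in elements if el["role"] == "link")
    return [el for el in links
            if not any(p in el["text"].lower() for p in AD_PATTERNS)]
```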
Vision Approach - Failure Analysis
Both runs resulted in empty responses from GPT-4 Vision
Possible Causes:
- Content policy restrictions
- Coordinate estimation challenges
- Inability to distinguish ads from results
Outcome:
No successful task completion in either run
🛒 Experiment 2: Amazon Shopping
Task
Search for "Christmas gift", select product, add to cart
Methodology
- Multi-step workflow (5 scenes total)
- SDK: 1 successful run; Vision: 3 attempted runs
- Measured end-to-end completion
- Success defined as: item successfully added to cart
Results
| Metric | SDK + Semantic Data | Vision + GPT-4o | Notes |
|---|---|---|---|
| Success Rate | 100% (1/1 run) | 0% (0/3 runs) | Vision failed all attempts |
| Token Usage | 19,956 tokens | N/A (never completed) | SDK optimized per scene |
| Completion Rate | 5/5 scenes | 0/5 scenes | Vision failed before completion |
| Execution Time | ~60 seconds | N/A (crashed) | Full workflow duration |
| Optimization | 43% reduction | Not applicable | Pre-filtering saved ~15k tokens |
SDK Approach - Scene-by-Scene Breakdown
| Scene | Task | Elements Sent | Tokens | Outcome |
|---|---|---|---|---|
| 1 | Find search bar | Filtered to inputs only | 956 | Success |
| 2 | Type "Christmas gift" | N/A (direct keyboard) | 0 | Success |
| 3 | Select product | Filtered to links only | 5,875 | Success |
| 4 | Click "Add to Cart" | Filtered to buttons only | 5,495 | Success |
| 5 | Verify cart success | Full element set | 7,630 | Success |
| Total | Complete workflow | Optimized per scene | 19,956 | Success |
Vision Approach - Failure Analysis
| Run | Failure Point | Observed Behavior | Root Cause |
|---|---|---|---|
| 1 | Scene 2 | Search text not entered | Keyboard input not executed |
| 2 | Scene 3 | Wrong page (vehicle parts) | Navigation failure undetected |
| 3 | Scene 4 | Cannot find Add to Cart | Analyzing incorrect page |
Common Failure Patterns
- Failed text entry went undetected (the model assumed typing had succeeded)
- Navigation failures not recognized by the vision model
- Context confusion (wrong product category)
- Inability to verify element semantics (button vs. div)
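These failure modes are largely avoidable when each step ends with a semantic check. A sketch using the wait_for call from the SDK example further down this page; the selector string and the exception handling are assumptions:

```python
# After clicking the product link, verify that a product page actually loaded before
# moving on; a wrong page (e.g. vehicle parts) fails here instead of in Scene 4.
from sentience import wait_for

try:
    wait_for(browser, "role=button text~'add to cart'", timeout=5.0)
except Exception as err:  # the SDK's timeout error class is assumed, not documented here
    raise RuntimeError("Scene 3 verification failed: no 'Add to Cart' button") from err
```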
💻 Experiment 3: Local LLM Integration
Task
Google Search with local small language model (Qwen 2.5 3B, ~6GB)
Methodology
- Same Google Search task as Experiment 1
- Replaced cloud LLM (GPT-4) with local model (Qwen 2.5 3B)
- Measured performance, token usage, accuracy
Results
| Metric | Cloud LLM (GPT-4) | Local LLM (Qwen 2.5 3B) | Notes |
|---|---|---|---|
| Success Rate | 100% | ~85% | Local model less accurate on this task |
| Token Usage | 2,636 tokens | ~2,500 tokens | Similar due to same filtering |
| Cost per Run | ~$0.03 (API fees) | $0 (local inference) | Local eliminates API costs |
| Inference Speed (GPU) | ~8 seconds | ~5 seconds | Local faster with GPU |
| Inference Speed (CPU) | ~8 seconds | ~25 seconds | CPU inference slower |
| Model Size | N/A (API) | ~6 GB disk space | One-time download |
| Privacy | Cloud processing | Local processing | All data stays on device |
Key Findings
- Local small models (3B parameters) can handle web automation tasks
- Accuracy tradeoff: ~85% vs. 98% for complex scenarios
- Zero API costs after initial model download
- Privacy advantage: no data sent to external services
- GPU recommended for reasonable inference speed
Use Cases for Local LLMs
- Cost-sensitive automation (high volume)
- Privacy-critical applications
- Offline/air-gapped environments
- Rapid iteration during development
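As an illustration of the swap, the decision step can point at a locally served model instead of a cloud API. The sketch below assumes Qwen 2.5 3B is running under Ollama, which is one possible local setup rather than the exact stack used in the experiments:

```python
# Same filtered-element prompt as the cloud runs, but inference happens locally.
# Assumes `ollama pull qwen2.5:3b` has been run (a ~6 GB one-time download).
import ollama

def choose_element(filtered_elements_json: str) -> str:
    prompt = (
        "You control a browser. From these elements, reply with only the id "
        "of the element to click:\n" + filtered_elements_json
    )
    resp = ollama.chat(
        model="qwen2.5:3b",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["message"]["content"].strip()  # no API fees; data never leaves the machine
```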
Try It Live
Explore interactive SDK examples or test the API directly with real automation scenarios
Navigate to a login page, find email/password fields semantically, and submit the form.
```python
# No selectors. No vision. Stable semantic targets.
from sentience import SentienceBrowser, snapshot, find, click, type_text, wait_for

# Initialize browser with API key
browser = SentienceBrowser(api_key="sk_live_...")
browser.start()

# Navigate to login page
browser.page.goto("https://example.com/login")

# PERCEPTION: Find elements semantically
snap = snapshot(browser)
email_field = find(snap, "role=textbox text~'email'")
password_field = find(snap, "role=textbox text~'password'")
submit_btn = find(snap, "role=button text~'sign in'")

# ACTION: Interact with the page
type_text(browser, email_field.id, "user@example.com")
type_text(browser, password_field.id, "secure_password")
click(browser, submit_btn.id)

# VERIFICATION: Wait for navigation
wait_for(browser, "role=heading text~'Dashboard'", timeout=5.0)

print("✅ Login successful!")
browser.close()
```
🎯 Semantic Discovery
Find elements by role, text, and visual cues - not fragile CSS selectors
⚡ Token Optimization
Intelligent filtering reduces token usage by up to 73% compared with sending the unfiltered element set
🔒 Deterministic
Same input produces same output every time - no random failures