Comparing Browser Automation Approaches for AI Agents
An objective analysis of semantic data vs. vision-based methods, based on real-world experiments with Google Search and Amazon Shopping workflows.
As AI agents become more capable, developers need reliable methods for web automation. This page presents findings from controlled experiments comparing three approaches:
- Semantic Element Discovery (Sentience SDK)
- Vision-Based Automation (GPT-4 Vision)
- Traditional Selector-Based (Playwright)
All experiments are open-source and reproducible.
🔬 All Experiments Are Open Source
Every experiment presented on this page is fully reproducible. Clone the repository, run the demos yourself, and verify our findings independently.
→ github.com/SentienceAPI/sentience-sdk-playground
Experiment Setup
- Conducted: December 22-23, 2024
- Tasks: Google Search, Amazon Shopping, Local LLM demos
- Models: GPT-4 Turbo, GPT-4 Vision, Qwen 2.5 3B
- Repository: github.com/SentienceAPI/sentience-sdk-playground
- Methodology: Same tasks, different automation approaches
- Reproducibility: All code and data available open-source
For a technical deep dive into why vision-only approaches fail for web automation:
Why Web Agents Fail and How Semantic Geometry Helps Them Execute
Approaches Compared
We evaluated three distinct approaches to browser automation, each with different strengths and tradeoffs.
Semantic Element Discovery
Sentience SDK
How It Works:
- Navigate to webpage
- Call snapshot() to extract structured element data
- Filter elements by role, importance, or criteria
- Send filtered JSON to LLM for decision-making
- Execute actions using precise coordinates
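A minimal sketch of that loop, reusing the function names from the login example at the bottom of this page. The element attributes iterated here (snap.elements, el.role, el.text, el.id) and the ask_llm helper are illustrative assumptions, not confirmed SDK API:

```python
# Sketch of the snapshot -> filter -> LLM -> act loop for the Google Search task.
# snap.elements / el.role / el.text / el.id are assumed attribute names; ask_llm is
# a placeholder for whichever chat-completion client you use.
from sentience import SentienceBrowser, snapshot, click

browser = SentienceBrowser(api_key="sk_live_...")
browser.start()
browser.page.goto("https://www.google.com/search?q=visiting+japan")

snap = snapshot(browser)                                    # structured element data

# Pre-filter: keep only result links, drop everything else before the LLM sees it
links = [el for el in snap.elements if el.role == "link"]

prompt = "Pick the id of the first organic (non-ad) result:\n" + \
         "\n".join(f"{el.id}: {el.text}" for el in links)
chosen_id = ask_llm(prompt)                                 # hypothetical LLM helper

click(browser, chosen_id)                                   # act on a precise target
browser.close()
```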
Strengths:
- Structured data enables intelligent filtering
- Deterministic behavior (same input → same output)
- Token-efficient through pre-filtering
Limitations:
- Requires browser extension (Chrome only currently)
- Depends on proper ARIA roles
Best For:
Production automation workflows, token-sensitive applications, complex multi-step tasks
Vision-Based Automation
GPT-4 Vision + Playwright
How It Works:
- Navigate to webpage
- Capture full-page screenshot (1920×1080 PNG)
- Send screenshot to vision model (GPT-4V)
- LLM analyzes pixels to identify elements visually
- Returns estimated coordinates for target
- Attempt to click estimated coordinates
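For comparison, a sketch of that pipeline using Playwright and the OpenAI Python client. The prompt, and the assumption that the model returns parseable JSON coordinates, are illustrative; in the runs reported below the model often returned empty responses instead:

```python
# Sketch of the screenshot -> vision model -> estimated-coordinate click pipeline.
# Assumes the model answers with JSON like {"x": 640, "y": 360}.
import base64, json
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()

with sync_playwright() as p:
    page = p.chromium.launch().new_page(viewport={"width": 1920, "height": 1080})
    page.goto("https://www.google.com/search?q=visiting+japan")

    png = page.screenshot(full_page=True)                    # pixels, not structure
    data_url = "data:image/png;base64," + base64.b64encode(png).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": 'Return JSON {"x":..,"y":..} for the first non-ad result.'},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]}],
    )
    coords = json.loads(resp.choices[0].message.content)     # assumes valid JSON back
    page.mouse.click(coords["x"], coords["y"])                # estimated, not exact
```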
Strengths:
- Works with any visual interface
- Can identify elements by visual appearance
- No browser extension required
Limitations:
- High token cost (images are token-intensive)
- Non-deterministic (coordinate estimation varies)
- Content policy concerns (automation flagged)
Best For:
OCR and visual content extraction, one-off manual testing scenarios
Traditional Selectors
Playwright
How It Works:
- Navigate to webpage
- Inspect page to identify CSS selectors
- Write automation scripts using selectors
- Execute actions directly via DOM manipulation
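A minimal Playwright script for the same search task. The selectors (textarea[name='q'], #search a h3) are illustrative of Google's markup at the time of writing, and are exactly the kind of hand-maintained detail that breaks when the page changes:

```python
# Traditional selector-based automation: fast and precise, but every selector below
# is a manual, page-specific assumption that breaks when the DOM changes.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.google.com")

    page.fill("textarea[name='q']", "visiting japan")   # hand-inspected selector
    page.keyboard.press("Enter")

    page.wait_for_selector("#search a h3")              # organic results block
    page.click("#search a h3")                          # first result heading
    print(page.title())
    browser.close()
```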
Strengths:
- Fast execution (no LLM overhead)
- Precise targeting (direct DOM access)
- Mature ecosystem with extensive tooling
Limitations:
- Brittle (breaks on CSS/DOM changes)
- Requires manual maintenance
- No semantic understanding
Best For:
Static websites with stable structure, regression testing with fixed test cases
Experimental Results
Data from controlled experiments with neutral framing. All experiments are reproducible via the GitHub repository.
🔍 Experiment 1: Google Search
Task
Navigate to Google, search for "visiting japan", click first non-ad result
Methodology
- Same task executed with SDK and Vision approaches
- 2 runs each to test consistency
- Token usage measured via LLM API
- Success defined as: correct result clicked, navigation confirmed
Results
| Metric | SDK + Semantic Data | Vision + GPT-4o | Notes |
|---|---|---|---|
| Success Rate | 100% (2/2 runs) | 0% (0/2 runs) | SDK achieved task completion in both runs |
| Token Usage | 2,636 tokens/run | N/A (task failed) | SDK optimized via element filtering |
| Optimization | 73% reduction | Not applicable | Filtering reduced 9,800 → 2,636 tokens |
| Execution Time | ~5-8 seconds | N/A (task failed) | Including LLM inference time |
| Failure Mode | None observed | Empty LLM responses | Vision model returned no usable output |
SDK Approach - Detailed Breakdown
| Step | Elements After Filter | Tokens |
|---|---|---|
| Find search box | 1 element (combobox) | ~800 |
| Select result | 7-8 links (ads filtered) | ~1,800 |
| Total | ~9 elements | 2,636 |
Filtering Strategy: Scene 1 excluded non-input roles (img, svg, button, link) and kept only inputs. Scene 3 excluded inputs and buttons, kept only links, and filtered out ads by text patterns.
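The two filter passes, sketched as plain predicates. The element dicts (with "role" and "text" keys) and the AD_PATTERNS list are assumptions about the snapshot payload, not the SDK's actual filter API:

```python
# Sketch of the per-scene filtering described above; elements are assumed to be
# dicts with "role" and "text" keys, and AD_PATTERNS is illustrative.
AD_PATTERNS = ("sponsored", "ad ·", "advertisement")

def filter_scene1(elements):
    # Scene 1: keep only input-like roles, drop img/svg/button/link entirely.
    return [el for el in elements
            if el["role"] in ("combobox", "textbox", "searchbox")]

def filter_scene3(elements):
    # Scene 3: keep only links, then drop anything matching an ad text pattern.
    links = (el for el in elements if el["role"] == "link")
    return [el for el in links
            if not any(p in el["text"].lower() for p in AD_PATTERNS)]
```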
Vision Approach - Failure Analysis
Both runs resulted in empty responses from GPT-4 Vision
Possible Causes:
- Content policy restrictions
- Coordinate estimation challenges
- Inability to distinguish ads from results
Outcome:
No successful task completion in either run
🛒 Experiment 2: Amazon Shopping
Task
Search for "Christmas gift", select product, add to cart
Methodology
- Multi-step workflow (5 scenes total)
- SDK: 1 successful run; Vision: 3 attempted runs
- Measured end-to-end completion
- Success defined as: item successfully added to cart
Results
| Metric | SDK + Semantic Data | Vision + GPT-4o | Notes |
|---|---|---|---|
| Success Rate | 100% (1/1 run) | 0% (0/3 runs) | Vision failed all attempts |
| Token Usage | 19,956 tokens | N/A (never completed) | SDK optimized per scene |
| Completion Rate | 5/5 scenes | 0/5 scenes | Vision failed before completion |
| Execution Time | ~60 seconds | N/A (crashed) | Full workflow duration |
| Optimization | 43% reduction | Not applicable | Pre-filtering saved ~15k tokens |
SDK Approach - Scene-by-Scene Breakdown
| Scene | Task | Elements Sent | Tokens | Outcome |
|---|---|---|---|---|
| 1 | Find search bar | Filtered to inputs only | 956 | Success |
| 2 | Type "Christmas gift" | N/A (direct keyboard) | 0 | Success |
| 3 | Select product | Filtered to links only | 5,875 | Success |
| 4 | Click "Add to Cart" | Filtered to buttons only | 5,495 | Success |
| 5 | Verify cart success | Full element set | 7,630 | Success |
| Total | Complete workflow | Optimized per scene | 19,956 | Success |
Vision Approach - Failure Analysis
| Run | Failure Point | Observed Behavior | Root Cause |
|---|---|---|---|
| 1 | Scene 2 | Search text not entered | Keyboard input not executed |
| 2 | Scene 3 | Wrong page (vehicle parts) | Navigation failure undetected |
| 3 | Scene 4 | Cannot find Add to Cart | Analyzing incorrect page |
Common Failure Patterns
- Failed text entry went undetected (the model assumed typing had succeeded)
- Navigation failures not recognized by the vision model
- Context confusion (wrong product category)
- Inability to verify element semantics (button vs. div)
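These failure modes are largely avoidable when each step ends with a semantic check. A sketch using the wait_for call from the SDK example further down this page; the selector string and the exception handling are assumptions:

```python
# After clicking the product link, verify that a product page actually loaded before
# moving on; a wrong page (e.g. vehicle parts) fails here instead of in Scene 4.
from sentience import wait_for

try:
    wait_for(browser, "role=button text~'add to cart'", timeout=5.0)
except Exception as err:  # the SDK's timeout error class is assumed, not documented here
    raise RuntimeError("Scene 3 verification failed: no 'Add to Cart' button") from err
```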
💻 Experiment 3: Local LLM Integration
Task
Google Search with local small language model (Qwen 2.5 3B, ~6GB)
Methodology
- Same Google Search task as Experiment 1
- Replaced cloud LLM (GPT-4) with local model (Qwen 2.5 3B)
- Measured performance, token usage, accuracy
Results
| Metric | Cloud LLM (GPT-4) | Local LLM (Qwen 2.5 3B) | Notes |
|---|---|---|---|
| Success Rate | 100% | ~85% | Local model less accurate on this task |
| Token Usage | 2,636 tokens | ~2,500 tokens | Similar due to same filtering |
| Cost per Run | ~$0.03 (API fees) | $0 (local inference) | Local eliminates API costs |
| Inference Speed (GPU) | ~8 seconds | ~5 seconds | Local faster with GPU |
| Inference Speed (CPU) | ~8 seconds | ~25 seconds | CPU inference slower |
| Model Size | N/A (API) | ~6 GB disk space | One-time download |
| Privacy | Cloud processing | Local processing | All data stays on device |
Key Findings
- Local small models (3B parameters) can handle web automation tasks
- Accuracy tradeoff: ~85% vs. 98% for complex scenarios
- Zero API costs after initial model download
- Privacy advantage: no data sent to external services
- GPU recommended for reasonable inference speed
Use Cases for Local LLMs
- Cost-sensitive automation (high volume)
- Privacy-critical applications
- Offline/air-gapped environments
- Rapid iteration during development
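As an illustration of the swap, the decision step can point at a locally served model instead of a cloud API. The sketch below assumes Qwen 2.5 3B is running under Ollama, which is one possible local setup rather than the exact stack used in the experiments:

```python
# Same filtered-element prompt as the cloud runs, but inference happens locally.
# Assumes `ollama pull qwen2.5:3b` has been run (a ~6 GB one-time download).
import ollama

def choose_element(filtered_elements_json: str) -> str:
    prompt = (
        "You control a browser. From these elements, reply with only the id "
        "of the element to click:\n" + filtered_elements_json
    )
    resp = ollama.chat(
        model="qwen2.5:3b",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["message"]["content"].strip()  # no API fees; data never leaves the machine
```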
Try It Live
Explore interactive SDK examples or test the API directly with real automation scenarios
Navigate to a login page, find email/password fields semantically, and submit the form.
```python
# No selectors. No vision. Stable semantic targets.
from sentience import SentienceBrowser, snapshot, find, click, type_text, wait_for

# Initialize browser with API key
browser = SentienceBrowser(api_key="sk_live_...")
browser.start()

# Navigate to login page
browser.page.goto("https://example.com/login")

# PERCEPTION: Find elements semantically
snap = snapshot(browser)
email_field = find(snap, "role=textbox text~'email'")
password_field = find(snap, "role=textbox text~'password'")
submit_btn = find(snap, "role=button text~'sign in'")

# ACTION: Interact with the page
type_text(browser, email_field.id, "user@example.com")
type_text(browser, password_field.id, "secure_password")
click(browser, submit_btn.id)

# VERIFICATION: Wait for navigation
wait_for(browser, "role=heading text~'Dashboard'", timeout=5.0)

print("✅ Login successful!")
browser.close()
```
🎯 Semantic Discovery
Find elements by role, text, and visual cues - not fragile CSS selectors
⚡ Token Optimization
Intelligent filtering reduces token usage by up to 73% compared with sending the unfiltered element set
🔒 Deterministic
Same input produces same output every time - no random failures