Comparing Browser Automation Approaches for AI Agents

An objective analysis of semantic data vs. vision-based methods, based on real-world experiments with Google Search and Amazon Shopping workflows.

As AI agents become more capable, developers need reliable methods for web automation. This page presents findings from controlled experiments comparing three approaches:

  • Semantic Element Discovery (Sentience SDK)
  • Vision-Based Automation (GPT-4 Vision)
  • Traditional Selector-Based (Playwright)

All experiments are open-source and reproducible.

🔬 All Experiments Are Open Source

Every experiment presented on this page is fully reproducible. Clone the repository, run the demos yourself, and verify our findings independently.

→ github.com/SentienceAPI/sentience-sdk-playground

Experiment Setup

  • Conducted: December 22-23, 2024
  • Tasks: Google Search, Amazon Shopping, Local LLM demos
  • Models: GPT-4 Turbo, GPT-4 Vision, Qwen 2.5 3B
  • Repository: github.com/SentienceAPI/sentience-sdk-playground
  • Methodology: Same tasks, different automation approaches
  • Reproducibility: All code and data available open-source

For a technical deep dive into why vision-only approaches fail for web automation:

Why Web Agents Fail and How Semantic Geometry Helps Them Execute

Approaches Compared

We evaluated three distinct approaches to browser automation, each with different strengths and tradeoffs.

Semantic Element Discovery

Sentience SDK

How It Works:

  1. Navigate to webpage
  2. Call snapshot() to extract structured element data
  3. Filter elements by role, importance, or criteria
  4. Send filtered JSON to LLM for decision-making
  5. Execute actions using precise coordinates
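
The steps above map onto a short sketch like the following, assuming the snapshot object exposes an elements list whose entries carry role, text, and id fields (illustrative names; only snapshot() and click() mirror the SDK example at the bottom of this page), with an OpenAI client standing in for the decision-making LLM:

# Sketch of the snapshot -> filter -> LLM -> act loop (steps 2-5 above).
# ASSUMPTION: snap.elements, el.role, el.text, and el.id are illustrative field names.
import json

from openai import OpenAI
from sentience import SentienceBrowser, snapshot, click

browser = SentienceBrowser(api_key="sk_live_...")
browser.start()
browser.page.goto("https://www.google.com/search?q=visiting+japan")

# Step 2: extract structured element data
snap = snapshot(browser)

# Step 3: filter by role before involving the LLM
links = [el for el in snap.elements if el.role == "link"]

# Step 4: send the filtered JSON to an LLM for decision-making
client = OpenAI()
prompt = (
    "Pick the id of the first non-ad search result:\n"
    + json.dumps([{"id": el.id, "text": el.text} for el in links])
)
reply = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
)
target_id = reply.choices[0].message.content.strip()

# Step 5: execute the action against the chosen element
click(browser, target_id)
browser.close()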

Strengths:

  • Structured data enables intelligent filtering
  • Deterministic behavior (same input → same output)
  • Token-efficient through pre-filtering

Limitations:

  • Requires a browser extension (currently Chrome only)
  • Depends on proper ARIA roles

Best For:

Production automation workflows, token-sensitive applications, complex multi-step tasks

Vision-Based Automation

GPT-4 Vision + Playwright

How It Works:

  1. Navigate to webpage
  2. Capture full-page screenshot (1920×1080 PNG)
  3. Send screenshot to vision model (GPT-4V)
  4. LLM analyzes pixels to identify elements visually
  5. Returns estimated coordinates for target
  6. Attempt to click estimated coordinates
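
A rough sketch of this loop, using Playwright for the screenshot and the OpenAI chat API for the vision call; the prompt and response format are illustrative:

# Sketch of the screenshot -> vision model -> estimated-click loop.
# The prompt and JSON response format are illustrative; coordinate estimates
# vary between runs, which is the non-determinism noted below.
import base64
import json

from openai import OpenAI
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(viewport={"width": 1920, "height": 1080})
    page.goto("https://www.google.com")

    # Step 2: capture a full-page screenshot
    png = page.screenshot(full_page=True)

    # Steps 3-5: ask the vision model to locate the target and estimate coordinates
    client = OpenAI()
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": 'Return JSON {"x": int, "y": int} for the search box.'},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64,"
                               + base64.b64encode(png).decode()}},
            ],
        }],
    )
    coords = json.loads(reply.choices[0].message.content)

    # Step 6: attempt to click the estimated coordinates
    page.mouse.click(coords["x"], coords["y"])
    browser.close()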

Strengths:

  • Works with any visual interface
  • Can identify elements by visual appearance
  • No browser extension required

Limitations:

  • High token cost (images are token-intensive)
  • Non-deterministic (coordinate estimation varies)
  • Content policy concerns (automation flagged)

Best For:

OCR and visual content extraction, one-off manual testing scenarios

Traditional Selectors

Playwright

How It Works:

  1. Navigate to webpage
  2. Inspect page to identify CSS selectors
  3. Write automation scripts using selectors
  4. Execute actions directly via DOM manipulation
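
A minimal sketch of this approach, with illustrative selectors (in practice these are found by inspecting the target page):

# Sketch of the selector-based approach; the CSS selectors are illustrative
# and would be found by inspecting the target page.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/login")

    # Actions target hard-coded CSS selectors directly in the DOM:
    # fast and precise, but any markup change breaks the script.
    page.fill("input#email", "user@example.com")
    page.fill("input#password", "secure_password")
    page.click("button[type='submit']")
    page.wait_for_selector("h1:has-text('Dashboard')")
    browser.close()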

Strengths:

  • Fast execution (no LLM overhead)
  • Precise targeting (direct DOM access)
  • Mature ecosystem with extensive tooling

Limitations:

  • Brittle (breaks on CSS/DOM changes)
  • Requires manual maintenance
  • No semantic understanding

Best For:

Static websites with stable structure, regression testing with fixed test cases

Experimental Results

Data from controlled experiments with neutral framing. All experiments are reproducible via the GitHub repository.

🔍 Experiment 1: Google Search

Task

Navigate to Google, search for "visiting japan", click first non-ad result

Methodology

  • Same task executed with SDK and Vision approaches
  • 2 runs each to test consistency
  • Token usage measured via LLM API
  • Success defined as: correct result clicked, navigation confirmed

Results

Metric | SDK + Semantic Data | Vision + GPT-4o | Notes
Success Rate | 100% (2/2 runs) | 0% (0/2 runs) | SDK achieved task completion in both runs
Token Usage | 2,636 tokens/run | N/A (task failed) | SDK optimized via element filtering
Optimization | 73% reduction | Not applicable | Filtering reduced 9,800 → 2,636 tokens
Execution Time | ~5-8 seconds | N/A (task failed) | Including LLM inference time
Failure Mode | None observed | Empty LLM responses | Vision model refused to respond

SDK Approach - Detailed Breakdown

Step | Elements After Filter | Tokens
Find search box | 1 element (combobox) | ~800
Select result | 7-8 links (ads filtered) | ~1,800
Total | ~9 elements | 2,636

Filtering Strategy: Scene 1 excluded decorative roles (img, svg, button, link) → kept only inputs. Scene 3 excluded inputs/buttons → kept only links, filtered ads by text patterns.
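
A rough illustration of that filtering, assuming the snapshot elements carry role and text fields (illustrative names, not documented SDK attributes):

# Illustrative role/text filtering; el.role and el.text are assumed field names.
EXCLUDED_SCENE_1 = {"img", "svg", "button", "link"}   # keep only inputs
AD_MARKERS = ("Sponsored", "Ad")                      # assumed ad text patterns

def filter_scene_1(elements):
    # Scene 1: drop decorative/navigation roles, keeping the search input
    return [el for el in elements if el.role not in EXCLUDED_SCENE_1]

def filter_scene_3(elements):
    # Scene 3: keep only result links and drop anything that looks like an ad
    links = [el for el in elements if el.role == "link"]
    return [el for el in links
            if not any(marker in (el.text or "") for marker in AD_MARKERS)]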

Vision Approach - Failure Analysis

Both runs resulted in empty responses from GPT-4 Vision

Possible Causes:

  • Content policy restrictions
  • Coordinate estimation challenges
  • Inability to distinguish ads from results

Outcome:

No successful task completion was achieved in either run

🛒 Experiment 2: Amazon Shopping

Task

Search for "Christmas gift", select product, add to cart

Methodology

  • Multi-step workflow (5 scenes total)
  • SDK: 1 successful run; Vision: 3 attempted runs
  • Measured end-to-end completion
  • Success defined as: item successfully added to cart

Results

Metric | SDK + Semantic Data | Vision + GPT-4o | Notes
Success Rate | 100% (1/1 runs) | 0% (0/3 runs) | Vision failed all attempts
Token Usage | 19,956 tokens | N/A (never completed) | SDK optimized per scene
Completion Rate | 5/5 scenes | 0/5 scenes | Vision failed before completion
Execution Time | ~60 seconds | N/A (crashed) | Full workflow duration
Optimization | 43% reduction | Not applicable | Pre-filtering saved ~15k tokens

SDK Approach - Scene-by-Scene Breakdown

Scene | Task | Elements Sent | Tokens | Outcome
1 | Find search bar | Filtered to inputs only | 956 | Success
2 | Type "Christmas gift" | N/A (direct keyboard) | 0 | Success
3 | Select product | Filtered to links only | 5,875 | Success
4 | Click "Add to Cart" | Filtered to buttons only | 5,495 | Success
5 | Verify cart success | Full element set | 7,630 | Success
Total | Complete workflow | Optimized per scene | 19,956 | Success

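The five scenes above map roughly onto the following sketch, which reuses only the SDK calls shown in the login example at the bottom of this page; the find() query strings are illustrative, and browser.page is assumed to behave like a Playwright page:

# Sketch of the 5-scene workflow, reusing the SDK calls from the login example
# below; the find() queries and keyboard press here are illustrative.
from sentience import SentienceBrowser, snapshot, find, click, type_text, wait_for

browser = SentienceBrowser(api_key="sk_live_...")
browser.start()
browser.page.goto("https://www.amazon.com")

# Scene 1: find the search bar (elements filtered to inputs only)
snap = snapshot(browser)
search_box = find(snap, "role=textbox text~'search'")

# Scene 2: type the query directly (no LLM call, 0 tokens)
type_text(browser, search_box.id, "Christmas gift")
browser.page.keyboard.press("Enter")  # assumes a Playwright-style page object

# Scene 3: select a product (elements filtered to links only)
snap = snapshot(browser)
product = find(snap, "role=link text~'gift'")
click(browser, product.id)

# Scene 4: add to cart (elements filtered to buttons only)
snap = snapshot(browser)
add_btn = find(snap, "role=button text~'add to cart'")
click(browser, add_btn.id)

# Scene 5: verify the item landed in the cart
wait_for(browser, "role=heading text~'added to cart'", timeout=10.0)
browser.close()
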
Vision Approach - Failure Analysis

Run | Failure Point | Observed Behavior | Root Cause
1 | Scene 2 | Search text not entered | Keyboard input not executed
2 | Scene 3 | Wrong page (vehicle parts) | Navigation failure undetected
3 | Scene 4 | Cannot find Add to Cart | Analyzing incorrect page

Common Failure Patterns

  • Failed text entry not detected by the vision model
  • Navigation failures not recognized by the vision model
  • Context confusion (wrong product category)
  • Inability to verify element semantics (button vs. div)

💻 Experiment 3: Local LLM Integration

Task

Google Search with local small language model (Qwen 2.5 3B, ~6GB)

Methodology

  • Same Google Search task as Experiment 1
  • Replaced cloud LLM (GPT-4) with local model (Qwen 2.5 3B)
  • Measured performance, token usage, accuracy

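One way to wire this in is to point the decision-making step at a locally served endpoint instead of a cloud API. The sketch below assumes Qwen 2.5 3B is served through Ollama, which is just one possible local runtime and not necessarily the one used in the repository:

# Sketch: swap the cloud LLM call for a locally served Qwen 2.5 3B.
# ASSUMPTION: Ollama is shown only as one common local runtime; the repository
# may wire the model in differently.
import requests

def choose_element_locally(filtered_elements_json: str) -> str:
    """Ask a local Qwen 2.5 3B (served by Ollama) to pick a target element id."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen2.5:3b",
            "prompt": "Pick the id of the first non-ad search result:\n"
                      + filtered_elements_json,
            "stream": False,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()
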
Results

Metric | Cloud LLM (GPT-4) | Local LLM (Qwen 2.5 3B) | Notes
Success Rate | 100% | ~85% | Local model slightly less accurate
Token Usage | 2,636 tokens | ~2,500 tokens | Similar due to same filtering
Cost per Run | ~$0.03 (API fees) | $0 (local inference) | Local eliminates API costs
Inference Speed (GPU) | ~8 seconds | ~5 seconds | Local faster with GPU
Inference Speed (CPU) | ~8 seconds | ~25 seconds | CPU inference slower
Model Size | N/A (API) | ~6 GB disk space | One-time download
Privacy | Cloud processing | Local processing | All data stays on device

Key Findings

  • Local small models (3B parameters) can handle web automation tasks
  • Accuracy tradeoff: ~85% vs. 98% for complex scenarios
  • Zero API costs after initial model download
  • Privacy advantage: no data sent to external services
  • GPU recommended for reasonable inference speed

Use Cases for Local LLMs

  • Cost-sensitive automation (high volume)
  • Privacy-critical applications
  • Offline/air-gapped environments
  • Rapid iteration during development

Try It Live

Explore interactive SDK examples or test the API directly with real automation scenarios

Navigate to a login page, find email/password fields semantically, and submit the form.

# No selectors. No vision. Stable semantic targets.
from sentience import SentienceBrowser, snapshot, find, click, type_text, wait_for

# Initialize browser with API key
browser = SentienceBrowser(api_key="sk_live_...")
browser.start()

# Navigate to login page
browser.page.goto("https://example.com/login")

# PERCEPTION: Find elements semantically
snap = snapshot(browser)
email_field = find(snap, "role=textbox text~'email'")
password_field = find(snap, "role=textbox text~'password'")
submit_btn = find(snap, "role=button text~'sign in'")

# ACTION: Interact with the page
type_text(browser, email_field.id, "user@example.com")
type_text(browser, password_field.id, "secure_password")
click(browser, submit_btn.id)

# VERIFICATION: Wait for navigation
wait_for(browser, "role=heading text~'Dashboard'", timeout=5.0)

print("✅ Login successful!")
browser.close()

🎯 Semantic Discovery

Find elements by role, text, and visual cues instead of fragile CSS selectors

⚡ Token Optimization

Intelligent filtering reduces token usage by up to 73% compared to sending unfiltered element data

🔒 Deterministic

Same input produces same output every time - no random failures