Turn any dynamic website into clean, LLM-ready Markdown. Power your RAG pipelines without cluttering context windows.
Feeding raw HTML into an LLM is inefficient and error-prone, and standard scraping runs into two problems:

- Token Waste: Raw HTML is ~90% boilerplate (divs, classes, scripts). A single article can consume 15k+ tokens just for layout code (a quick way to measure this is sketched below).
- Dynamic Content: Standard libraries like BeautifulSoup cannot see content rendered by JavaScript (React/Next.js/Vue apps).
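The overhead is easy to check for yourself. A minimal sketch, assuming `requests` and `tiktoken` are installed (the URL is only an example):

```python
# Rough sanity check of the token-waste claim (illustrative, not part of the API).
import requests
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
html = requests.get("https://en.wikipedia.org/wiki/Artificial_intelligence").text

# The raw HTML of a single article typically runs to tens of thousands of tokens.
print(f"Raw HTML: {len(enc.encode(html)):,} tokens")
```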
Our Reader Mode renders the full page, executes JavaScript, and distills the visual output into structured Markdown:
- We spin up a headless browser to load the page as a user sees it. This captures content hidden behind hydration, lazy loading, or client-side routing (the sketch after this list shows the general idea).
- Our algorithms identify and strip navigation bars, footers, ads, cookie banners, and "Recommended for You" widgets automatically.
- Content is returned as semantic Markdown (`# headers`, `| tables |`, `- lists`), perfect for direct insertion into vector databases.
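The rendering step is the same technique you would reach for with a headless browser library. The sketch below uses Playwright purely as an illustration of the idea; it is not our implementation, and it skips the extraction and Markdown conversion that Reader Mode performs afterwards:

```python
# Illustrative only: a bare-bones headless render with Playwright.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Wait for network activity to settle so client-side rendering has finished.
    page.goto("https://example.com/spa-article", wait_until="networkidle")
    rendered_html = page.content()  # the DOM after JavaScript has executed
    browser.close()
```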
Common patterns for LLM applications
Scrape documentation or articles to build a knowledge base:
```python
def ingest_url(url):
    # 1. Fetch clean markdown (1 credit)
    response = sentience.observe(
        url=url,
        mode="read"  # <--- Activates Reader Engine
    )

    markdown_text = response["content"]

    # 2. Chunk and Embed
    chunks = split_text(markdown_text)
    vectors = openai.embed(chunks)

    # 3. Store
    pinecone.upsert(vectors)
```

✅ Outcome: High-quality embeddings without noise pollution.
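`split_text` above is a placeholder; the API does not ship a chunker. A minimal sketch of one, grouping Markdown paragraphs into roughly fixed-size chunks:

```python
def split_text(markdown_text, max_chars=2000):
    """Hypothetical helper: groups Markdown paragraphs into ~max_chars chunks."""
    chunks, current = [], ""
    for paragraph in markdown_text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

In practice you would likely chunk by token count and add overlap, but because the input is already clean Markdown, paragraph boundaries map well onto chunk boundaries.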
Live browsing for AI agents to answer questions about a specific page:
```bash
curl -X POST https://api.sentienceapi.com/v1/observe \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "mode": "read"
  }'
```

✅ Outcome: Returns ~2k tokens of dense text instead of ~50k tokens of HTML.
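The same request from Python, using nothing beyond the endpoint shown above and the `requests` library (the API key is a placeholder):

```python
import requests

response = requests.post(
    "https://api.sentienceapi.com/v1/observe",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
        "mode": "read",
    },
)
page = response.json()
print(page["content"][:500])  # clean Markdown, ready to drop into a prompt
```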
Clean, normalized Markdown ready for embedding or LLM consumption:
```json
{
  "status": "success",
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "title": "Artificial intelligence - Wikipedia",
  "content": "# Artificial intelligence\n\nArtificial intelligence (AI) is intelligence demonstrated by machines, in contrast to natural intelligence displayed by animals including humans. Leading AI textbooks define the field as the study of intelligent agents...\n\n## History\n\nThe field of AI research was founded at a workshop held on the campus of Dartmouth College during the summer of 1956...\n\n## Applications\n\n- Natural language processing\n- Computer vision\n- Robotics\n- Expert systems",
  "author": null,
  "published_date": null,
  "word_count": 1247,
  "reading_time_minutes": 6,
  "timestamp": "2025-12-12T10:30:00.123Z"
}
```

💡 Normalized Text: Special characters like curly quotes (“ ”), em dashes (—), and HTML entities (such as `&amp;`) are automatically converted to standard ASCII equivalents. This saves ~15-20% of tokens in embeddings and prevents encoding issues in vector databases.
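As a rough illustration of what that normalization involves (the exact rules the API applies may differ), using only Python's standard library:

```python
import html

# Map common typographic characters to ASCII equivalents (illustrative subset).
ASCII_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2018": "'", "\u2019": "'",  # curly single quotes
    "\u2014": "-", "\u2013": "-",  # em and en dashes
    "\u00a0": " ",                 # non-breaking space
})

def normalize(text):
    """Decode HTML entities, then map typographic punctuation to ASCII."""
    return html.unescape(text).translate(ASCII_MAP)

print(normalize("AI \u2014 it\u2019s &amp; more"))  # AI - it's & more
```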
| | Raw HTML (Standard Scraping) | Sentience Reader Mode |
|---|---|---|
| Payload Size | 2.4 MB | 15 KB |
| Context Usage | ~45,000 tokens | ~1,200 tokens |
| Readability | Low (mixed with code) | High (clean Markdown) |
97% Context Reduction: Fit 30+ pages into a single prompt window instead of just one.