RAG & Knowledge Extraction

Turn any dynamic website into clean, LLM-ready Markdown. Power your RAG pipelines without cluttering context windows.

The HTML Noise Problem

Feeding raw HTML into an LLM is inefficient and error-prone, and standard scrapers often fail to capture the actual content at all:

Token Waste

Raw HTML is ~90% boilerplate (divs, classes, scripts). A single article can consume 15k+ tokens just for layout code.

Dynamic Content

Static parsers like BeautifulSoup only see the initial HTML response; they cannot see content rendered by JavaScript (React/Next.js/Vue apps), as the sketch below shows.
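
A minimal sketch of the failure mode, assuming a hypothetical client-rendered URL (`https://example.com/spa-article`) and nothing beyond the `requests` and `bs4` libraries:

```python
import requests
from bs4 import BeautifulSoup

# A client-rendered (e.g. React) page: the server returns only an app shell.
html = requests.get("https://example.com/spa-article").text
soup = BeautifulSoup(html, "html.parser")

print(len(html))                  # often megabytes of markup and script tags
print(soup.get_text(strip=True))  # frequently just "Loading..." -- the real
                                  # content only appears after JS executes
```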

The Solution: Intelligent Reader Engine

Our Reader Mode renders the full page, executes JavaScript, and distills the visual output into structured Markdown:

1. Full Rendering

We spin up a headless browser to load the page as a user sees it. This captures content hidden behind hydration, lazy loading, or client-side routing.

2. Noise Removal

Our algorithms identify and strip navigation bars, footers, ads, cookie banners, and "Recommended for You" widgets automatically.

3. Semantic Formatting

Content is returned as semantic Markdown (`# headers`, `| tables |`, `- lists`), ready for direct insertion into vector databases; see the illustration below.
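
As a contrived illustration (sample markup, not actual engine output), a page like this:

```html
<nav>Home | Products | Pricing</nav>
<div class="cookie-banner">We value your privacy...</div>
<article>
  <h1>Quarterly Results</h1>
  <p>Revenue grew 12% year over year.</p>
</article>
<footer>© 2025 Example Corp</footer>
```

is distilled down to just the article content:

```markdown
# Quarterly Results

Revenue grew 12% year over year.
```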

Workflow Examples

Common patterns for LLM applications

Ingestion for Vector DB

Scrape documentation or articles to build a knowledge base:

```python
def ingest_url(url):
    # 1. Fetch clean markdown (1 credit)
    response = sentience.observe(
        url=url,
        mode="read"  # <--- Activates Reader Engine
    )

    markdown_text = response["content"]

    # 2. Chunk and Embed
    chunks = split_text(markdown_text)
    vectors = openai.embed(chunks)

    # 3. Store
    pinecone.upsert(vectors)
```

✅ Outcome: High-quality embeddings without noise pollution.
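
The chunking and storage helpers in the example are placeholders for your own stack. A minimal sketch of `split_text` using fixed-size overlapping chunks (sizes are illustrative, not a recommendation):

```python
def split_text(text, chunk_size=800, overlap=100):
    """Naive fixed-size chunking with overlap.

    Real pipelines often split on Markdown headers instead, so each
    chunk stays semantically coherent -- one benefit of Reader Mode
    emitting `#` headers in the first place.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```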

"Chat with Website" Feature

Live browsing for AI agents to answer questions about a specific page:

```bash
curl -X POST https://api.sentienceapi.com/v1/observe \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "mode": "read"
  }'
```

✅ Outcome: Returns ~2k tokens of dense text instead of ~50k tokens of HTML.
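
The same call from Python, assuming only the `requests` library and the endpoint shown in the curl example:

```python
import requests

API_KEY = "YOUR_API_KEY"

resp = requests.post(
    "https://api.sentienceapi.com/v1/observe",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
        "mode": "read",
    },
    timeout=60,
)
page = resp.json()

# Ground the chat model in the distilled Markdown rather than raw HTML
prompt = f"Answer from this page:\n\n{page['content']}\n\nQuestion: ..."
```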

Sample Response

Clean, normalized Markdown ready for embedding or LLM consumption:

```json
{
  "status": "success",
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "title": "Artificial intelligence - Wikipedia",
  "content": "# Artificial intelligence\n\nArtificial intelligence (AI) is intelligence demonstrated by machines, in contrast to natural intelligence displayed by animals including humans. Leading AI textbooks define the field as the study of intelligent agents...\n\n## History\n\nThe field of AI research was founded at a workshop held on the campus of Dartmouth College during the summer of 1956...\n\n## Applications\n\n- Natural language processing\n- Computer vision\n- Robotics\n- Expert systems",
  "author": null,
  "published_date": null,
  "word_count": 1247,
  "reading_time_minutes": 6,
  "timestamp": "2025-12-12T10:30:00.123Z"
}
```

💡 Normalized Text: Special characters such as curly quotes (“ ”), em dashes (—), and HTML entities (`&amp;`) are automatically converted to standard ASCII equivalents. This saves ~15-20% of tokens in embeddings and prevents encoding issues in vector databases.
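
A quick way to check the effect on your own text, using the `tiktoken` tokenizer (a sketch; actual savings depend on the content):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Sample strings: raw text with curly quotes, an em dash, and an HTML
# entity, versus its ASCII-normalized form
raw = "“Smart” quotes — and &amp; entities"
normalized = '"Smart" quotes - and & entities'

print(len(enc.encode(raw)), len(enc.encode(normalized)))
```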

The "Reader Mode" Advantage

| | Raw HTML (Standard Scraping) | Sentience Reader Mode |
| --- | --- | --- |
| Payload Size | 2.4 MB | 15 KB |
| Context Usage | ~45,000 tokens | ~1,200 tokens |
| Readability | Low (mixed with code) | High (clean Markdown) |

97% Context Reduction

Fit 30+ pages into a single prompt window instead of just one: at ~1,200 tokens per page versus ~45,000 for raw HTML, the same context budget holds roughly 37 Reader Mode pages.