Extract clean text from reviews, social media, and articles, ready to feed to GPT-4, Claude, or any other model you use for classification.
Sentiment analysis models need clean, readable text. Raw HTML and JavaScript-rendered content create noise that degrades accuracy:
HTML Noise
Navigation menus, footers, and ad scripts pollute the text. Classification models struggle to identify actual content vs. boilerplate.
Dynamic Content
Client-side rendered reviews and comments are invisible to static scrapers. You miss the actual user-generated content.
Reader Mode extracts only the semantic content, removing all HTML noise. The result is clean Markdown or plain text perfect for sentiment analysis:
We render the complete page with JavaScript execution, capturing all dynamically loaded content including reviews, comments, and user-generated text.
Our algorithms automatically strip navigation, ads, cookie banners, and other non-content elements. Only the actual article or review text remains.
Choose Markdown (structured) or plain text (minimal) format. Both are optimized for LLM consumption and classification models.
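The format choice comes down to a single request field. As a minimal sketch, the helper below builds the JSON body used by the examples on this page. `build_read_request` is a hypothetical name, not part of any SDK, and the set of accepted `format` values is assumed from the two formats described above.

```python
# Assumed from the two formats described above; the API may accept others.
ALLOWED_FORMATS = {"markdown", "text"}


def build_read_request(url: str, fmt: str = "text") -> dict:
    """Build the JSON body for a Reader Mode request (hypothetical helper)."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}, got {fmt!r}")
    return {"url": url, "mode": "read", "format": fmt}


payload = build_read_request("https://example.com/reviews", fmt="markdown")
print(payload)
```

Validating the format locally catches typos before a request is sent and a credit is spent.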
Common patterns for sentiment analysis pipelines
Extract reviews from e-commerce sites and analyze sentiment:
    import requests
    from openai import OpenAI

    # 1. Extract clean text from the review page
    response = requests.post(
        "https://api.sentienceapi.com/v1/observe",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "url": "https://www.amazon.com/product-reviews/B08N5WRWNW",
            "mode": "read",
            "format": "text",  # Plain text for cleaner analysis
        },
        timeout=30,
    )
    response.raise_for_status()  # Stop early on HTTP errors

    data = response.json()
    review_text = data["content"]

    # 2. Analyze sentiment with GPT-4
    client = OpenAI()  # Reads OPENAI_API_KEY from the environment
    sentiment = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Analyze sentiment: positive, negative, or neutral"},
            {"role": "user", "content": review_text},
        ],
    )

    print(f"Sentiment: {sentiment.choices[0].message.content}")

✅ Outcome: Clean review text without HTML noise, ready for accurate sentiment classification.
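A model prompted this way may answer with a full sentence rather than a bare label. The sketch below shows one way to map a free-text reply onto the three labels from the system prompt. `normalize_label` is a hypothetical helper, and the matching is deliberately naive: it picks whichever label word appears first and does not understand negation such as "not positive".

```python
LABELS = ("positive", "negative", "neutral")


def normalize_label(reply: str) -> str:
    """Map a free-text model reply to one of LABELS, or "unknown"."""
    text = reply.strip().lower()
    # Record where each label word first appears, then keep the earliest one.
    found = [(text.find(label), label) for label in LABELS if label in text]
    return min(found)[1] if found else "unknown"


print(normalize_label("Positive."))  # positive
print(normalize_label("Overall the sentiment is negative, with some neutral remarks."))  # negative
print(normalize_label("I cannot tell."))  # unknown
```

Storing a normalized label instead of the raw reply keeps downstream counts and dashboards consistent.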
Process multiple articles for sentiment tracking:
    import requests

    urls = [
        "https://techcrunch.com/article1",
        "https://techcrunch.com/article2",
        "https://techcrunch.com/article3",
    ]

    sentiments = []
    for url in urls:
        # Extract content (1 credit per request)
        response = requests.post(
            "https://api.sentienceapi.com/v1/observe",
            headers={"Authorization": "Bearer YOUR_API_KEY"},
            json={"url": url, "mode": "read", "format": "text"},
            timeout=30,
        )
        response.raise_for_status()  # Stop early on HTTP errors
        data = response.json()  # Parse the body once and reuse it

        # Analyze sentiment. analyze_with_claude is a placeholder for your own
        # function that sends the text to Claude and returns a label.
        sentiment = analyze_with_claude(data["content"])
        sentiments.append({
            "url": url,
            "sentiment": sentiment,
            "word_count": data["word_count"],
        })

    print(f"Processed {len(sentiments)} articles")

✅ Outcome: Fast batch processing at 1 credit per article (~400ms each).
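For sentiment tracking, the per-article results usually need to be rolled up. The following is a small sketch that counts labels across a batch; `summarize` is a hypothetical helper, and the sample records are illustrative data in the same shape as the `sentiments` list above, not real results.

```python
from collections import Counter


def summarize(results: list[dict]) -> dict:
    """Count each sentiment label and compute its share of the batch."""
    counts = Counter(r["sentiment"] for r in results)
    total = sum(counts.values())
    return {label: {"count": n, "share": n / total} for label, n in counts.items()}


# Illustrative records only, shaped like the entries appended in the loop above.
sample = [
    {"url": "https://example.com/a", "sentiment": "positive", "word_count": 120},
    {"url": "https://example.com/b", "sentiment": "negative", "word_count": 340},
    {"url": "https://example.com/c", "sentiment": "positive", "word_count": 95},
    {"url": "https://example.com/d", "sentiment": "neutral", "word_count": 210},
]

summary = summarize(sample)
print(summary["positive"])  # {'count': 2, 'share': 0.5}
```

Running the same summary on each day's batch gives a simple time series of label shares.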
Clean, normalized content ready for sentiment analysis:
    {
      "status": "success",
      "url": "https://www.amazon.com/product-reviews/B08N5WRWNW",
      "title": "Customer reviews: Wireless Earbuds",
      "content": "Customer Reviews\n\nJohn D. - 5 stars\n\nThese earbuds are absolutely fantastic! The sound quality is incredible and the battery life lasts all day. I've been using them for a month now and couldn't be happier. The noise cancellation works perfectly on my daily commute. Highly recommend!\n\nSarah M. - 4 stars\n\nGood value for money. The sound is clear and the fit is comfortable. Only downside is the case is a bit bulky, but overall I'm satisfied with my purchase.\n\nMike T. - 2 stars\n\nDisappointed with the build quality. The left earbud stopped working after just two weeks. Customer service was slow to respond. Would not recommend.\n\nLisa K. - 5 stars\n\nBest earbuds I've ever owned! The bass is amazing and they stay in place during workouts. Worth every penny.\n\nRobert P. - 3 stars\n\nThey're okay, nothing special. The sound quality is decent but not exceptional. For the price, I expected more features.",
      "format": "text",
      "author": null,
      "published_date": null,
      "word_count": 187,
      "reading_time_minutes": 1,
      "timestamp": "2025-12-18T14:30:00.123Z"
    }

💡 Clean Text: All HTML tags, navigation elements, and JavaScript code have been removed. Only the actual review content remains, making it perfect for sentiment analysis models. The word_count field helps you estimate token usage for your classification pipeline.
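When a page holds several reviews, classifying each one separately is usually more informative than scoring the whole page at once. The sketch below splits the `content` string into individual reviews. It assumes the "Name - N stars" header layout seen in the sample response above; real pages may be laid out differently, so treat the pattern as a starting point. `split_reviews` is a hypothetical helper.

```python
import re

# Matches header blocks such as "John D. - 5 stars" (layout assumed from the sample).
HEADER = re.compile(r"^(?P<author>.+?) - (?P<stars>\d) stars?$")


def split_reviews(content: str) -> list[dict]:
    """Split Reader Mode text into one record per review."""
    reviews = []
    for block in content.split("\n\n"):
        block = block.strip()
        match = HEADER.match(block)
        if match:
            # A header block starts a new review.
            reviews.append({"author": match["author"], "stars": int(match["stars"]), "text": ""})
        elif reviews and block:
            # Any other block is body text for the most recent review.
            current = reviews[-1]
            current["text"] = f"{current['text']}\n\n{block}".strip()
    return reviews


sample = (
    "Customer Reviews\n\n"
    "John D. - 5 stars\n\nThese earbuds are absolutely fantastic!\n\n"
    "Mike T. - 2 stars\n\nDisappointed with the build quality."
)
reviews = split_reviews(sample)
for review in reviews:
    print(review["stars"], review["author"], "->", review["text"])
```

The parsed star rating can also serve as a rough sanity check on the label the model returns for each review.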
The Markdown format preserves structure (headings, lists), which can give the model additional context:
    {
      "status": "success",
      "url": "https://news.ycombinator.com/item?id=12345678",
      "title": "Show HN: I built a perception layer for AI agents",
      "content": "# Show HN: I built a perception layer for AI agents\n\nWe're excited to share SentienceAPI, a new tool that helps AI agents understand web pages...\n\n## Features\n\n- Fast content extraction\n- Visual element detection\n- Clean markdown output\n\n## Comments\n\n**User123:** This looks amazing! Can't wait to try it.\n\n**User456:** Great work! The API is really fast.\n\n**User789:** Not sure about the pricing model, but the tech is solid.",
      "format": "markdown",
      "author": null,
      "published_date": null,
      "word_count": 89,
      "reading_time_minutes": 1,
      "timestamp": "2025-12-18T14:30:00.123Z"
    }

💡 Structured Content: Markdown format preserves headings, lists, and emphasis which can provide additional context for sentiment analysis. Use format: "text" for minimal output or format: "markdown" for structured content.
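One practical use of the preserved structure is isolating the part of the page you actually want to score. As a sketch, the code below pulls the body of a level-2 section out of the Markdown and parses the bold-name comment lines. It assumes the "## Comments" heading and the `**User:** text` layout shown in the sample response; `section` and `extract_comments` are hypothetical helpers.

```python
import re

# Matches lines such as "**User123:** This looks amazing!" (layout assumed from the sample).
COMMENT = re.compile(r"^\*\*(?P<user>[^*:]+):\*\*\s*(?P<text>.+)$", re.MULTILINE)


def section(markdown: str, heading: str) -> str:
    """Return the body under a level-2 heading, up to the next level-2 heading."""
    pattern = re.compile(
        rf"^## {re.escape(heading)}\s*\n(?P<body>.*?)(?=^## |\Z)",
        re.MULTILINE | re.DOTALL,
    )
    match = pattern.search(markdown)
    return match["body"].strip() if match else ""


def extract_comments(markdown: str) -> list[dict]:
    """Parse user/text pairs from the Comments section."""
    return [m.groupdict() for m in COMMENT.finditer(section(markdown, "Comments"))]


sample = (
    "# Title\n\nIntro text.\n\n"
    "## Features\n\n- Fast content extraction\n\n"
    "## Comments\n\n"
    "**User123:** This looks amazing!\n\n"
    "**User789:** Not sure about the pricing model."
)
comments = extract_comments(sample)
for comment in comments:
    print(comment["user"], "->", comment["text"])
```

Scoring each comment on its own keeps the author's post from skewing the sentiment of the replies.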
Standard Scraping
Content Quality: Mixed (HTML + Text)
Accuracy Impact: Lower (Noise pollution)
Processing Time: Slower (More tokens)
Reader Mode
Content Quality: Pure Text/Markdown
Accuracy Impact: Higher (Clean input)
Processing Time: Faster (~400ms)
90% Token Reduction
Process more content in each API call, reducing costs and improving accuracy.
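To decide how much content fits in one model call, the `word_count` field from each response can be turned into a rough token estimate. The sketch below assumes about four tokens per three English words, which is a common rule of thumb and not a measured value for any particular model; use the model's own tokenizer when exact counts matter. `estimate_tokens` and `pack_batches` are hypothetical helpers.

```python
import math


def estimate_tokens(word_count: int) -> int:
    """Rough estimate: about 4 tokens per 3 English words (an assumption)."""
    return math.ceil(word_count * 4 / 3)


def pack_batches(docs: list[dict], token_budget: int) -> list[list[dict]]:
    """Greedily group documents so each batch's estimate stays within the budget."""
    batches, current, used = [], [], 0
    for doc in docs:
        cost = estimate_tokens(doc["word_count"])
        if current and used + cost > token_budget:
            # Adding this document would exceed the budget, so start a new batch.
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += cost
    if current:
        batches.append(current)
    return batches


# Illustrative word counts; the first two match the sample responses on this page.
docs = [{"id": "a", "word_count": 187}, {"id": "b", "word_count": 89}, {"id": "c", "word_count": 300}]
batches = pack_batches(docs, token_budget=400)
print([[d["id"] for d in batch] for batch in batches])  # [['a', 'b'], ['c']]
```

A document whose estimate alone exceeds the budget still gets a batch of its own, so it needs separate handling such as truncation or chunking.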