Sentiment Analysis

Extract clean text from reviews, social media, and articles. The output is well suited as input for GPT-4, Claude, and other models used for classification.

The Content Extraction Challenge

Sentiment analysis models need clean, readable text. Raw HTML and JavaScript-rendered content create noise that degrades accuracy:

HTML Noise

Navigation menus, footers, and ad scripts pollute the text. Classification models struggle to identify actual content vs. boilerplate.
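To make the noise problem concrete, here is a minimal sketch using Python's standard `html.parser`. The HTML snippet is made up for illustration. Naive tag stripping keeps every text node, so navigation labels and an upbeat footer banner end up mixed into a negative review.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect every text node, with no notion of what is real content."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep any non-empty text, regardless of which tag it sits in.
        if data.strip():
            self.chunks.append(data.strip())


# Hypothetical page: one review surrounded by navigation and footer text.
SAMPLE_HTML = """
<nav><a href="/">Home</a><a href="/deals">Deals</a></nav>
<main><p>The battery died after two weeks.</p></main>
<footer>Free shipping on all orders!</footer>
"""


def naive_text(html):
    """Strip tags and join all remaining text with spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


print(naive_text(SAMPLE_HTML))
```

A classifier given this string sees "Free shipping on all orders!" right next to the complaint, which is the kind of boilerplate that can skew a sentiment label.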

Dynamic Content

Client-side rendered reviews and comments are invisible to static scrapers. You miss the actual user-generated content.

The Solution: Clean Text Extraction

Reader Mode extracts only the semantic content, removing all HTML noise. The result is clean Markdown or plain text perfect for sentiment analysis:

1. Full Page Rendering

We render the complete page with JavaScript execution, capturing all dynamically loaded content including reviews, comments, and user-generated text.

2. Content Isolation

Our algorithms automatically strip navigation, ads, cookie banners, and other non-content elements. Only the actual article or review text remains.

3. Format Options

Choose Markdown (structured) or plain text (minimal) format. Both are optimized for LLM consumption and classification models.
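The format choice maps to a single field in the request body. The sketch below builds that body; the field names (`url`, `mode`, `format`) are the ones used in the workflow examples on this page, while the helper name `build_read_request` and the validation step are hypothetical additions, not part of the API.

```python
# Formats named on this page; treating any other value as an error is an assumption.
ALLOWED_FORMATS = {"markdown", "text"}


def build_read_request(url, fmt="text"):
    """Return the JSON body for a Reader Mode request."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}, got {fmt!r}")
    return {"url": url, "mode": "read", "format": fmt}


print(build_read_request("https://example.com/reviews", "markdown"))
```

Validating the format locally means a typo is caught before a request is sent.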

Workflow Examples

Common patterns for sentiment analysis pipelines

Product Review Analysis

Extract reviews from e-commerce sites and analyze sentiment:

import requests
from openai import OpenAI

# 1. Extract clean text from the review page
response = requests.post(
    "https://api.sentienceapi.com/v1/observe",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://www.amazon.com/product-reviews/B08N5WRWNW",
        "mode": "read",
        "format": "text",  # Plain text for cleaner analysis
    },
    timeout=30,
)
response.raise_for_status()  # Stop early on HTTP errors

data = response.json()
review_text = data["content"]

# 2. Analyze sentiment with GPT-4 (the client reads OPENAI_API_KEY from the environment)
client = OpenAI()
sentiment = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "system",
        "content": "Analyze sentiment: positive, negative, or neutral"
    }, {
        "role": "user",
        "content": review_text
    }]
)

print(f"Sentiment: {sentiment.choices[0].message.content}")

✅ Outcome: Clean review text without HTML noise, ready for accurate sentiment classification.
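A review page usually holds several reviews, and one label for the whole page hides the spread of opinions. The sketch below splits extracted plain text into individual reviews so each can be classified separately. It assumes each review starts with a header line such as "John D. - 5 stars" and that blocks are separated by blank lines; that layout is an assumption about the page, not a guaranteed output format, and `split_reviews` is a hypothetical helper.

```python
import re

# Assumed header layout: "John D. - 5 stars" alone in its own block.
HEADER = re.compile(r"^(?P<author>.+?) - (?P<stars>[1-5]) stars?$")


def split_reviews(content):
    """Split extracted plain text into one dict per review."""
    reviews = []
    current = None
    for block in content.split("\n\n"):
        block = block.strip()
        match = HEADER.match(block)
        if match:
            # A header block starts a new review.
            current = {"author": match["author"], "stars": int(match["stars"]), "text": ""}
            reviews.append(current)
        elif current is not None and block:
            # Any other block belongs to the most recent review.
            current["text"] = (current["text"] + " " + block).strip()
    return reviews


sample = (
    "Customer Reviews\n\nJohn D. - 5 stars\n\nGreat sound.\n\n"
    "Mike T. - 2 stars\n\nLeft earbud died.\n\nSlow support."
)
for review in split_reviews(sample):
    print(review)
```

Each `text` value can then be sent through the same chat completion call, and the star rating gives a rough check on the model's label.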

Batch Processing Articles

Process multiple articles for sentiment tracking:

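import anthropic
import requests

claude = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment


def analyze_with_claude(text):
    """Return Claude's one-word sentiment label for the given text."""
    message = claude.messages.create(
        model="YOUR_CLAUDE_MODEL",  # Replace with a current Claude model ID
        max_tokens=10,
        system="Classify the sentiment as positive, negative, or neutral. Reply with one word.",
        messages=[{"role": "user", "content": text}],
    )
    return message.content[0].text.strip()


urls = [
    "https://techcrunch.com/article1",
    "https://techcrunch.com/article2",
    "https://techcrunch.com/article3"
]

sentiments = []
for url in urls:
    # Extract content (1 credit per request)
    response = requests.post(
        "https://api.sentienceapi.com/v1/observe",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={"url": url, "mode": "read", "format": "text"},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()  # Parse the body once and reuse it

    # Analyze sentiment
    sentiment = analyze_with_claude(data["content"])
    sentiments.append({
        "url": url,
        "sentiment": sentiment,
        "word_count": data["word_count"]
    })

print(f"Processed {len(sentiments)} articles")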

✅ Outcome: Fast batch processing at 1 credit per article (~400ms each).
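For sentiment tracking, the per-article labels usually need to be normalized and counted. The sketch below assumes the batch results are dicts with a `sentiment` key holding the model's free-form reply, as in the loop above; `normalize_label` and `summarize` are hypothetical helpers, and mapping ambiguous replies to "unknown" is a design choice rather than anything the API prescribes.

```python
from collections import Counter

LABELS = ("positive", "negative", "neutral")


def normalize_label(reply):
    """Map a free-form model reply to exactly one label, else 'unknown'."""
    lowered = reply.lower()
    found = [label for label in LABELS if label in lowered]
    # Zero matches or several matches are both treated as ambiguous.
    return found[0] if len(found) == 1 else "unknown"


def summarize(results):
    """Count normalized labels across a list of batch results."""
    return Counter(normalize_label(item["sentiment"]) for item in results)


results = [
    {"url": "https://example.com/a", "sentiment": "Positive."},
    {"url": "https://example.com/b", "sentiment": "negative"},
    {"url": "https://example.com/c", "sentiment": "Mixed: positive and negative"},
]
print(summarize(results))
```

Keeping an explicit "unknown" bucket makes it easy to spot replies that need a stricter prompt or a manual look.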

Sample Response from Reader Service

Clean, normalized content ready for sentiment analysis:

{
  "status": "success",
  "url": "https://www.amazon.com/product-reviews/B08N5WRWNW",
  "title": "Customer reviews: Wireless Earbuds",
  "content": "Customer Reviews\n\nJohn D. - 5 stars\n\nThese earbuds are absolutely fantastic! The sound quality is incredible and the battery life lasts all day. I've been using them for a month now and couldn't be happier. The noise cancellation works perfectly on my daily commute. Highly recommend!\n\nSarah M. - 4 stars\n\nGood value for money. The sound is clear and the fit is comfortable. Only downside is the case is a bit bulky, but overall I'm satisfied with my purchase.\n\nMike T. - 2 stars\n\nDisappointed with the build quality. The left earbud stopped working after just two weeks. Customer service was slow to respond. Would not recommend.\n\nLisa K. - 5 stars\n\nBest earbuds I've ever owned! The bass is amazing and they stay in place during workouts. Worth every penny.\n\nRobert P. - 3 stars\n\nThey're okay, nothing special. The sound quality is decent but not exceptional. For the price, I expected more features.",
  "format": "text",
  "author": null,
  "published_date": null,
  "word_count": 187,
  "reading_time_minutes": 1,
  "timestamp": "2025-12-18T14:30:00.123Z"
}

💡 Clean Text: All HTML tags, navigation elements, and JavaScript code have been removed. Only the actual review content remains, making it perfect for sentiment analysis models. The word_count field helps you estimate token usage for your classification pipeline.
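As a rough illustration of that estimate, the sketch below converts `word_count` into an approximate token count. The ratio of about 133 tokens per 100 English words is a common rule of thumb, not a value from the API, and real counts vary by tokenizer and language; `estimate_tokens` and `fits_budget` are hypothetical helpers.

```python
import math


def estimate_tokens(word_count, tokens_per_100_words=133):
    """Approximate token count from a word count (heuristic ratio, rounded up)."""
    return math.ceil(word_count * tokens_per_100_words / 100)


def fits_budget(word_count, max_input_tokens):
    """Check whether a document is likely to fit within a token budget."""
    return estimate_tokens(word_count) <= max_input_tokens


print(estimate_tokens(187))
print(fits_budget(187, 8000))
```

For exact numbers, count tokens with the tokenizer of the model you call; the heuristic is only meant for quick budgeting before a request.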

Markdown Format Example

Markdown format preserves structure (headings, lists), which can give the model useful context:

{
  "status": "success",
  "url": "https://news.ycombinator.com/item?id=12345678",
  "title": "Show HN: I built a perception layer for AI agents",
  "content": "# Show HN: I built a perception layer for AI agents\n\nWe're excited to share SentienceAPI, a new tool that helps AI agents understand web pages...\n\n## Features\n\n- Fast content extraction\n- Visual element detection\n- Clean markdown output\n\n## Comments\n\n**User123:** This looks amazing! Can't wait to try it.\n\n**User456:** Great work! The API is really fast.\n\n**User789:** Not sure about the pricing model, but the tech is solid.",
  "format": "markdown",
  "author": null,
  "published_date": null,
  "word_count": 89,
  "reading_time_minutes": 1,
  "timestamp": "2025-12-18T14:30:00.123Z"
}

💡 Structured Content: Markdown format preserves headings, lists, and emphasis, which can provide additional context for sentiment analysis. Use format: "text" for minimal output or format: "markdown" for structured content.
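One way to use that structure is to pull individual comments out of the Markdown and classify each on its own. The sketch below assumes comments appear as lines of the form `**Username:** comment text`, as in the sample response; that layout depends on the source page and is not guaranteed, and `extract_comments` is a hypothetical helper.

```python
import re

# Assumed comment layout: a bold "Username:" prefix followed by the comment text.
COMMENT = re.compile(r"^\*\*(?P<user>[^*]+):\*\*\s*(?P<text>.+)$", re.MULTILINE)


def extract_comments(markdown):
    """Return (user, text) pairs for every comment line in the Markdown."""
    return [(m["user"], m["text"].strip()) for m in COMMENT.finditer(markdown)]


sample = (
    "# Show HN: Example\n\n## Comments\n\n"
    "**User123:** This looks amazing!\n\n"
    "**User789:** Not sure about the pricing model, but the tech is solid."
)
for user, text in extract_comments(sample):
    print(f"{user}: {text}")
```

Headings and list items are ignored by the pattern, so only user comments reach the classifier.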

Why Reader Mode for Sentiment Analysis?

Standard Scraping

Content Quality: Mixed (HTML + Text)

Accuracy Impact: Lower (Noise pollution)

Processing Time: Slower (More tokens)

Reader Mode

Content Quality: Pure Text/Markdown

Accuracy Impact: Higher (Clean input)

Processing Time: Faster (~400ms)

90% Token Reduction

Process more content in each API call, reducing costs and improving accuracy.