Real-time news aggregation system that continuously scrapes search engines (Google by default, with pluggable Bing/Yahoo/Brave backups) — the News tab, not Google News — for any topics (search keywords) and streams updates via WebSocket.
Google News (https://news.google.com) and Google News RSS (https://news.google.com/rss?search=<keyword>) provide curated news collections based on Google's algorithms. While convenient, they have limitations:
- Results are not necessarily the latest - articles may be hours or days old
- Google filters by quality and relevance, potentially missing breaking news
- No control over what Google considers "newsworthy"

Google News Search result - hours or days old

Google News RSS - same as Google News search
TopicStreams scrapes search engines' News results with time filters — Google Search's News tab by default, plus Bing/Yahoo/Brave — giving you:
- Real-time results - All news the engine indexes, regardless of quality rating
- Unfiltered access - No curation, you decide what's relevant
- Near-instant updates - Scrape frequently enough and catch news as it breaks
- Full control - Customize topics (search keywords) and scrape intervals
- Multiple engines - Pluggable search sources (Google, Bing, Yahoo, Brave) with fallback/all/rotate strategies; see Search Engines

Google Search News Tab - Latest, Unfiltered Results
Experience TopicStreams in action: http://topicstreams.dongziyu.com
# Add topics (ensure they exist)
curl -X POST http://topicstreams.dongziyu.com/api/v1/topics \
-H "Content-Type: application/json" \
-d '{"name": "Bitcoin"}'
# List all active topics (contain "bitcoin")
curl http://topicstreams.dongziyu.com/api/v1/topics | jq
# Get latest news for "Bitcoin"
curl http://topicstreams.dongziyu.com/api/v1/news/bitcoin?limit=5 | jqFor real-time news updates, connect via WebSocket:
# Real-time WebSocket news stream for an existing topic
# (add the topic first via POST /api/v1/topics — the WS doesn't create topics)
websocat ws://topicstreams.dongziyu.com/api/v1/ws/news/china | jqThe WebSocket delivers live news updates as they're scraped, showing the same content you'd see by continuously refreshing Google's news search page.

WebSocket Real-time News Stream - Live updates as articles are scraped
- Real-time news streaming on customizable topics (any search keywords)
- Self-hosted - No third-party news API costs
- Search-engine dependency - Black box algorithms, no source control, variable indexing speed, geographic filtering (Google is the default engine; Bing/Yahoo/Brave have the same trade-off)
- Inconsistent Results - Same queries return different results based on IP, geolocation, browser, A/B testing
- No Quality Control - All news included, credible or not
- Access Risks - Engines may detect scraping and rate limit or block access; mitigations: Anti-Bot Detection and adaptive per-engine cooldown
- Real-time News Aggregation - Continuously scrapes search engines' News results (Google Search's News tab by default — not the Google News site — plus Bing/Yahoo/Brave) for the latest articles
- Multi-Topic Tracking - Monitor multiple news topics simultaneously with configurable scrape intervals
- WebSocket Streaming - Subscribe to live news updates per topic via WebSocket connections
- REST API - Manage topics and retrieve historical news entries through HTTP endpoints
- Anti-Bot Detection - Native Chromium with runtime-derived browser fingerprinting and automation-signal hardening (playwright-stealth is deliberately disabled — Google detects its JS patches) (details)
TopicStreams consists of three main components:
┌─────────────────────────┐
│ Client │
│ (REST API / WebSocket) │
└────────────┬────────────┘
│
▼
┌─────────────────────────┐ ┌──────────────────────────────┐
│ FastAPI Server │ │ Scraper Service │
│ │ │ │
│ - REST endpoints │ │ - Per-engine parallel │
│ - WebSocket streams │ │ workers (Playwright) │
│ - PostgreSQL listener │ │ - BeautifulSoup parser │
└────────────┬────────────┘ └─────────────┬────────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ PostgreSQL Database │
│ │
│ - Topics (tracked keywords) │
│ - News Entries (scraped articles) │
│ - Scraper Logs (monitoring) │
│ - LISTEN/NOTIFY for real-time updates │
└─────────────────────────────────────────────────────────────┘
- Scraper Service runs one parallel worker per configured search engine (Google's News tab by default, plus Bing/Yahoo/Brave), each continuously sweeping the tracked topics at its own paced rate
- New articles are inserted into PostgreSQL with automatic deduplication
- Database triggers send NOTIFY events on new inserts
- FastAPI Server listens for these events via PostgreSQL's LISTEN/NOTIFY
- Updates are pushed to connected WebSocket clients in real-time
- Clients can also fetch historical data via REST API
- FastAPI - Web framework for REST and WebSocket
- Playwright - Browser automation with anti-bot detection (see how it works)
- PostgreSQL - Reliable storage with LISTEN/NOTIFY for real-time events
- Docker - Container orchestration for easy deployment
- Docker installed on your system
That's it! All dependencies (Python, PostgreSQL, Playwright browsers) are handled inside containers.
Optional: Install websocat for WebSocket testing (used for demo in this article), or use any WebSocket client you prefer.
TopicStreams ships a responsive Web UI styled as a news wire desk — a single live transmission feed indexed by time, in light or dark themes.
- The Wire - One chronological stream of all watched topics, newest first. Live entries arrive over WebSocket and prepend at the top; scrolling down loads earlier entries indefinitely (cursor pagination). Filter to a single topic or view all. A ↑ latest button appears once you scroll down, jumping back to the live edge.
- Auto-watched topics - Every topic you track is watched automatically (no per-topic subscribe step). Topics show as removable slugs; add one from the same row.
- Scrape-health indicator - The masthead status reflects the real state of the scraper, derived from recent scraper logs:
live,degraded(some feeds failing),errors(all feeds failing),stalled(no recent scrapes),idle(none yet), oroffline(API unreachable). Hover it for detail. - Light / dark themes - Toggle in the masthead; defaults to your system preference and is remembered across visits.
After Quick Start, simply open your browser and navigate to:
http://localhost:5000
Note: By default, the Web UI is accessible on port 5000. If you changed
HOST_PORTin your.envfile (e.g., set to80for production), use that port instead (e.g.,http://localhost:80).
TopicStreams Web UI — "the wire": a live, time-indexed news transmission feed
Scrape-health is server-computed (
GET /api/v1/status) from recentscraper_logs, refreshed every 30s. It detects selector rot: if Google changes its News-tab markup so scrapes return HTTP 200 but parse 0 entries across the board, the masthead showsno items(stateparsing) instead of silently readinglive. A single topic with genuinely no news this hour is safe — it only trips when every recent scrape parses nothing.
A built-in observability console at /monitor (linked from the wire masthead) gives a per-engine, per-cycle view of scraper health — no Prometheus/Grafana stack required, since the data already lives in Postgres. It polls GET /api/v1/metrics over a selectable window (1h / 6h / 24h) and shows:
- Overall strip - active topics, total filed, scrape success rate, feed freshness, last-cycle duration, and scrapes-in-window (blocked / failed).
- Engines table - one row per engine with a health label (
healthy/degraded/blocked/parsing/cooldown/idle), success %, fetch latency (avg / p95), items parsed, 0-parse count (selector-rot signal), blocks (429/403/503), and last HTTP status.blockednow also covers connection-level teardowns with no HTTP status (e.g. Yahoo'sERR_CONNECTION_CLOSED). - Recent cycles - a sparkline of per-cycle durations plus a list (duration, topics, parsed, new events).
A throttled engine shows up here rather than silently vanishing from the feed. When the scraper benches an engine, the table shows cooldown with a countdown to the next probe — so an engine that produces no scrapes while cooling stays visible instead of disappearing. See Observability for details.
The /monitor ops page — per-engine health, latency, and the recent-cycle timeline
git clone https://github.com/zydo/topicstreams.git
cd topicstreamsCreate your .env first — the stack fails fast with a clear message if it's missing:
cp .env.example .env
docker compose up -dThe defaults in .env.example work out-of-the-box; edit .env to customize ports, credentials, or the optional API auth token(s) (see Authentication & Security). config.yml is created from its .yml.example template on first run, so you only need to copy it when you want to change scraper or API settings:
cp config.yml.example config.ymlThis will start three containers:
- postgres - Database
- scraper - Background scraping service
- api - FastAPI server http://localhost:5000 (or port set by
HOST_PORTin.env)
# Add a topic (replace 5000 with your HOST_PORT if changed)
curl -X POST http://localhost:5000/api/v1/topics \
-H "Content-Type: application/json" \
-d '{"name": "artificial intelligence"}'Scraping of the topic will start on the next iteration.
WebSocket (for real-time):
# Using websocat
websocat ws://localhost:5000/api/v1/ws/news/artificial+intelligence
# Or with jq for prettier formatted output
websocat ws://localhost:5000/api/v1/ws/news/artificial+intelligence | jqREST API (for historical data):
# Get the latest 5 news entries for a topic (newest first)
curl "http://localhost:5000/api/v1/news/artificial+intelligence?limit=5" | jq
# Page back to older entries with the cursor from the previous response
curl "http://localhost:5000/api/v1/news/artificial+intelligence?limit=5&before_id=104" | jq
# Latest 5 across all topics
curl "http://localhost:5000/api/v1/news?limit=5" | jq
# List all actively scraping topics
curl http://localhost:5000/api/v1/topics | jq
# List recent 10 scraper logs (each log represents one search-engine page load - typically up to 10 news entries)
curl http://localhost:5000/api/v1/logs?limit=10 | jqSee the API Reference section below for complete endpoint documentation.
# Background scraper logs
docker compose logs -f scraper
# FastAPI server logs
docker compose logs -f apidocker compose downFor complete configuration documentation including environment variables, YAML files, and browser fingerprinting settings, see Configuration.
Quick links:
- Environment variables - Database and API settings in .env
- Scraper settings - scrape_interval, max_pages, engines, pacing, cooldown, saturation
- Anti-detection settings - Browser fingerprinting and stealth strategies
- Reloading config - How to apply configuration changes
TopicStreams uses sophisticated techniques to make the scraper appear as a real human user, minimizing the risk of being blocked by Google.
For detailed information about anti-detection strategies (Playwright stealth, browser fingerprinting, random delays, etc.), see Anti-Bot Detection Documentation.
Quick Reference:
- All anti-detection strategies are configurable via
config.yml(auto-created from the template on first run) - See Configuration for YAML configuration details
For detailed information about scraping behavior, monitoring, and scaling strategies, see Scraping Behavior.
Quick links:
- Per-engine parallel workers - Each engine runs concurrently in its own worker; topics within an engine stay sequential and paced
- Proactive pacing vs. cooldown - Per-engine pace floor is the primary throttle; cooldown is the reactive backstop
- Scrape interval - How scrape_interval sets each worker's sweep period
- Exit-IP saturation - When the signal says to scale out to another IP
- Monitoring - Track scraper performance
- Proxy support - Route the scraper through residential/mobile proxies (in practice required — Google blocks direct automated access to the News tab)
Ships a few built-in controls (optional Bearer-token auth on the REST API, per-IP rate limiting, CORS, WS that can't create topics); beyond them it assumes a localhost/LAN or behind-a-reverse-proxy deployment. See Authentication & Security for what's covered and further hardening (JWT/OAuth2, Cloudflare).
Auth is off by default (dev mode) — every endpoint is open. Once any token is
configured, all REST endpoints require an Authorization: Bearer <token>
header matching a valid token. (WebSocket connections are not yet authenticated.)
Valid tokens come from two sources, used together:
| Source | Where | Takes effect | Best for |
|---|---|---|---|
| Env bootstrap | TOPICSTREAMS_API_KEY in .env (comma-separated) |
On container recreate | A break-glass/admin key that always works |
| DB-backed keys | api_keys table, managed via CLI |
Live, within ~30s, no restart | Day-to-day keys you add/revoke per client |
Generate a token with the bundled helper (cryptographically strong, URL-safe):
python scripts/generate_api_key.py # one token
python scripts/generate_api_key.py -n 3 # three, comma-separated
python scripts/generate_api_key.py --bytes 48 # stronger (more entropy)Option A — env bootstrap key (.env). Survives DB loss and is the
recommended way to seed the first/admin token. Editing it needs a container
recreate:
# .env
TOPICSTREAMS_API_KEY=admin-tok
docker compose up -d api # recreates the container with the new env
docker compose restart apiis not enough for.envchanges: it's injected viaenv_fileat container-creation time, so a plain restart keeps the old value. Useup -d(add--force-recreateif it reports "up to date").
Option B — DB-backed keys (live, no restart). Add/disable tokens with the
management CLI; the API picks them up within api_key_cache_ttl_seconds
(default 30s) — no restart:
# add a key (mints + prints a token; use --label to name it)
docker compose exec api python scripts/manage_api_keys.py add --label alice
# list keys (tokens are masked)
docker compose exec api python scripts/manage_api_keys.py listOnce auth is on, pass the token in the Authorization: Bearer <token> header on
every REST request:
TOKEN=alice-tok # an env bootstrap token or a DB-backed one
# List topics
curl -H "Authorization: Bearer $TOKEN" \
http://localhost:5000/api/v1/topics | jq
# Add a topic
curl -X POST http://localhost:5000/api/v1/topics \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "artificial intelligence"}'
# Get latest news for a topic
curl -H "Authorization: Bearer $TOKEN" \
"http://localhost:5000/api/v1/news/artificial+intelligence?limit=5" | jq
# Delete (stop tracking) a topic
curl -X DELETE \
-H "Authorization: Bearer $TOKEN" \
http://localhost:5000/api/v1/topics/artificial+intelligenceWithout a valid token, protected requests return 401 Unauthorized:
{ "error": "UNAUTHORIZED", "message": "Invalid or missing API token", "status": "error" }Web UI: when auth is on, the UI (and the /monitor page) prompts once for a
token, stores it in your browser (localStorage), and sends it on every request.
WebSocket streams (
/api/v1/ws/...) are not authenticated yet, so they take no token — see the caveat under Accessing Real-Time News.
DB-backed keys (live — recommended for day-to-day). Managed with
scripts/manage_api_keys.py; changes apply within api_key_cache_ttl_seconds
(default 30s) with no restart:
# inside the API container (shares the DB connection settings):
docker compose exec api python scripts/manage_api_keys.py add --label alice # add + print a token
docker compose exec api python scripts/manage_api_keys.py list # list (id, label, masked token, active)
docker compose exec api python scripts/manage_api_keys.py disable 3 # revoke key #3
docker compose exec api python scripts/manage_api_keys.py enable 3 # re-enable key #3
docker compose exec api python scripts/manage_api_keys.py delete 3 # remove key #3- Add / revoke —
add/disable(by id fromlist); effective within the cache TTL, no restart. - Rotate —
adda new key, move clients over, thendisable/deletethe old one. - Hand a distinct key (with a
--label) to each client so you can revoke one without disrupting the rest.
Env bootstrap key (TOPICSTREAMS_API_KEY). Comma-separated; an always-valid
fallback that's independent of the DB. Any change requires a container
recreate (docker compose up -d api) — a plain docker compose restart api
keeps the old value. Unset it (and have no active DB keys) to return to open dev
mode.
The two sources are unioned: the env key is your break-glass/admin token (works even if the DB is empty or a brand-new key hasn't propagated yet), while DB keys are the live-manageable set. Tokens are stored in plaintext and there's no per-token usage audit yet — for stronger key management put the API behind a gateway (see further hardening below).
Quick links:
- Built-in controls - Bearer-token auth, rate limiting, CORS
- Not covered - Gaps to close before public exposure
- Further hardening - JWT/OAuth2, edge rate limiting, Cloudflare
Real-time fanout already rides on Postgres
LISTEN/NOTIFY, which works across multiple API replicas as-is — each replica listens and fans out to its own clients. The only multi-replica-unsafe piece is the in-process rate limiter. See WebSocket Scalability.
Quick links:
- How fanout works - Scraper → Postgres NOTIFY → per-replica LISTEN → local clients
- Multiple replicas - Fanout already works; the rate limiter is the open item
- Scaling ceiling - When (and only when) a Redis/Kafka bus would be warranted
For complete API documentation including all endpoints, request/response formats, and examples, see API Reference.
Quick links:
- Topics - List, add, and delete topics
- News - Get news entries for topics with pagination
- Logs - View scraper logs
- WebSocket - Real-time news updates via WebSocket
For database access, common SQL queries, backup, and restore instructions, see Database Access.
Quick links:
- psql access - Connect to PostgreSQL database
- Common queries - Topics, news entries, scraper logs, and table sizes
- Backup & restore - Database backup and restore procedures
MIT