eceee News Fulltext Fetch

Core Goal

Discover news article URLs from https://www.eceee.org/all-news/.
Persist discovered entry metadata into SQLite.
Fetch and extract article body text from each entry page.
Persist status and text in a companion table (entry_content) with retry-safe updates.

Triggering Conditions

A request to extract full text from eceee news archive pages.
A request to run an incremental fulltext sync for eceee news links.
A need for a resilient local SQLite queue covering discovery, extraction, and retries.

Workflow

Initialize database.

export ECEEE_NEWS_DB_PATH="/absolute/path/to/eceee_news.db"
python3 scripts/fulltext_fetch.py init-db --db "$ECEEE_NEWS_DB_PATH"

Discover links and fetch fulltext incrementally.

python3 scripts/fulltext_fetch.py sync \
  --db "$ECEEE_NEWS_DB_PATH" \
  --index-url "https://www.eceee.org/all-news/" \
  --limit 50 \
  --min-chars 180

Discover only (refresh URL catalog without fetching bodies).

python3 scripts/fulltext_fetch.py sync \
  --db "$ECEEE_NEWS_DB_PATH" \
  --discover-only

Fetch one entry on demand.

python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$ECEEE_NEWS_DB_PATH" \
  --entry-id 123

Or by URL:

python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$ECEEE_NEWS_DB_PATH" \
  --url "https://www.eceee.org/all-news/news/example-slug/"

Inspect stored state.

python3 scripts/fulltext_fetch.py list-entries --db "$ECEEE_NEWS_DB_PATH" --limit 100
python3 scripts/fulltext_fetch.py list-content --db "$ECEEE_NEWS_DB_PATH" --status ready --limit 100
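
For a quick health check outside the CLI, the database can also be read directly. The following is a minimal sketch, assuming only that the entry_content table and its status column exist as described in the Data Contract; the helper names open_readonly and status_counts are hypothetical and not part of scripts/fulltext_fetch.py.

```python
import sqlite3


def open_readonly(db_path: str) -> sqlite3.Connection:
    """Open the SQLite file read-only so inspection cannot modify sync state."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def status_counts(conn: sqlite3.Connection) -> dict:
    """Return a mapping such as {'failed': 3, 'ready': 120} from entry_content."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM entry_content GROUP BY status ORDER BY status"
    )
    return dict(rows.fetchall())


# Usage (path taken from the ECEEE_NEWS_DB_PATH variable exported above):
#   import os
#   conn = open_readonly(os.environ["ECEEE_NEWS_DB_PATH"])
#   print(status_counts(conn))
```

Opening the file in read-only mode keeps an inspection session from interfering with a sync that may be running at the same time.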

Data Contract

entries table stores discovery metadata:
- url, title, published_at
- discovered_at, last_seen_at
entry_content table stores extraction result (one row per entry_id):
- source_url, final_url, http_status
- extractor (trafilatura, html-parser, or none)
- content_text, content_hash, content_length
- status (ready or failed)
- retry fields + timestamps
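
A minimal sketch of how the two tables might be laid out, using Python's sqlite3 module. The column names listed above come from this document; the column types, the id primary key, the CHECK constraints, and the names retry_count, last_error, and updated_at are assumptions made for illustration. references/schema.md remains the authoritative definition.

```python
import sqlite3

# Hypothetical DDL: names from the Data Contract, types and keys assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entries (
    id            INTEGER PRIMARY KEY,   -- assumed surrogate key (the entry_id)
    url           TEXT NOT NULL UNIQUE,
    title         TEXT,
    published_at  TEXT,
    discovered_at TEXT NOT NULL,
    last_seen_at  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id       INTEGER PRIMARY KEY REFERENCES entries(id),  -- one row per entry
    source_url     TEXT NOT NULL,
    final_url      TEXT,
    http_status    INTEGER,
    extractor      TEXT CHECK (extractor IN ('trafilatura', 'html-parser', 'none')),
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    status         TEXT NOT NULL CHECK (status IN ('ready', 'failed')),
    retry_count    INTEGER NOT NULL DEFAULT 0,  -- assumed name
    next_retry_at  TEXT,
    last_error     TEXT,                        -- assumed name
    updated_at     TEXT                         -- assumed name
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create both tables if they do not exist yet."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
```

Using entry_id as the primary key of entry_content is one way to enforce the one-row-per-entry rule at the database level.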

Extraction and Update Rules

Discovery reads https://www.eceee.org/all-news/ and collects anchor tags with class newslink whose links fall under /all-news/news/.
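
The discovery rule can be sketched with the standard-library HTML parser. This is an illustration under assumptions, not the code in scripts/fulltext_fetch.py: the class NewsLinkParser and the function discover_links are hypothetical names, and using the anchor text as the title is a guess.

```python
from __future__ import annotations

from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

INDEX_URL = "https://www.eceee.org/all-news/"


class NewsLinkParser(HTMLParser):
    """Collect (url, title) pairs from <a class="newslink"> anchors."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[tuple[str, str]] = []
        self._href: str | None = None   # URL of the anchor currently open
        self._text: list[str] = []      # text fragments inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        href = attr.get("href")
        if "newslink" in classes and href:
            url = urljoin(self.base_url, href)
            # Keep only article pages under /all-news/news/.
            if urlparse(url).path.startswith("/all-news/news/"):
                self._href = url
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = " ".join("".join(self._text).split())
            if all(url != self._href for url, _ in self.links):
                self.links.append((self._href, title))  # skip duplicate URLs
            self._href = None


def discover_links(html: str, base_url: str = INDEX_URL) -> list[tuple[str, str]]:
    """Return unique article links in document order."""
    parser = NewsLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Resolving each href against the index URL before filtering means relative and absolute links are treated the same way.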
Fulltext extraction targets the article's main content region (mainContentColumn) and removes related-news and share blocks.
Extraction path:
1. trafilatura (if installed and not disabled)
2. built-in HTML parser fallback
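
The two-step extraction path can be sketched as follows. Several details are assumptions: mainContentColumn is matched as either an id or a class token, the substrings "related" and "share" stand in for whatever markers the real script uses to drop those blocks, trafilatura is run on the whole page, and an empty trafilatura result falls through to the parser. The fallback also assumes well-nested markup (every non-void tag is closed).

```python
from __future__ import annotations

from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}
SKIP_CLASS_HINTS = ("related", "share")  # assumed markers, not confirmed by the source


class MainContentParser(HTMLParser):
    """Collect text inside mainContentColumn, ignoring skipped sub-blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0       # nesting depth inside the main region (0 = outside)
        self.skip_depth = 0  # nesting depth inside a skipped block (0 = not skipping)
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        marker = f"{attr.get('id') or ''} {attr.get('class') or ''}"
        if self.depth == 0:
            if "mainContentColumn" in marker.split():
                self.depth = 1
            return
        self.depth += 1
        if self.skip_depth:
            self.skip_depth += 1
        elif tag in SKIP_TAGS or any(h in marker.lower() for h in SKIP_CLASS_HINTS):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.depth and not self.skip_depth and data.strip():
            self.chunks.append(" ".join(data.split()))


def extract_text(html: str, use_trafilatura: bool = True) -> tuple[str, str]:
    """Return (text, extractor); extractor is 'trafilatura', 'html-parser', or 'none'."""
    if use_trafilatura:
        try:
            import trafilatura  # optional dependency
            text = trafilatura.extract(html)
            if text and text.strip():
                return text.strip(), "trafilatura"
        except ImportError:
            pass  # not installed: fall through to the built-in parser
    parser = MainContentParser()
    parser.feed(html)
    text = "\n".join(parser.chunks)
    return (text, "html-parser") if text else ("", "none")
```

Passing use_trafilatura=False mirrors the effect of the --disable-trafilatura flag in this sketch.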
Upsert by entry_id:
- Success: set ready, write text/hash/length, reset retry counters.
- Failure with existing ready content: keep old content, update error/retry metadata.
- Failure without ready content: set failed, increment retries, set next_retry_at.
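
The three upsert rules above can be expressed with SQLite's ON CONFLICT clause (available since SQLite 3.24). This is a sketch under assumptions: the trimmed demo table, the column names retry_count and last_error, the SHA-256 hash, and the function names are all hypothetical.

```python
import hashlib
import sqlite3

# Trimmed demo table; retry_count and last_error are assumed column names.
DEMO_TABLE = """
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id       INTEGER PRIMARY KEY,
    status         TEXT NOT NULL,
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    retry_count    INTEGER NOT NULL DEFAULT 0,
    next_retry_at  TEXT,
    last_error     TEXT
)
"""


def record_success(conn: sqlite3.Connection, entry_id: int, text: str) -> None:
    """Success: set ready, write text/hash/length, reset retry state."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()  # hash choice assumed
    conn.execute(
        """
        INSERT INTO entry_content
            (entry_id, status, content_text, content_hash, content_length,
             retry_count, next_retry_at, last_error)
        VALUES (?, 'ready', ?, ?, ?, 0, NULL, NULL)
        ON CONFLICT(entry_id) DO UPDATE SET
            status = 'ready',
            content_text = excluded.content_text,
            content_hash = excluded.content_hash,
            content_length = excluded.content_length,
            retry_count = 0,
            next_retry_at = NULL,
            last_error = NULL
        """,
        (entry_id, text, digest, len(text)),
    )


def record_failure(conn: sqlite3.Connection, entry_id: int, error: str,
                   next_retry_at: str) -> None:
    """Failure: a new row becomes 'failed'; an existing row keeps its status
    and content, so previously ready text survives, and only the error and
    retry metadata change."""
    conn.execute(
        """
        INSERT INTO entry_content
            (entry_id, status, retry_count, next_retry_at, last_error)
        VALUES (?, 'failed', 1, ?, ?)
        ON CONFLICT(entry_id) DO UPDATE SET
            retry_count = retry_count + 1,
            next_retry_at = excluded.next_retry_at,
            last_error = excluded.last_error
        """,
        (entry_id, next_retry_at, error),
    )
```

Because the failure path never touches status or the content columns of an existing row, a single statement covers both failure rules.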

Configurable Parameters

--db
ECEEE_NEWS_DB_PATH (environment variable holding the database path, passed to --db in the Workflow examples)
--index-url
--discover-only
--limit
--force
--only-failed
--since-date
--refetch-days
--oldest-first
--timeout
--max-bytes
--min-chars
--max-retries
--retry-backoff-minutes
--user-agent
--disable-trafilatura
--fail-on-errors

Error Handling

An index fetch or parse failure returns an actionable error.
HTTP, network, and content-type failures are recorded per entry and do not stop the rest of the sync batch.
Extracted text shorter than --min-chars is treated as a failure to avoid storing low-quality bodies.
The retry queue is governed by a retry limit and exponential backoff (see --max-retries and --retry-backoff-minutes).
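
The short-text check and the retry schedule might look like the sketch below. The exact formula is not given in this document, so doubling a base delay on each failure and stopping once the retry limit is reached are assumptions, as are the function names.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def is_too_short(text: str, min_chars: int) -> bool:
    """Treat extracted text shorter than min_chars as a failed extraction."""
    return len(text.strip()) < min_chars


def next_retry_at(retry_count: int, backoff_minutes: int, max_retries: int,
                  now: datetime | None = None) -> datetime | None:
    """Return the next retry time, or None once the retry limit is reached.

    retry_count is the number of failures recorded so far (at least 1).
    Assumed schedule: base, 2 x base, 4 x base, ...
    """
    if retry_count >= max_retries:
        return None
    now = now or datetime.now(timezone.utc)
    delay = timedelta(minutes=backoff_minutes * 2 ** max(retry_count - 1, 0))
    return now + delay
```

With a 30-minute base and a limit of 3, this schedule waits 30 minutes after the first failure, 60 minutes after the second, and gives up after the third.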

References

references/schema.md
references/fetch-rules.md

Assets

assets/config.example.json

Scripts

scripts/fulltext_fetch.py
