eceee-news-fulltext-fetch

Discover article URLs from https://www.eceee.org/all-news/ and extract/persist full article text into SQLite with retry-safe incremental sync. Use when building or maintaining an eceee news fulltext corpus for downstream search, indexing, or summarization.

Install skill "eceee-news-fulltext-fetch" with this command: npx skills add tiangong-ai/skills/tiangong-ai-skills-eceee-news-fulltext-fetch

eceee News Fulltext Fetch

Core Goal

  • Discover news article URLs from https://www.eceee.org/all-news/.
  • Persist discovered entry metadata into SQLite.
  • Fetch and extract article body text from each entry page.
  • Persist status and text in a companion table (entry_content) with retry-safe updates.
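The discovery goal can be sketched with the standard-library HTML parser. This is a minimal illustration, not the code in scripts/fulltext_fetch.py: it relies on the rule stated under Extraction and Update Rules (anchors with class newslink under /all-news/news/), and the names NewsLinkParser and discover_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

INDEX_URL = "https://www.eceee.org/all-news/"


class NewsLinkParser(HTMLParser):
    """Collect hrefs of <a class="newslink"> anchors under /all-news/news/.

    Hypothetical helper; the real script may parse the index differently.
    """

    def __init__(self, base_url=INDEX_URL):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        href = attr.get("href")
        if "newslink" not in classes or not href:
            return
        # Resolve relative links against the index page URL.
        url = urljoin(self.base_url, href)
        # Keep only article pages and skip duplicates, preserving order.
        if urlparse(url).path.startswith("/all-news/news/") and url not in self.urls:
            self.urls.append(url)


def discover_links(index_html, base_url=INDEX_URL):
    """Return de-duplicated article URLs found in the index HTML."""
    parser = NewsLinkParser(base_url)
    parser.feed(index_html)
    return parser.urls
```

Feeding the parser the already-downloaded index HTML keeps network access separate from parsing, which makes the discovery step easy to test offline.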

Triggering Conditions

  • Receive a request to extract full text from eceee news archive pages.
  • Receive a request to run incremental fulltext sync for eceee news links.
  • Need a resilient local SQLite queue for discovery + extraction + retries.

Workflow

  1. Initialize database.
export ECEEE_NEWS_DB_PATH="/absolute/path/to/eceee_news.db"
python3 scripts/fulltext_fetch.py init-db --db "$ECEEE_NEWS_DB_PATH"
  2. Discover links and fetch fulltext incrementally.
python3 scripts/fulltext_fetch.py sync \
  --db "$ECEEE_NEWS_DB_PATH" \
  --index-url "https://www.eceee.org/all-news/" \
  --limit 50 \
  --min-chars 180
  3. Discover only (refresh URL catalog without fetching bodies).
python3 scripts/fulltext_fetch.py sync \
  --db "$ECEEE_NEWS_DB_PATH" \
  --discover-only
  4. Fetch one entry on demand, by entry ID:
python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$ECEEE_NEWS_DB_PATH" \
  --entry-id 123

Or by URL:

python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$ECEEE_NEWS_DB_PATH" \
  --url "https://www.eceee.org/all-news/news/example-slug/"
  5. Inspect stored state.
python3 scripts/fulltext_fetch.py list-entries --db "$ECEEE_NEWS_DB_PATH" --limit 100
python3 scripts/fulltext_fetch.py list-content --db "$ECEEE_NEWS_DB_PATH" --status ready --limit 100
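For scheduled runs, the sync commands above could be wrapped in a small Python driver. The sketch below only assembles and runs the argument lists shown in the workflow; sync_command and run_sync are hypothetical helper names, and the defaults simply mirror the example values above rather than the script's own defaults, which are not documented here.

```python
import os
import subprocess

SCRIPT = "scripts/fulltext_fetch.py"


def sync_command(db_path, limit=50, min_chars=180, discover_only=False):
    """Build the argv for the documented `sync` subcommand."""
    cmd = ["python3", SCRIPT, "sync", "--db", db_path]
    if discover_only:
        # Refresh the URL catalog without fetching article bodies.
        return cmd + ["--discover-only"]
    return cmd + [
        "--index-url", "https://www.eceee.org/all-news/",
        "--limit", str(limit),
        "--min-chars", str(min_chars),
    ]


def run_sync(db_path=None, **kwargs):
    """Run one sync; raises CalledProcessError on a non-zero exit code."""
    db_path = db_path or os.environ["ECEEE_NEWS_DB_PATH"]
    return subprocess.run(sync_command(db_path, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with database paths that contain spaces.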

Data Contract

  • entries table stores discovery metadata:
    • url, title, published_at
    • discovered_at, last_seen_at
  • entry_content table stores extraction result (one row per entry_id):
    • source_url, final_url, http_status
    • extractor (trafilatura, html-parser, or none)
    • content_text, content_hash, content_length
    • status (ready or failed)
    • retry fields + timestamps
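The contract above might map onto SQLite roughly as follows. This is an assumed schema for illustration only; the authoritative definition is in references/schema.md. Column types, constraints, the id primary key, and the columns retry_count, last_error, and updated_at are guesses, since the contract names them only as "retry fields + timestamps".

```python
import sqlite3

# Illustrative schema; see references/schema.md for the real one.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entries (
    id            INTEGER PRIMARY KEY,
    url           TEXT NOT NULL UNIQUE,
    title         TEXT,
    published_at  TEXT,
    discovered_at TEXT NOT NULL,
    last_seen_at  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id       INTEGER PRIMARY KEY REFERENCES entries(id),
    source_url     TEXT NOT NULL,
    final_url      TEXT,
    http_status    INTEGER,
    extractor      TEXT CHECK (extractor IN ('trafilatura', 'html-parser', 'none')),
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    status         TEXT NOT NULL CHECK (status IN ('ready', 'failed')),
    retry_count    INTEGER NOT NULL DEFAULT 0,  -- assumed column name
    next_retry_at  TEXT,
    last_error     TEXT,                        -- assumed column name
    updated_at     TEXT                         -- assumed column name
);
"""


def open_db(path=":memory:"):
    """Open a database and create both tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Using entry_id as the primary key of entry_content enforces the one-row-per-entry rule at the database level.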

Extraction and Update Rules

  • Discovery reads https://www.eceee.org/all-news/ and collects anchor tags with class newslink whose links fall under /all-news/news/.
  • Fulltext extraction targets the article's main content region (mainContentColumn) and strips related-news and share blocks.
  • Extraction path:
    1. trafilatura (if installed and not disabled)
    2. built-in HTML parser fallback
  • Upsert by entry_id:
    • Success: set ready, write text/hash/length, reset retry counters.
    • Failure with existing ready content: keep old content, update error/retry metadata.
    • Failure without ready content: set failed, increment retries, set next_retry_at.
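The three upsert cases above can be expressed as a pure state transition on a row dictionary. This is a sketch under assumptions: apply_result, retry_count, and last_error are hypothetical names, SHA-256 is an assumed hash choice, the fixed retry delay is a placeholder, and whether a failure on an already-ready row also increments the retry counter is not stated in the rules (here it does).

```python
import hashlib
from datetime import datetime, timedelta, timezone


def apply_result(existing, text=None, error=None, now=None,
                 retry_delay=timedelta(minutes=30)):
    """Return the new entry_content row given the previous row (or None)."""
    now = now or datetime.now(timezone.utc)
    row = dict(existing or {"status": None, "content_text": None, "retry_count": 0})
    if error is None:
        # Success: store text, hash and length; reset retry state.
        row.update(
            status="ready",
            content_text=text,
            content_hash=hashlib.sha256(text.encode("utf-8")).hexdigest(),
            content_length=len(text),
            retry_count=0,
            next_retry_at=None,
            last_error=None,
        )
        return row
    # Failure: record the error and schedule a retry in every case.
    row["last_error"] = error
    row["retry_count"] = row.get("retry_count", 0) + 1
    row["next_retry_at"] = (now + retry_delay).isoformat()
    # Only rows without ready content are downgraded to failed;
    # previously extracted text is left untouched.
    if row.get("status") != "ready":
        row["status"] = "failed"
    return row
```

Keeping the old text on a transient failure means downstream search or summarization never loses a body that was already extracted successfully.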

Configurable Parameters

  • --db
  • ECEEE_NEWS_DB_PATH
  • --index-url
  • --discover-only
  • --limit
  • --force
  • --only-failed
  • --since-date
  • --refetch-days
  • --oldest-first
  • --timeout
  • --max-bytes
  • --min-chars
  • --max-retries
  • --retry-backoff-minutes
  • --user-agent
  • --disable-trafilatura
  • --fail-on-errors

Error Handling

  • An index fetch or parse failure returns an actionable error.
  • HTTP, network, and content-type failures are recorded per entry and do not stop the whole sync batch.
  • Extracted text shorter than --min-chars is treated as a failed extraction so that low-quality bodies are not stored.
  • The retry queue is governed by max_retries (--max-retries) and exponential backoff.
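The length check and the retry schedule could look like the sketch below. The doubling formula, the default values, and the function names are assumptions for illustration; the source states only that backoff is exponential and bounded by max_retries, and does not say how --retry-backoff-minutes enters the calculation.

```python
from datetime import datetime, timedelta, timezone


def is_acceptable(text, min_chars=180):
    """Reject bodies shorter than the --min-chars threshold.

    Stripping whitespace before measuring is an assumption.
    """
    return len(text.strip()) >= min_chars


def next_retry_at(retry_count, now, base_minutes=30, max_retries=5):
    """Return when to retry next, or None once max_retries is exhausted.

    The delay doubles with each failed attempt: base, 2*base, 4*base, ...
    retry_count is the number of failures recorded so far (1 after the first).
    """
    if retry_count >= max_retries:
        return None
    exponent = max(retry_count - 1, 0)
    return now + timedelta(minutes=base_minutes * 2 ** exponent)
```

Returning None once the limit is reached gives the sync loop a simple way to skip entries that should no longer be retried automatically.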

References

  • references/schema.md
  • references/fetch-rules.md

Assets

  • assets/config.example.json

Scripts

  • scripts/fulltext_fetch.py
