ai-tech-fulltext-fetch

Fetch and persist article full text for RSS entries already stored in SQLite by ai-tech-rss-fetch. Use when backfilling or incrementally syncing body text from entries.url or entries.canonical_url into a companion table for downstream indexing, retrieval, or summarization.

Install skill "ai-tech-fulltext-fetch" with this command: npx skills add fadeloo/skills/fadeloo-skills-ai-tech-fulltext-fetch

AI Tech Fulltext Fetch

Core Goal

  • Reuse the same SQLite database populated by ai-tech-rss-fetch.
  • Fetch article body text from each RSS entry URL.
  • Persist extraction status and text in a companion table (entry_content).
  • Support incremental runs and safe retries without creating duplicate fulltext rows.

Triggering Conditions

  • Receive a request to fetch article body/full text for entries already in ai_rss.db.
  • Receive a request to build a second-stage pipeline after RSS metadata sync.
  • Need a stable, resumable queue over existing entries rows.
  • Need URL-based fulltext persistence before chunking, indexing, or summarization.

Workflow

  1. Ensure metadata table exists first.
  • Run ai-tech-rss-fetch and populate entries in SQLite before using this skill.
  • This skill requires the entries table to exist.
  • In multi-agent runtimes, pin DB to the same absolute path used by ai-tech-rss-fetch:
export AI_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/ai_rss.db"
  2. Initialize fulltext table.
python3 scripts/fulltext_fetch.py init-db --db "$AI_RSS_DB_PATH"
  3. Run incremental fulltext sync.
  • Default behavior fetches rows that are missing full text or currently failed.
python3 scripts/fulltext_fetch.py sync \
  --db "$AI_RSS_DB_PATH" \
  --limit 50 \
  --timeout 20 \
  --min-chars 300
  4. Fetch one entry on demand.
python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$AI_RSS_DB_PATH" \
  --entry-id 1234
  5. Inspect extracted content state.
python3 scripts/fulltext_fetch.py list-content \
  --db "$AI_RSS_DB_PATH" \
  --status ready \
  --limit 100
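
The default sync selection described in step 3 (rows missing full text or currently failed) can be pictured as a LEFT JOIN over the two tables. This is a minimal sketch, not the query used by scripts/fulltext_fetch.py: the SQL, the pending_entries helper, and the reduced two-column entry_content table are assumptions made for illustration, and the retry cap and backoff are ignored here.

```python
import sqlite3

# Hypothetical query: entries with no entry_content row, or with a failed one.
# canonical_url is preferred; an empty or NULL value falls back to url.
PENDING_SQL = """
SELECT e.id, COALESCE(NULLIF(e.canonical_url, ''), e.url) AS source_url
FROM entries AS e
LEFT JOIN entry_content AS c ON c.entry_id = e.id
WHERE c.entry_id IS NULL OR c.status = 'failed'
ORDER BY e.id
LIMIT ?
"""


def pending_entries(conn: sqlite3.Connection, limit: int = 50) -> list:
    """Return (entry_id, source_url) pairs that still need a fetch."""
    return conn.execute(PENDING_SQL, (limit,)).fetchall()


# Tiny in-memory fixture standing in for ai_rss.db.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entries (id INTEGER PRIMARY KEY, canonical_url TEXT, url TEXT, title TEXT);
CREATE TABLE entry_content (entry_id INTEGER UNIQUE, status TEXT);
INSERT INTO entries VALUES
  (1, 'https://example.com/c1', 'https://example.com/u1', 'already ready'),
  (2, NULL, 'https://example.com/u2', 'failed earlier'),
  (3, '', 'https://example.com/u3', 'never fetched');
INSERT INTO entry_content VALUES (1, 'ready'), (2, 'failed');
""")

pending = pending_entries(conn)
print(pending)
```

With this fixture, entry 1 is skipped because it is already ready, while entries 2 and 3 are returned with their url fallback.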

Data Contract

  • Reads from existing entries table:
    • id, canonical_url, url, title.
  • Writes to entry_content table:
    • entry_id (unique, one row per entry)
    • source_url, final_url, http_status
    • extractor (trafilatura, html-parser, or none)
    • content_text, content_hash, content_length
    • status (ready or failed)
    • retry_count, last_error, next_retry_at, timestamps.
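
To make the write side of the contract concrete, here is a rough sketch of what an entry_content table with these fields might look like. The authoritative definition is in references/schema.md; the column types, CHECK constraints, and the timestamp column names (fetched_at, updated_at) below are assumptions, as is the init_entry_content helper.

```python
import sqlite3

# Hypothetical DDL inferred from the field list above. Names and types in
# references/schema.md take precedence over anything shown here.
ENTRY_CONTENT_DDL = """
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id        INTEGER NOT NULL UNIQUE,   -- one row per entry
    source_url      TEXT,
    final_url       TEXT,
    http_status     INTEGER,
    extractor       TEXT CHECK (extractor IN ('trafilatura', 'html-parser', 'none')),
    content_text    TEXT,
    content_hash    TEXT,
    content_length  INTEGER,
    status          TEXT NOT NULL CHECK (status IN ('ready', 'failed')),
    retry_count     INTEGER NOT NULL DEFAULT 0,
    last_error      TEXT,
    next_retry_at   TEXT,                      -- assumed ISO-8601 text
    fetched_at      TEXT,                      -- assumed timestamp column
    updated_at      TEXT                       -- assumed timestamp column
)
"""


def init_entry_content(conn: sqlite3.Connection) -> None:
    """Create the companion table if it does not exist yet."""
    conn.execute(ENTRY_CONTENT_DDL)
    conn.commit()


conn = sqlite3.connect(":memory:")
init_entry_content(conn)
init_entry_content(conn)  # safe to call twice thanks to IF NOT EXISTS
columns = [row[1] for row in conn.execute("PRAGMA table_info(entry_content)")]
print(columns)
```

The UNIQUE constraint on entry_id is what lets repeated runs update a row in place instead of creating duplicates.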

Extraction and Update Rules

  • URL source priority: canonical_url first, falling back to url.
  • Attempt trafilatura extraction when the dependency is available; otherwise fall back to the built-in HTML parser.
  • Upsert by entry_id:
    • Success: write/update full text and reset retry_count to 0.
    • Failure with existing ready content: keep old text, keep status ready, record last_error.
    • Failure without ready content: status becomes failed, increment retry_count, set next_retry_at.
  • Retries of failed rows are capped by --max-retries (default 3) and paced by --retry-backoff-minutes.
  • --force refetches rows that are already ready.
  • --refetch-days N allows refreshing rows older than N days.
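
The upsert rules above amount to a small state transition per entry. The sketch below models that transition as pure functions so the three cases are easy to see. It is not the script's code: ContentRow, pick_source_url, apply_result, and is_retry_due are hypothetical names, the 30-minute backoff default and the fixed (non-exponential) backoff are assumptions, and so are clearing last_error on success and treating the cap as retry_count < max_retries.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ContentRow:
    """Hypothetical in-memory view of one entry_content row."""
    status: str
    content_text: Optional[str] = None
    retry_count: int = 0
    last_error: Optional[str] = None
    next_retry_at: Optional[datetime] = None


def pick_source_url(canonical_url: Optional[str], url: Optional[str]) -> Optional[str]:
    """canonical_url wins; an empty or missing value falls back to url."""
    return canonical_url or url


def apply_result(existing: Optional[ContentRow], text: Optional[str],
                 error: Optional[str], now: datetime,
                 backoff_minutes: int = 30) -> ContentRow:
    """Compute the row to upsert after one fetch attempt."""
    if error is None:
        # Success: store the text and reset the retry counter.
        return ContentRow(status="ready", content_text=text, retry_count=0)
    if existing is not None and existing.status == "ready":
        # Failure with ready content: keep the old text, only note the error.
        return replace(existing, last_error=error)
    # Failure without ready content: mark failed and schedule a retry.
    retries = (existing.retry_count if existing else 0) + 1
    return ContentRow(status="failed", retry_count=retries, last_error=error,
                      next_retry_at=now + timedelta(minutes=backoff_minutes))


def is_retry_due(row: ContentRow, now: datetime, max_retries: int = 3) -> bool:
    """A failed row is retried only under the cap and after its backoff."""
    return (row.status == "failed"
            and row.retry_count < max_retries
            and (row.next_retry_at is None or row.next_retry_at <= now))
```

Keeping the transition separate from the SQL makes it straightforward to unit test each rule before wiring it into an INSERT ... ON CONFLICT(entry_id) DO UPDATE statement.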

Configurable Parameters

  • --db
  • AI_RSS_DB_PATH (environment variable; an absolute path is recommended in multi-agent runtimes)
  • --limit
  • --force
  • --only-failed
  • --refetch-days
  • --oldest-first
  • --timeout
  • --max-bytes
  • --min-chars
  • --max-retries
  • --retry-backoff-minutes
  • --user-agent
  • --disable-trafilatura
  • --fail-on-errors

Error Handling

  • Missing entries table: return actionable error and stop.
  • Network/HTTP/parse errors: store failure state and continue processing other entries.
  • Non-text content types (PDF/image/audio/video/zip): mark as failed for that entry.
  • Extracted text shorter than --min-chars: treat as a failure to avoid storing low-quality body text.
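
A minimal sketch of how these checks could be expressed, under stated assumptions: require_entries_table and failure_reason are hypothetical helpers, the exact media types in the deny list are illustrative readings of "PDF/image/audio/video/zip", and the error messages are invented. The per-entry function returns a reason string rather than raising, which matches the rule that one bad entry should not stop the run.

```python
import sqlite3
from typing import Optional

# Assumed deny list derived from the non-text types named above.
NON_TEXT_PREFIXES = ("image/", "audio/", "video/")
NON_TEXT_TYPES = ("application/pdf", "application/zip")


def require_entries_table(conn: sqlite3.Connection) -> None:
    """Stop early with an actionable message if entries is missing."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'entries'"
    ).fetchone()
    if row is None:
        raise RuntimeError(
            "entries table not found; run ai-tech-rss-fetch against this "
            "database before fetching full text"
        )


def failure_reason(content_type: Optional[str], text: str,
                   min_chars: int = 300) -> Optional[str]:
    """Return why an entry should be marked failed, or None if it is usable."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type.startswith(NON_TEXT_PREFIXES) or media_type in NON_TEXT_TYPES:
        return f"non-text content type: {media_type}"
    if len(text.strip()) < min_chars:
        return f"extracted text too short: {len(text.strip())} < {min_chars} chars"
    return None
```

A caller would run require_entries_table once at startup, then record any non-None failure_reason in last_error and move on to the next entry.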

References

  • references/schema.md
  • references/fetch-rules.md

Assets

  • assets/config.example.json

Scripts

  • scripts/fulltext_fetch.py
