sustainability-fulltext-fetch

Fetch and persist content for DOI-keyed sustainability RSS entries into a separate fulltext SQLite DB, using OpenAlex/Semantic Scholar API metadata first and webpage fulltext extraction as a fallback. Use when building resilient DOI-first content enrichment after relevance labeling.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "sustainability-fulltext-fetch" with this command: npx skills add fadeloo/skills/fadeloo-skills-sustainability-fulltext-fetch

Sustainability Fulltext Fetch

Core Goal

  • Read relevant DOI entries from RSS metadata DB.
  • Write fetched content into a separate fulltext DB.
  • Process only relevant entries (is_relevant=1).
  • Prefer API metadata retrieval by DOI (OpenAlex first, Semantic Scholar fallback).
  • Fall back to webpage fulltext extraction when API metadata is unavailable.
  • Persist one content row per DOI in entry_content.
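
The read side of these goals can be sketched as a single query. This is a minimal illustration, not the script's actual code: the entries table and its columns come from the data contract, while the function name relevant_entries, the ordering, and the demo rows are assumptions.

```python
import sqlite3

def relevant_entries(conn: sqlite3.Connection, limit: int = 50) -> list:
    """Return relevant DOI-keyed entries as plain dicts (hypothetical helper)."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT doi, canonical_url, url, title FROM entries "
        "WHERE is_relevant = 1 AND doi IS NOT NULL "
        "ORDER BY doi LIMIT ?",
        (limit,),
    ).fetchall()
    return [dict(row) for row in rows]

# Demo against an in-memory stand-in for the RSS metadata DB.
rss = sqlite3.connect(":memory:")
rss.execute(
    "CREATE TABLE entries (doi TEXT PRIMARY KEY, doi_is_surrogate INTEGER,"
    " is_relevant INTEGER, canonical_url TEXT, url TEXT, title TEXT)"
)
rss.executemany(
    "INSERT INTO entries VALUES (?, 0, ?, ?, ?, ?)",
    [
        ("10.1000/a", 1, "https://example.org/a", None, "Relevant paper"),
        ("10.1000/b", 0, None, "https://example.org/b", "Off-topic paper"),
    ],
)
queue = relevant_entries(rss)
```

Only the row with is_relevant=1 ends up in queue, which is what the sync step would then iterate over.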

Triggering Conditions

  • Receive a request to enrich relevant DOI records with abstract/fulltext content.
  • Receive a request to replace webpage-first crawling with API-first enrichment.
  • Need retry-safe incremental updates without duplicate rows.

Workflow

  1. Ensure upstream DOI/relevance data exists.
     export SUSTAIN_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_rss.db"
     export SUSTAIN_FULLTEXT_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_fulltext.db"
     python3 scripts/fulltext_fetch.py init-db --content-db "$SUSTAIN_FULLTEXT_DB_PATH"
  2. Run incremental sync (API first, webpage fallback).
     python3 scripts/fulltext_fetch.py sync \
       --rss-db "$SUSTAIN_RSS_DB_PATH" \
       --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
       --limit 50 \
       --openalex-email "you@example.com" \
       --api-min-chars 80 \
       --min-chars 300
  3. Fetch one DOI on demand.
     python3 scripts/fulltext_fetch.py fetch-entry \
       --rss-db "$SUSTAIN_RSS_DB_PATH" \
       --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
       --doi "10.1038/nature12373"
  4. Inspect stored content state.
     python3 scripts/fulltext_fetch.py list-content \
       --rss-db "$SUSTAIN_RSS_DB_PATH" \
       --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
       --status ready \
       --limit 100

Data Contract

  • Reads from RSS DB entries:
    • doi, doi_is_surrogate, is_relevant, canonical_url, url, title.
  • Writes to fulltext DB entry_content (primary key doi):
    • source URL/status/extractor
    • content_kind (abstract or fulltext)
    • content_text, content_hash, content_length
    • retry fields and timestamps.
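
As a rough illustration of the write side of this contract, the sketch below creates an entry_content table keyed by doi and derives the content fields from extracted text. Only doi, content_kind, content_text, content_hash, and content_length are named in the contract; every other column name here is a guess, and so is the use of SHA-256 for content_hash. references/schema.md remains the authoritative definition.

```python
import hashlib
import sqlite3

# Hypothetical schema: columns not named in the data contract are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entry_content (
    doi            TEXT PRIMARY KEY,
    source_url     TEXT,
    status         TEXT NOT NULL,
    extractor      TEXT,
    content_kind   TEXT CHECK (content_kind IN ('abstract', 'fulltext')),
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    retry_count    INTEGER NOT NULL DEFAULT 0,
    last_error     TEXT,
    updated_at     TEXT
)
"""

def content_fields(text: str) -> dict:
    """Derive the stored content fields from extracted text."""
    normalized = text.strip()
    return {
        "content_text": normalized,
        # SHA-256 is an assumption; the real script may hash differently.
        "content_hash": hashlib.sha256(normalized.encode("utf-8")).hexdigest(),
        "content_length": len(normalized),
    }

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
fields = content_fields("  Soil carbon stocks respond to tillage.  ")
conn.execute(
    "INSERT INTO entry_content (doi, status, content_kind, content_text,"
    " content_hash, content_length) VALUES (?, 'ready', 'abstract', ?, ?, ?)",
    (
        "10.1038/nature12373",
        fields["content_text"],
        fields["content_hash"],
        fields["content_length"],
    ),
)
```

Storing a hash next to the text makes it cheap to tell whether a refetch actually changed the content.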

Extraction Priority

  1. API metadata path:
    • OpenAlex by DOI.
    • Semantic Scholar fallback by DOI.
    • If accepted (--api-min-chars), persist as content_kind=abstract.
  2. Webpage fallback path:
    • Use canonical_url then url.
    • Extract with trafilatura when available, else built-in HTML parser.
    • Persist as content_kind=fulltext.
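
The priority order above can be sketched as a small resolver that takes its fetchers as arguments, so no network access is needed to follow the logic. The names resolve_content and abstract_from_inverted_index are hypothetical, and the sketch assumes that --min-chars is the acceptance threshold for webpage text. The inverted-index helper reflects how OpenAlex serves abstracts (as an abstract_inverted_index mapping each word to its positions); real fetchers would wrap the OpenAlex and Semantic Scholar HTTP APIs.

```python
from typing import Callable, Iterable, Optional, Tuple

Fetcher = Callable[[str], Optional[str]]  # takes a DOI or URL, returns text or None

def abstract_from_inverted_index(index: dict) -> str:
    """Rebuild plain text from a word -> positions mapping."""
    positions = [(pos, word) for word, hits in index.items() for pos in hits]
    return " ".join(word for _, word in sorted(positions))

def _try(fetch: Fetcher, arg: str) -> str:
    """Call a fetcher and treat any error or empty result as a miss."""
    try:
        return (fetch(arg) or "").strip()
    except Exception:
        return ""

def resolve_content(
    doi: str,
    urls: Iterable[Optional[str]],
    api_fetchers: Iterable[Tuple[str, Fetcher]],
    page_fetcher: Fetcher,
    api_min_chars: int = 80,
    min_chars: int = 300,
) -> Optional[dict]:
    # 1. API metadata path: first provider with a long enough abstract wins.
    for name, fetch in api_fetchers:
        text = _try(fetch, doi)
        if len(text) >= api_min_chars:
            return {"content_kind": "abstract", "extractor": name,
                    "content_text": text}
    # 2. Webpage fallback path: canonical_url first, then url.
    for url in urls:
        if not url:
            continue
        text = _try(page_fetcher, url)
        if len(text) >= min_chars:
            return {"content_kind": "fulltext", "extractor": "webpage",
                    "source_url": url, "content_text": text}
    return None  # nothing acceptable; the caller records a failure
```

A caller would pass api_fetchers in priority order, for example [("openalex", ...), ("semantic_scholar", ...)], and urls as (canonical_url, url).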

Update Semantics

  • Upsert key: doi.
  • Success: set status=ready and reset retry counters.
  • Failure with existing ready row: keep old content, record latest error.
  • Failure without ready row: set status=failed, increment retry state.

Configurable Parameters

  • --rss-db
  • --content-db
  • SUSTAIN_RSS_DB_PATH
  • SUSTAIN_FULLTEXT_DB_PATH
  • --limit
  • --force
  • --only-failed
  • --refetch-days
  • --timeout
  • --max-bytes
  • --min-chars
  • --openalex-email / OPENALEX_EMAIL
  • --s2-api-key / S2_API_KEY
  • --api-timeout
  • --api-min-chars
  • --disable-api-metadata
  • --max-retries
  • --retry-backoff-minutes
  • --user-agent
  • --disable-trafilatura
  • --fail-on-errors

Error Handling

  • Missing DOI-keyed entries table: stop with actionable message.
  • RSS DB and fulltext DB path collision: fail fast and require separate files.
  • API/network/HTTP failures: record the failure and continue with the rest of the queue.
  • Webpage non-text content: mark failed for that DOI.
  • Short extraction: fail by threshold to avoid low-quality content.
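
Two of these rules lend themselves to a short sketch: failing fast on a path collision, and recording per-DOI failures without aborting the queue. The function names ensure_separate_dbs and process_queue are hypothetical, and raising ValueError is just one possible way to stop with an actionable message.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple

def ensure_separate_dbs(rss_db: str, content_db: str) -> None:
    """Fail fast when both paths resolve to the same file."""
    if Path(rss_db).resolve() == Path(content_db).resolve():
        raise ValueError(
            f"RSS DB and fulltext DB must be separate files, got {rss_db!r} for both"
        )

def process_queue(
    dois: Iterable[str], fetch_one: Callable[[str], object]
) -> Tuple[List[str], Dict[str, str]]:
    """Run fetch_one per DOI; record failures and keep going."""
    done: List[str] = []
    failed: Dict[str, str] = {}
    for doi in dois:
        try:
            fetch_one(doi)
            done.append(doi)
        except Exception as exc:  # API, network, or HTTP errors in practice
            failed[doi] = str(exc)
    return done, failed
```

Comparing resolved paths rather than raw strings also catches collisions hidden behind relative paths or symlinks.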

References

  • references/schema.md
  • references/fetch-rules.md

Assets

  • assets/config.example.json

Scripts

  • scripts/fulltext_fetch.py
