When to Use This Skill
Activate when the user wants to obtain data from a website:
- "Extract all product prices from this page"
- "Scrape the table of results from ..."
- "Pull the list of authors and titles from arXiv search results"
- "Collect all job listings from this page"
- "Get the data from this dashboard table"
- "Harvest review scores from ..."
- "Download all the links/images/cards from ..."
The deliverable is always two artifacts:
- Executable Playwright script — a standalone
.cjsfile that reproduces the extraction without Actionbook at runtime. - Extracted data — JSON (default), CSV, or user-specified format written to disk.
Decision Strategy
Use Actionbook as a conditional accelerator, not a mandatory step. The goal is reliable selectors in the shortest path.
User request
│
├─► actionbook search "<site> <intent>"
│ ├─ Results with Health Score ≥ 70% ──► actionbook get "<ID>" ──► use selectors
│ └─ No results / low score ──► Fallback
│
└─► Fallback: actionbook browser open <url>
├─ actionbook browser snapshot (accessibility tree → find selectors)
├─ actionbook browser screenshot (visual confirmation)
└─ manual selector discovery via DOM inspection
Priority order for selector sources:
| Priority | Source | When |
|---|---|---|
| 1 | actionbook get | Site is indexed, health score ≥ 70% |
| 2 | actionbook browser snapshot | Not indexed or selectors outdated |
| 3 | DOM inspection via screenshot + snapshot | Complex SPA / dynamic content |
Non-negotiable rule: if search + get already provides usable selectors for required fields, start from get selectors and do not jump to full fallback (snapshot/screenshot) by default. Exception: lightweight mechanism probes (for hydration/virtualization/pagination) are allowed when runtime behavior may affect script correctness. Escalate to snapshot/screenshot only when probes/sample validation indicate selector gaps or instability.
Mechanism-Aware Script Strategy
Websites use patterns that break naive scraping. The generated Playwright script must account for these:
Streaming / SSR / RSC hydration
Pages may render a shell first, then stream or hydrate content.
// Wait for hydration to complete — not just DOMContentLoaded
await page.waitForSelector('[data-item]', { state: 'attached' });
await page.waitForFunction(() => {
const items = document.querySelectorAll('[data-item]');
return items.length > 0 && !document.querySelector('[data-pending]');
});
Detection cues: React root with data-reactroot, Next.js __NEXT_DATA__, empty containers that fill after JS runs. If actionbook browser text "<selector>" returns empty but the screenshot shows content, hydration hasn't completed.
Virtualized lists / virtual DOM
Only visible rows exist in the DOM. Scrolling renders new rows and destroys old ones.
// Scroll-and-collect loop for virtualized lists (scroll container aware)
const allItems = [];
const maxScrolls = 50;
let scrolls = 0;
const container = await page.$('<scroll-container-selector>');
if (!container) throw new Error('Scroll container not found');
let previousTop = await container.evaluate(el => el.scrollTop);
while (scrolls < maxScrolls) {
const items = await page.$$eval('[data-row]', rows =>
rows.map(r => ({ text: r.textContent.trim() }))
);
for (const item of items) {
if (!allItems.find(i => i.text === item.text)) allItems.push(item);
}
await container.evaluate(el => el.scrollBy(0, 600));
await page.waitForTimeout(300);
const currentTop = await container.evaluate(el => el.scrollTop);
if (currentTop === previousTop) break;
previousTop = currentTop;
scrolls += 1;
}
Detection cues: Container has fixed height with overflow: auto/scroll, row count in DOM is much smaller than stated total, rows have transform: translateY(...) or position: absolute; top: ...px.
Infinite scroll / lazy loading
New content appends when the user scrolls near the bottom.
// Scroll to bottom until no new content loads (with no-growth tolerance)
let itemCount = 0;
let noGrowthStreak = 0;
const maxScrolls = 80;
let scrolls = 0;
while (scrolls < maxScrolls && noGrowthStreak < 3) {
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await page.waitForTimeout(1200);
const newCount = await page.$$eval('.item', els => els.length);
if (newCount > itemCount) {
itemCount = newCount;
noGrowthStreak = 0;
} else {
noGrowthStreak += 1;
}
scrolls += 1;
}
Detection cues: Intersection Observer in page JS, "Load more" button, sentinel element at bottom, network requests firing on scroll.
Pagination
Multi-page results behind "Next" buttons or numbered pages.
// Click-through pagination (navigation-aware, SPA-safe)
const allData = [];
const maxPages = 50;
let pageIndex = 0;
while (pageIndex < maxPages) {
const pageData = await page.$$eval('.result-item', items =>
items.map(el => ({ title: el.querySelector('h3')?.textContent?.trim() }))
);
allData.push(...pageData);
const nextBtn = await page.$('a.next-page:not([disabled])');
if (!nextBtn) break;
const previousUrl = page.url();
const previousFirstItem = await page
.$eval('.result-item', el => el.textContent?.trim() || '')
.catch(() => '');
await nextBtn.click();
// Post-click detection only: advance must be caused by this click
const advanced = await Promise.any([
page
.waitForURL(url => url.toString() !== previousUrl, { timeout: 5000 })
.then(() => true),
page
.waitForFunction(
prev => {
const first = document.querySelector('.result-item');
return !!first && (first.textContent || '').trim() !== prev;
},
previousFirstItem,
{ timeout: 5000 }
)
.then(() => true),
]).catch(() => false);
if (!advanced) break;
await page.waitForLoadState('networkidle').catch(() => {});
pageIndex += 1;
}
Execution Chain
Step 1: Understand the target
Identify from the user request:
- URL — the page to extract from
- Data shape — what fields / columns are needed
- Scope — single page, paginated, infinite scroll, or multi-page crawl
- Output format — JSON (default), CSV, or other
Step 2: Obtain selectors and choose execution path
# Try Actionbook index first
actionbook search "<site> <data-description>" --domain <domain>
# If good results (health ≥ 70%), get full selectors
actionbook get "<ID>"
Use this routing strictly:
-
Path A (default when
getis good): requested fields are covered bygetselectors and quality is acceptable.- Start from
getselectors and move to script draft quickly. - You may run lightweight mechanism probes (
browser text, quick scroll checks) before finalizing script strategy. - Do not run full fallback (
snapshot/screenshot) before first draft unless probe/sample validation shows mismatch. - Field mapping must default to
getselectors and mark source asactionbook_get.
- Start from
-
Path B (partial / unstable):
getexists but required fields are missing, selector resolves 0 elements, or validation fails.- Run targeted fallback only for failed fields/steps.
-
Path C (no usable coverage): search/get has no usable result.
- Run full fallback discovery.
Step 3: Probe page mechanisms and fallback only when needed
Path A mechanism detection timing:
- Run minimal probes either before final script draft or during sample validation.
- Before any probe command, ensure the correct page context is open:
actionbook browser open "<url>"(if current tab context is unknown/stale)
- If probes/sample run indicate mismatch (missing rows, unstable selectors, wrong pagination behavior), escalate to Path B targeted fallback.
Fallback discovery by path:
Path B targeted fallback (only failed fields/steps):
actionbook browser open "<url>" # if not already open
actionbook browser snapshot # focus on failed field/container mapping
# actionbook browser screenshot # optional visual confirmation for failed area
Path C full fallback (no usable coverage):
actionbook browser open "<url>"
actionbook browser snapshot
actionbook browser screenshot
Mechanism probes (run when script strategy needs confirmation):
# Hydration / streaming check
actionbook browser text "<container-selector>"
# Infinite scroll quick signal (explicit before/after decision)
actionbook browser eval "document.querySelectorAll('<item-selector>').length" # before
actionbook browser click "<scroll-container-selector-or-body>" # focus scroll context
actionbook browser eval "const c=document.querySelector('<scroll-container-selector>') || document.scrollingElement; c.scrollBy(0, c.clientHeight || window.innerHeight);"
actionbook browser eval "document.querySelectorAll('<item-selector>').length" # after
# If count increases, treat page as lazy-load/infinite-scroll.
Fallback trigger conditions:
actionbook getcannot map all required fields.actionbook getselectors return empty/unstable values in sample run.- Runtime behavior conflicts with expected mechanism (e.g., virtualized container, delayed hydration).
Step 4: Generate Playwright script
Write a standalone Playwright script (extract_<domain>_<slug>.cjs) that:
- Navigates to the target URL.
- Waits for the correct readiness signal (not just
load— see mechanisms above). - Handles the detected mechanism (virtual scroll, pagination, etc.).
- Extracts data into structured objects.
- Writes output to disk (
JSON.stringify/ CSV). - Closes the browser.
- Enforces guardrails (
maxPages,maxScrolls, timeout budget) to avoid infinite loops.
Script template:
// extract_<domain>_<slug>.cjs
// Generated by Actionbook extract skill
// Usage: node extract_<domain>_<slug>.cjs
const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('<URL>', { waitUntil: 'domcontentloaded' });
// -- wait for readiness --
await page.waitForSelector('<container>', { state: 'visible' });
// -- extract --
const data = await page.$$eval('<item-selector>', items =>
items.map(el => ({
// fields mapped from user request
}))
);
// -- output --
const fs = require('fs');
fs.writeFileSync('output.json', JSON.stringify(data, null, 2));
console.log(`Extracted ${data.length} items → output.json`);
await browser.close();
})();
Step 5: Execute and validate
Run the script to confirm it works:
node extract_<domain>_<slug>.cjs
Validation rules:
| Check | Pass condition |
|---|---|
| Script exits 0 | No runtime errors |
| Output file exists | Non-empty file written |
| Record count > 0 | At least one item extracted |
| No null/empty fields | Every declared field has a value in ≥ 90% of records |
| Data matches page | Spot-check first and last record against actionbook browser text |
If validation fails, inspect the output, adjust selectors or wait strategy, and re-run.
Step 6: Deliver
Present to the user:
- Script path — the
.cjsfile they can re-run anytime. - Data path — the output JSON/CSV file.
- Record count — how many items were extracted.
- Notes — any mechanism-specific caveats (e.g., "this site uses infinite scroll; the script scrolls up to 50 pages by default").
Output Contract
Every extract invocation produces:
| Artifact | Path | Format |
|---|---|---|
| Playwright script | ./extract_<domain>_<slug>.cjs | Standalone Node.js script using playwright |
| Extracted data | ./output.json (default) or user-specified path | JSON array of objects (default), CSV, or user-specified |
The script must be re-runnable — a user should be able to execute it later without Actionbook installed, as long as Node.js + Playwright are available in the runtime environment.
Selector Priority
When multiple selector types are available from actionbook get:
| Priority | Type | Reason |
|---|---|---|
| 1 | data-testid | Stable, test-oriented, rarely changes |
| 2 | aria-label | Accessibility-driven, semantically meaningful |
| 3 | CSS selector | Structural, may break on redesign |
| 4 | XPath | Last resort, most brittle |
Error Handling
| Error | Action |
|---|---|
actionbook search returns no results | Fall back to snapshot + screenshot |
| Selector returns 0 elements | Re-snapshot, compare with screenshot, update selector |
| Script times out | Add longer waitForTimeout, check for anti-bot measures |
| Partial data (some fields empty) | Check if content is lazy-loaded; add scroll/wait |
| Anti-bot / CAPTCHA | Inform user; suggest running with headless: false or using their own browser session via actionbook setup extension mode |