benchmark-e2e

Single-command pipeline that creates projects, exercises skill injection via claude --print, launches dev servers, verifies they work, analyzes conversation logs, and generates actionable improvement reports.

Install skill "benchmark-e2e" with this command: npx skills add vercel-labs/vercel-plugin/vercel-labs-vercel-plugin-benchmark-e2e

Benchmark E2E


Quick Start

Full suite (9 projects, ~2-3 hours)

bun run scripts/benchmark-e2e.ts

Quick mode (first 3 projects, ~30-45 min)

bun run scripts/benchmark-e2e.ts --quick

Options:

Flag            Description                                 Default
--quick         Run only first 3 projects                   false
--base <path>   Override base directory                     ~/dev/vercel-plugin-testing
--timeout <ms>  Per-project timeout (forwarded to runner)   900000 (15 min)

Pipeline Stages

The orchestrator chains four stages sequentially, aborting on failure:

  • runner — Creates test dirs, installs plugin, runs claude --print with VERCEL_PLUGIN_LOG_LEVEL=trace

  • verify — Detects package manager, launches dev server, polls for 200 with non-empty HTML

  • analyze — Matches JSONL sessions to projects via run-manifest.json, extracts metrics

  • report — Generates report.md and report.json with scorecards and recommendations
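The abort-on-failure chaining above can be reduced to a pure helper — a minimal TypeScript sketch, not the orchestrator's actual code (StageResult and firstFailedStage are illustrative names):

```typescript
// One entry per stage, in pipeline order, with the exit code it returned.
type StageResult = {
  stage: "runner" | "verify" | "analyze" | "report";
  exitCode: number;
};

// Scan stage results in order; the first non-zero exit code is where the
// orchestrator would abort, so later stages never run.
function firstFailedStage(results: StageResult[]): StageResult | null {
  for (const result of results) {
    if (result.exitCode !== 0) return result;
  }
  return null; // every stage that ran succeeded
}

// Example: verify fails, so analyze and report would be skipped.
console.log(
  firstFailedStage([
    { stage: "runner", exitCode: 0 },
    { stage: "verify", exitCode: 1 },
  ])?.stage
); // "verify"
```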

Contracts

run-manifest.json

Written by the runner at <base>/results/run-manifest.json. Links all downstream stages to the same run.

interface BenchmarkRunManifest {
  runId: string;      // UUID for this pipeline run
  timestamp: string;  // ISO 8601
  baseDir: string;    // Absolute path to base directory
  projects: Array<{
    slug: string;        // e.g. "01-recipe-platform"
    cwd: string;         // Absolute path to project dir
    promptHash: string;  // SHA hash of the prompt text
    expectedSkills: string[];
  }>;
}

The analyzer and verifier read this manifest to correlate sessions precisely instead of guessing from directory listings.
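Correlating by cwd could look something like this — a hedged sketch, assuming only the manifest shape above (matchSession and the sample values are illustrative, not the analyzer's real API):

```typescript
// Interfaces mirroring BenchmarkRunManifest from the contract above.
interface ManifestProject {
  slug: string;
  cwd: string;
  promptHash: string;
  expectedSkills: string[];
}

interface BenchmarkRunManifest {
  runId: string;
  timestamp: string;
  baseDir: string;
  projects: ManifestProject[];
}

// Exact-path match: each JSONL session records the cwd it ran in, so the
// analyzer can pair it with a project without guessing from directory listings.
function matchSession(
  manifest: BenchmarkRunManifest,
  sessionCwd: string
): ManifestProject | undefined {
  return manifest.projects.find((p) => p.cwd === sessionCwd);
}

// In the real pipeline the manifest is loaded from
// <base>/results/run-manifest.json; here a minimal inline example:
const manifest: BenchmarkRunManifest = {
  runId: "00000000-0000-0000-0000-000000000000",
  timestamp: "2024-01-01T00:00:00Z",
  baseDir: "/tmp/vercel-plugin-testing",
  projects: [
    {
      slug: "01-recipe-platform",
      cwd: "/tmp/vercel-plugin-testing/01-recipe-platform",
      promptHash: "abc123",
      expectedSkills: ["auth", "vercel-storage", "nextjs"],
    },
  ],
};

console.log(
  matchSession(manifest, "/tmp/vercel-plugin-testing/01-recipe-platform")?.slug
); // "01-recipe-platform"
```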

events.jsonl

The orchestrator writes NDJSON events to <base>/results/events.jsonl tracking pipeline lifecycle:

// Each line is one JSON object:
{ "stage": "pipeline", "event": "start", "timestamp": "...", "data": { "baseDir": "...", "quick": false } }
{ "stage": "runner", "event": "start", "timestamp": "...", "data": { "script": "...", "args": [...] } }
{ "stage": "runner", "event": "complete", "timestamp": "...", "data": { "exitCode": 0, "durationMs": 120000 } }

// On failure:
{ "stage": "verify", "event": "error", "timestamp": "...", "data": { "exitCode": 1, "durationMs": 5000, "slug": "04-conference-tickets" } }
{ "stage": "pipeline", "event": "abort", "timestamp": "...", "data": { "failedStage": "verify", "exitCode": 1, "slug": "04-conference-tickets" } }
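Serializing one event per NDJSON line amounts to compact JSON.stringify per record — a minimal sketch matching the { stage, event, timestamp, data } shape above (formatEvent is an illustrative helper, not the orchestrator's actual function):

```typescript
type PipelineEvent = {
  stage: string;
  event: "start" | "complete" | "error" | "abort";
  timestamp: string;
  data: Record<string, unknown>;
};

function formatEvent(
  stage: string,
  event: PipelineEvent["event"],
  data: Record<string, unknown>
): string {
  const line: PipelineEvent = {
    stage,
    event,
    timestamp: new Date().toISOString(),
    data,
  };
  // One compact JSON object per line, no pretty-printing, so every line
  // of events.jsonl stays independently parseable.
  return JSON.stringify(line);
}

// The real file lives at <base>/results/events.jsonl; appending one line
// would look like: fs.appendFileSync(path, formatEvent(...) + "\n")
console.log(formatEvent("runner", "start", { script: "benchmark-e2e.ts" }));
```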

report.json

Machine-readable report at <base>/results/report.json for programmatic consumption:

interface ReportJson {
  runId: string | null;
  timestamp: string;
  verdict: "pass" | "partial" | "fail";
  gaps: Array<{
    slug: string;
    expected: string[];
    actual: string[];
    missing: string[];
  }>;
  recommendations: string[];
  suggestedPatterns: Array<{
    skill: string;  // Skill that was expected but not injected
    glob: string;   // Suggested pathPattern glob
    tool: string;   // Tool name that should trigger injection
  }>;
}
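A consumer of report.json might summarize the gaps like this — a sketch only, with sample values that are illustrative rather than real report output:

```typescript
// Shape of one entry in ReportJson.gaps, per the contract above.
interface ReportGap {
  slug: string;
  expected: string[];
  actual: string[];
  missing: string[];
}

// Produce one human-readable line per project that is missing skills.
function summarizeGaps(gaps: ReportGap[]): string[] {
  return gaps
    .filter((gap) => gap.missing.length > 0)
    .map((gap) => `${gap.slug}: missing ${gap.missing.join(", ")}`);
}

console.log(
  summarizeGaps([
    {
      slug: "04-conference-tickets",
      expected: ["payments", "email", "auth"],
      actual: ["auth"],
      missing: ["payments", "email"],
    },
  ])
); // one line: "04-conference-tickets: missing payments, email"
```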

Overnight Automation Loop

Run the pipeline repeatedly with a cooldown between iterations:

while true; do
  bun run scripts/benchmark-e2e.ts
  sleep 3600
done

Each run produces timestamped report.json and report.md files. Compare across runs to track improvement.

Self-Improvement Cycle

The pipeline enables a closed feedback loop:

  • Run — bun run scripts/benchmark-e2e.ts exercises the plugin against realistic projects

  • Read gaps — report.json lists which skills were expected but never injected, with exact slugs

  • Apply fixes — Use suggestedPatterns entries (copy-pasteable YAML) to add missing frontmatter patterns; use recommendations to fix hook logic

  • Re-run — Execute the pipeline again to verify the gaps are closed

  • Compare — Diff report.json across runs: verdict should trend from "fail" → "partial" → "pass"
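The verdict diff in the last step can be sketched by ordering verdicts — a hedged illustration (trend is a hypothetical helper, not part of the pipeline; the verdict values come from ReportJson):

```typescript
type Verdict = "pass" | "partial" | "fail";

// Rank verdicts so two runs can be compared for direction:
// "fail" -> "partial" -> "pass" is the desired trend.
const rank: Record<Verdict, number> = { fail: 0, partial: 1, pass: 2 };

function trend(
  previous: Verdict,
  current: Verdict
): "improved" | "regressed" | "unchanged" {
  if (rank[current] > rank[previous]) return "improved";
  if (rank[current] < rank[previous]) return "regressed";
  return "unchanged";
}

console.log(trend("fail", "partial")); // "improved"
```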

For overnight automation, combine with the loop above. Wake up to reports showing exactly what improved and what still needs work.

Prompt Table

Prompts never name specific technologies — they describe the product and features, letting the plugin infer which skills to inject.

Slug                   Expected Skills
01-recipe-platform     auth, vercel-storage, nextjs
02-trivia-game         vercel-storage, nextjs
03-code-review-bot     ai-sdk, nextjs
04-conference-tickets  payments, email, auth
05-content-aggregator  cron-jobs, ai-sdk
06-finance-tracker     cron-jobs, email
07-multi-tenant-blog   routing-middleware, cms, auth
08-status-page         cron-jobs, vercel-storage, observability
09-dog-walking-saas    payments, auth, vercel-storage, env-vars

Cleanup

rm -rf ~/dev/vercel-plugin-testing
