PR Scorer

Score a pull request across three core quality dimensions — like Core Web Vitals, but for PRs. Each category is scored out of 100 with detailed justification, producing a rich scores.md report the author can act on immediately.

Scoring Categories

1. Code Quality (Always Scored)

Evaluates the craftsmanship and maintainability of the code changes. This is the baseline health check that applies to every PR regardless of what it does.

What to evaluate:

Criteria	Weight	What to look for
Readability & Naming	20%	Are variables, functions, classes named clearly and descriptively? Can someone unfamiliar with the codebase follow the logic without jumping between files? Are abbreviations avoided? Do boolean variables read as questions (`isReady`, `hasPermission`)? Are function names verbs that describe what they do? Are there any misleading names — variables that suggest one type but hold another?
Code Organization	20%	Are changes in the right files/modules? Is responsibility well-separated — does each file/function do one thing? Are new abstractions justified and not premature? Are imports organized and minimal (no unused imports)? Is there clear separation between business logic, data access, and presentation? Are new files placed in the correct directory following existing project conventions?
Error Handling	15%	Are ALL failure cases handled — not just the obvious ones? Are errors surfaced meaningfully with context (not swallowed silently or re-thrown without added info)? Are edge cases like null, undefined, empty arrays, empty strings, 0, and negative numbers accounted for? Are async operations wrapped in proper error handling? Do error messages help the developer debug the issue? Are there any catch blocks that silently swallow errors?
DRY & Reusability	15%	Is there any duplicated logic — even across different files? Could repeated patterns be extracted into shared utilities? Are existing helpers and utilities in the codebase being used where appropriate instead of reinventing them? Are magic numbers and hardcoded strings extracted into named constants? Could any new code be written as a reusable function or module?
Complexity Management	15%	Are functions small and focused (ideally under 30 lines)? Is nesting kept to 3 levels or fewer? Are conditionals clear and not deeply nested — could guard clauses or early returns simplify them? Are there any god functions that do too many things? Could complex conditionals be extracted into well-named helper functions? Are ternaries readable or should they be if/else? Is cyclomatic complexity reasonable?
Consistency & Patterns	10%	Do the changes follow the existing codebase conventions for formatting, naming, file structure, and patterns? Are similar operations done the same way throughout the diff? Are there places where the new code contradicts patterns established elsewhere in the same PR? Are language/framework idioms followed?
Comments & Documentation	5%	Are complex decisions, non-obvious logic, and workarounds explained with comments? Are public APIs documented with parameter types, return values, and usage examples? Is there any dead or commented-out code that should be removed? Are TODO/FIXME/HACK comments present without linked issues or timelines? Are JSDoc/docstrings present for exported functions?

2. Non-Functional Requirements (Conditionally Scored)

This category applies only when the changes touch areas where NFRs are relevant. Not every PR has NFR implications — a typo fix in a README doesn't need a performance review. The skill should detect when NFRs matter and score accordingly.

When NFRs apply — score this category when the diff includes any of:

API endpoint changes or additions
Database queries or schema changes
Authentication/authorization logic
File I/O or network calls
Frontend rendering logic or component changes
Infrastructure or deployment configuration
Dependency additions or upgrades
Data processing or transformation pipelines
Caching or state management changes

When NFRs don't apply — skip this category (mark as N/A) for:

Documentation-only changes
Pure test additions with no production code
Code style/formatting-only changes
Comment or typo fixes
Renaming/refactoring with no behavioral change

What to evaluate (when applicable):

Criteria	Weight	What to look for
Performance	25%	Are there unnecessary loops, redundant API calls, or N+1 queries? Are database queries indexed and selective (no `SELECT *` without reason)? Are large datasets paginated or streamed instead of loaded into memory? Are there synchronous operations that could be parallelized? Are expensive computations cached or memoized where appropriate? Are there unnecessary re-renders in frontend components? Are objects or arrays being copied/spread unnecessarily in hot paths?
Security	25%	Is ALL user input validated and sanitized at the boundary — not just at the point of use? Are secrets, tokens, and credentials kept out of code and config files committed to the repo? Are auth checks present on every endpoint/route that needs them? Is data exposure minimized — are API responses returning only the fields the client needs? Are SQL queries parameterized? Is file upload validated for type and size? Are CORS, CSP, and other security headers considered? Are there any hardcoded credentials, API keys, or tokens?
Scalability	15%	Will this work when data volume grows 10x–100x? Are there unbounded queries, unbounded loops, or unbounded in-memory collections? Is resource usage (memory, connections, file handles) bounded and cleaned up? Are there potential race conditions under concurrent access? Are database transactions scoped tightly? Is connection pooling used where applicable?
Backwards Compatibility	15%	Are existing APIs, contracts, and interfaces preserved? Are breaking changes to public APIs clearly documented and justified? Are database schema changes backwards-compatible (additive only) or do they include proper migrations? Are deprecated features marked and given a migration path? Will existing clients/consumers break?
Observability	10%	Is there structured logging at appropriate levels (not just console.log)? Are errors logged with enough context to trace and reproduce? Are key business events tracked with metrics or events? Can an on-call engineer diagnose a failure in production from the logs this code produces? Are request IDs or correlation IDs propagated?
Accessibility	10%	(Frontend only) Are semantic HTML elements used instead of generic divs/spans? Is keyboard navigation fully supported (tab order, focus management, escape to close)? Are ARIA attributes present and correct where semantic HTML isn't sufficient? Do images have alt text? Is color contrast sufficient? Are form inputs labeled? Do interactive elements have visible focus indicators?

Adjust the weights dynamically based on what's relevant. If there are no frontend changes, redistribute the Accessibility weight across the other criteria. If there are no API changes, redistribute Backwards Compatibility weight. The total should always sum to 100%.

3. Test Coverage (Always Scored)

Evaluates the testing quality and coverage introduced or affected by the PR. This isn't about running a coverage tool — it's about analyzing the diff to assess whether the changes are well-tested.

What to evaluate:

Criteria	Weight	What to look for
Test Presence	25%	Are there test files in the diff? Does every new feature, bug fix, or behavioral change have a corresponding test? Are new exported functions/endpoints/public methods covered? If no tests are present and production code was changed, this criterion should score very low.
Happy Path Coverage	20%	Are the primary use cases tested end-to-end? Do tests verify the actual output/behavior, not just that the function runs? Are all major code paths through new logic exercised? Do tests cover the typical inputs a real user would provide?
Edge Case Coverage	20%	Are boundary conditions tested — empty inputs, null/undefined, zero, negative numbers, maximum values, empty arrays/objects? Are error states tested — what happens when dependencies fail, when input is malformed, when permissions are denied? Are concurrent/race conditions considered where relevant? Are off-by-one scenarios covered?
Assertion Quality	15%	Are assertions specific — checking exact values, specific error types, precise state changes? Or are they weak — just checking truthiness, checking for no error, checking array length without verifying contents? Do assertions validate behavior and outcome, not implementation details? Are error assertions checking the right error type/message, not just "it threw something"?
Test Structure	10%	Are tests organized with clear describe/context blocks? Are test names descriptive enough to understand the scenario without reading the test body? Is setup/teardown clean and using proper beforeEach/afterEach hooks? Are test utilities and fixtures well-organized? Is there unnecessary test duplication that could be parameterized?
Integration vs Unit Balance	10%	Is there an appropriate mix for the type of changes? Are pure logic functions unit-tested? Are workflows, API routes, and multi-component interactions integration-tested? Are external dependencies mocked at the right boundary — not too deep (testing mocks) and not too shallow (missing integration issues)?

Workflow

Step 1: Identify the base branch

Determine the base branch that the current branch will be merged into.

Run these git commands to gather context:

git branch --show-current
git remote show origin | grep "HEAD branch"
git log --oneline -5

Use git remote show origin to find the default branch (usually main or master). If the repo has a common base like develop or staging, infer from the branch naming convention or recent merge targets.

If the base branch cannot be confidently determined (e.g., ambiguous remote setup, multiple candidates), ask the user which branch they intend to merge into. Don't guess — getting the base branch wrong means the diff will be wrong, which means the entire scoring will be wrong.

Step 2: Get the full diff

Fetch the diff between the current branch and the base branch:

git diff <base-branch>...HEAD --stat
git diff <base-branch>...HEAD

Use the three-dot diff (...) because it shows only the changes introduced on the current branch since it diverged from the base — this is what a PR reviewer will actually see.

If the diff is extremely large (thousands of lines), use --stat first to understand the scope, then read the most important changed files selectively. For very large diffs, focus on:

New files (they tell the story of what was added)
Modified core files (business logic, not config/lock files)
Deleted files (what was removed and why)

Skip noisy files like lock files, auto-generated code, or minified assets — they add nothing to the scoring.

Step 3: Analyze the diff thoroughly

Before scoring anything, deeply understand the changes. Scoring without understanding leads to generic, useless feedback.

You must determine all of the following before proceeding to scoring:

What changed? (files, functions, components, APIs)
Why might it have changed? (bug fix, new feature, refactor, performance)
How do the changes work together? (trace the logic across files)
What's the scope? (small targeted fix or broad refactor?)
What's the tech stack? (language, framework, testing libraries)
Are there breaking changes? (API changes, schema migrations, removed exports)
What files are test files vs production code? (identify by path conventions like __tests__/, *.test.*, *.spec.*, test/, etc.)
What new public API surface was added? (exported functions, new endpoints, public methods)
What error paths exist in the new code? (every conditional, every try/catch, every early return)
What inputs does the new code accept? (user input, API payloads, query params, file uploads — and how are they validated?)
What dependencies were added or changed? (new packages, version bumps, removed deps)
Are there any leftover artifacts? (debug logs, commented-out code, TODO/FIXME without linked issues, temporary workarounds)

Group related changes into logical units. Understanding the intent behind the changes is critical for fair scoring — a deliberate tradeoff that is documented with a comment should be scored differently from an unexplained shortcut.

Step 4: Determine NFR applicability

Before scoring, decide whether the Non-Functional Requirements category applies to this PR. Review the diff through the lens of the NFR applicability rules defined above.

If NFRs apply:

Note which specific NFR criteria are relevant (not all sub-criteria apply to every PR)
Adjust weights to redistribute from non-applicable sub-criteria
All three categories will be scored and the overall score is the weighted average of all three

If NFRs don't apply:

Mark NFRs as "N/A" in the report
The overall score becomes the weighted average of Code Quality and Test Coverage only
Briefly explain why NFRs were not scored (e.g., "This PR contains only documentation changes — NFR scoring is not applicable")

Step 5: Score each category

For each applicable category, go through every criterion in the evaluation table and assign a score. Be specific and evidence-based — every score needs justification rooted in the actual diff, not generic advice.

Scoring guidelines:

Range	Label	Meaning
90–100	Excellent	Exemplary code. Minor or no issues. Ready to merge as-is.
75–89	Good	Solid work with minor improvements possible. Merge-ready with optional polish.
50–74	Needs Improvement	Noticeable issues that should be addressed before merging.
25–49	Poor	Significant problems. Requires substantial rework.
0–24	Critical	Fundamental issues. Should not be merged in current state.

How to score:

Score what's in the diff. The diff is the only evidence. Do not give credit for code that exists elsewhere in the codebase unless it's part of the diff.
Be proportional to scope. A 5-line bug fix doesn't need a comprehensive test suite — but it should have a targeted test for the fix.
Consider intent. A refactoring PR that doesn't add features shouldn't lose Test Coverage points for not adding new tests (as long as existing tests still pass/are updated).
Don't double-penalize. If a function has poor naming AND poor error handling, score each criterion independently — don't let one issue drag down the other.
Every score must have justification rooted in the actual diff — cite specific files, functions, and line ranges. No score should appear without an explanation.

Step 6: Calculate the overall score

Compute the overall PR score as a weighted average:

When all three categories apply:

Code Quality: 40%
Non-Functional Requirements: 30%
Test Coverage: 30%

When NFRs are N/A:

Code Quality: 55%
Test Coverage: 45%

Round the final score to the nearest integer.

Step 7: Generate the scores.md report

Read the template at assets/scores-template.md (relative to this skill's directory) and fill it in with the scoring results.

Write the completed report to scores.md in the repository root directory.

Step 8: Present a summary

After writing scores.md, present a brief summary in the chat:

The three category scores (or two if NFR is N/A) as a quick visual
The overall score with its label
The top 2-3 most critical findings — the specific issues that must be fixed
Let the user know that the full detailed report is in scores.md

Keep the chat summary concise — the detailed report is in the file. The chat message is the "at a glance" view, like seeing your Lighthouse scores before diving into the full report.

Do not include the commit hash or commit message anywhere in the summary or the report.

pr-scorer

Safety Notice

Copy this and send it to your AI assistant to learn