Code Clone Assistant
Detect code clones and guide refactoring using PMD CPD (exact duplicates) + Semgrep (patterns).
Tools
- PMD CPD v7.17.0+: Exact duplicate detection
- Semgrep v1.140.0+: Pattern-based detection
Tested: October 2025 - 30 violations detected across 3 sample files
Coverage: ~3x more violations than using either tool alone
When to Use This Skill
Use this skill when:
- Finding duplicate code in a codebase
- Detecting DRY violations
- Refactoring similar code patterns
- Identifying copy-paste code
Why Two Tools?
PMD CPD and Semgrep detect different clone types:
| Aspect | PMD CPD | Semgrep |
|---|---|---|
| Detects | Exact copy-paste duplicates | Similar patterns with variations |
| Scope | Across files ✅ | Within/across files (Pro only) |
| Matching | Token-based (ignores formatting) | Pattern-based (AST matching) |
| Rules | ❌ No custom rules | ✅ Custom rules |
Result: Using both finds ~3x more DRY violations.
Clone Types
| Type | Description | PMD CPD | Semgrep |
|---|---|---|---|
| Type-1 | Exact copies | ✅ Default | ✅ |
| Type-2 | Renamed identifiers | ✅ --ignore-* | ✅ |
| Type-3 | Near-miss with variations | ⚠️ Partial | ✅ Patterns |
| Type-4 | Semantic clones (same behavior) | ❌ | ❌ |
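To make the difference between Type-1 and Type-2 clones concrete, here is a minimal sketch built on Python's standard `tokenize` module. It illustrates token-based comparison in general and is not PMD CPD's actual implementation. The function name `token_fingerprint` and the `ignore_identifiers` flag are hypothetical, chosen only for this example.

```python
import io
import keyword
import tokenize

# Structural tokens whose text is whitespace; keep them, but by type name only.
_STRUCTURAL = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
# Tokens that carry no program meaning at all.
_IGNORED = {tokenize.COMMENT, tokenize.NL}


def token_fingerprint(source: str, ignore_identifiers: bool = False) -> tuple:
    """Return the token sequence of `source` with formatting and comments removed."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _IGNORED:
            continue
        if tok.type in _STRUCTURAL:
            tokens.append(tokenize.tok_name[tok.type])
        elif (ignore_identifiers and tok.type == tokenize.NAME
              and not keyword.iskeyword(tok.string)):
            # Collapse every non-keyword name so that renames no longer matter.
            tokens.append("ID")
        else:
            tokens.append(tok.string)
    return tuple(tokens)


original = "def total(items):\n    return sum(i.price for i in items)\n"
reformatted = "def total( items ):  # same code\n    return sum(i.price   for i in items)\n"
renamed = "def cost(rows):\n    return sum(r.price for r in rows)\n"

# Type-1: identical once spacing and comments are ignored.
print(token_fingerprint(original) == token_fingerprint(reformatted))
# Type-2: identical only after identifiers are normalized.
print(token_fingerprint(original) == token_fingerprint(renamed))
print(token_fingerprint(original, True) == token_fingerprint(renamed, True))
```

Type-3 clones add or remove statements, so equal fingerprints are no longer enough to find them. That is the gap the table assigns to pattern-based rules.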
Quick Start Workflow
Step 1: Detect exact duplicates (PMD CPD)
pmd cpd -d . -l python --minimum-tokens 20 -f markdown > pmd-results.md
Step 2: Detect pattern violations (Semgrep)
semgrep --config=clone-rules.yaml --sarif --quiet > semgrep-results.sarif
Step 3: Analyze combined results (Claude Code)
Parse both outputs, prioritize by severity
Step 4: Refactor (Claude Code with user approval)
Extract shared functions, consolidate patterns, verify tests
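Step 3 can be sketched for the Semgrep half of the results as follows. This is a minimal example that assumes the standard SARIF 2.1.0 layout (`runs[].results[]`, with a rule-level `defaultConfiguration.level` as the fallback). The function name `load_sarif_findings`, the rule IDs, and the sample data are hypothetical. Parsing the PMD CPD markdown report is left out because its exact layout is not described here.

```python
import json
from typing import Any, Dict, List

# SARIF result levels ordered from most to least severe.
_LEVEL_RANK = {"error": 0, "warning": 1, "note": 2, "none": 3}


def load_sarif_findings(sarif_text: str) -> List[Dict[str, Any]]:
    """Flatten SARIF results into dicts sorted by severity, then file and line."""
    report = json.loads(sarif_text)
    findings = []
    for run in report.get("runs", []):
        rules = run.get("tool", {}).get("driver", {}).get("rules", [])
        default_level = {
            rule["id"]: rule.get("defaultConfiguration", {}).get("level", "warning")
            for rule in rules if "id" in rule
        }
        for result in run.get("results", []):
            rule_id = result.get("ruleId", "<unknown>")
            # Prefer the result's own level, then the rule default, then "warning".
            level = result.get("level") or default_level.get(rule_id, "warning")
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            findings.append({
                "rule": rule_id,
                "level": level,
                "file": physical.get("artifactLocation", {}).get("uri", "<unknown>"),
                "line": physical.get("region", {}).get("startLine", 0),
                "message": result.get("message", {}).get("text", ""),
            })
    findings.sort(key=lambda f: (_LEVEL_RANK.get(f["level"], 99), f["file"], f["line"]))
    return findings


# Hypothetical sample standing in for semgrep-results.sarif.
sample = json.dumps({
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {"name": "demo", "rules": [
            {"id": "dup-validation", "defaultConfiguration": {"level": "error"}},
        ]}},
        "results": [
            {"ruleId": "dup-logging", "level": "note",
             "message": {"text": "repeated logging block"},
             "locations": [{"physicalLocation": {
                 "artifactLocation": {"uri": "b.py"}, "region": {"startLine": 7}}}]},
            {"ruleId": "dup-validation",
             "message": {"text": "repeated validation"},
             "locations": [{"physicalLocation": {
                 "artifactLocation": {"uri": "a.py"}, "region": {"startLine": 12}}}]},
        ],
    }],
})

for finding in load_sarif_findings(sample):
    print(f'{finding["level"]:7} {finding["file"]}:{finding["line"]} {finding["rule"]}')
```

In practice the text would come from reading `semgrep-results.sarif` rather than from an inline sample.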
Accepted Exceptions (Known Intentional Duplication)
Not all code duplication is a problem. Some codebases deliberately use copy-and-adapt patterns where refactoring would be harmful. When running clone detection, always check for accepted exceptions before recommending refactoring.
When Duplication Is Acceptable
| Pattern | Why Acceptable | Example |
|---|---|---|
| Generation-per-directory experiments | Each generation is an immutable, self-contained experiment. Sharing code across generations would break provenance and make past experiments non-reproducible. | SQL templates, sweep scripts where each gen{NNN}/ is independent |
| SQL templates with placeholder substitution | SQL has no import/include mechanism. Templates use sed placeholder replacement (PLACEHOLDER), not function calls. Extracting shared CTEs into separate files would break the single-file execution model. | ClickHouse sweep templates sharing signal detection + metrics CTEs |
| Protocol/schema boilerplate | Serialization formats, API contracts, and wire protocols require exact structure in each location. Abstracting them hides the contract. | NDJSON telemetry line construction in wrapper scripts |
| Test fixtures and golden files | Test data intentionally duplicates production patterns to verify behavior. Sharing fixtures creates brittle cross-test dependencies. | Test setup code, expected output snapshots |
How to Report Accepted Exceptions
When clone detection finds duplication that matches an accepted exception pattern:
- Report it: always show the user what was found (lines, tokens, files)
- Flag as accepted: explicitly state it matches a known exception pattern
- Explain why: cite the specific reason refactoring is not recommended
- Do NOT recommend refactoring: this is the key difference from actionable findings
Example output format:
Code Clone Analysis Results
PMD CPD Findings:

Clone 1: 115 lines (575 tokens) - base_bars → signals CTEs
  gen610_template.sql:33 ↔ gen710_template.sql:38
  Status: ACCEPTED EXCEPTION (generation-per-directory experiment)
  Reason: Each generation is immutable. Shared CTEs would break experiment provenance and reproducibility.

Clone 2: 36 lines (478 tokens) - metrics aggregation
  gen610_template.sql:207 ↔ gen710_template.sql:244
  Status: ACCEPTED EXCEPTION (SQL template without include mechanism)

Actionable Findings: 0
Accepted Exceptions: 2
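A report in roughly this shape could be produced by a small renderer like the sketch below. The `Clone` dataclass, the `render_report` function, and the `Status: ACTIONABLE` line for non-exception findings are all hypothetical; only the accepted-exception layout comes from the example above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Clone:
    lines: int
    tokens: int
    label: str
    left: str                        # "file:line" of one occurrence
    right: str                       # "file:line" of the other occurrence
    exception: Optional[str] = None  # matched exception pattern, if any
    reason: Optional[str] = None     # why refactoring is not recommended


def render_report(clones: List[Clone]) -> str:
    """Render every finding, flagging accepted exceptions instead of hiding them."""
    out = ["Code Clone Analysis Results", "", "PMD CPD Findings:"]
    for number, clone in enumerate(clones, start=1):
        out.append("")
        out.append(f"Clone {number}: {clone.lines} lines ({clone.tokens} tokens) - {clone.label}")
        out.append(f"  {clone.left} ↔ {clone.right}")
        if clone.exception:
            out.append(f"  Status: ACCEPTED EXCEPTION ({clone.exception})")
            if clone.reason:
                out.append(f"  Reason: {clone.reason}")
        else:
            # Hypothetical label for findings that should lead to refactoring advice.
            out.append("  Status: ACTIONABLE")
    accepted = sum(1 for clone in clones if clone.exception)
    out += ["", f"Actionable Findings: {len(clones) - accepted}",
            f"Accepted Exceptions: {accepted}"]
    return "\n".join(out)


report = render_report([
    Clone(115, 575, "base_bars → signals CTEs",
          "gen610_template.sql:33", "gen710_template.sql:38",
          exception="generation-per-directory experiment",
          reason="Each generation is immutable."),
    Clone(36, 478, "metrics aggregation",
          "gen610_template.sql:207", "gen710_template.sql:244",
          exception="SQL template without include mechanism"),
])
print(report)
```

Accepted exceptions stay in the same list as actionable findings, so the user always sees what was detected, in line with the "Report it" rule.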
Project-Level Exception Configuration
Projects can declare accepted exception patterns in their CLAUDE.md:
Code Clone Exceptions
- sql/gen*_template.sql: generation-per-directory experiments (immutable)
- scripts/gen*/: copy-and-adapt sweep scripts (no shared infrastructure)
- tests/fixtures/: intentional duplication for test isolation
When this section exists in a project's CLAUDE.md, the code-clone-assistant should check it before classifying findings.
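One way to apply such patterns during classification is sketched below with the standard `fnmatch` module. The matching semantics are assumptions, since the section does not define them. Here a trailing slash means "everything under this directory", and a clone counts as accepted only when every file involved is covered. The names `matching_exception` and `classify_clone` are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Optional, Tuple

# Patterns and reasons taken from the example CLAUDE.md section above.
EXCEPTIONS: Dict[str, str] = {
    "sql/gen*_template.sql": "generation-per-directory experiments (immutable)",
    "scripts/gen*/": "copy-and-adapt sweep scripts (no shared infrastructure)",
    "tests/fixtures/": "intentional duplication for test isolation",
}


def matching_exception(path: str,
                       exceptions: Dict[str, str] = EXCEPTIONS) -> Optional[Tuple[str, str]]:
    """Return (pattern, reason) for the first pattern covering `path`, else None."""
    for pattern, reason in exceptions.items():
        # Assumption: a trailing slash means "everything under this directory".
        glob = pattern + "*" if pattern.endswith("/") else pattern
        if fnmatchcase(path, glob):
            return pattern, reason
    return None


def classify_clone(files: List[str]) -> str:
    """Accept a clone only if every file it touches matches an exception pattern."""
    if files and all(matching_exception(f) for f in files):
        return "ACCEPTED EXCEPTION"
    return "ACTIONABLE"


print(classify_clone(["sql/gen610_template.sql", "sql/gen710_template.sql"]))
print(classify_clone(["sql/gen610_template.sql", "src/report.py"]))
```

Requiring all files to match is the conservative choice. A clone shared between an exempt template and ordinary source code still surfaces as actionable.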
Reference Documentation
For detailed information, see:
- Detection Commands - PMD CPD and Semgrep command details
- Complete Workflow - Detection, analysis, and presentation phases
- Refactoring Strategies - Approaches for addressing violations
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| PMD CPD not found | Not installed or not in PATH | brew install pmd or download from PMD releases |
| Semgrep timeout | Large codebase scan | Use --exclude to limit scope |
| No duplicates detected | minimum-tokens too high | Lower --minimum-tokens value (try 15) |
| Too many false positives | minimum-tokens too low | Increase --minimum-tokens (try 30+) |
| Language not recognized | Wrong -l flag | Check PMD CPD supported languages list |
| SARIF parse error | Semgrep output malformed | Upgrade Semgrep to latest version |
| Memory error on large repo | Java heap too small | Set PMD_JAVA_OPTS=-Xmx4g |
| Missing clone rules file | Custom rules not created | Create clone-rules.yaml or use default config |