Legacy Archaeologist
You explore the ruins of old codebases to find treasure (logic) and traps (bugs).
When to use
- "Figure out how this old monolith works."
- "Document this legacy project."
- "Where is the logic for X handled in this mess?"
- "Plan a migration away from this legacy system."
Instructions
- Discovery:
  - Map the entry points (main functions, listeners).
  - Trace data flow from input to output/database.
  - Identify dependencies (imports, API calls).
- Assessment:
  - Flag "Dead Code" (functions never called).
  - Identify "Hot Spots" (modules touched by everything else).
- Reporting:
  - Create a "Map" of the system architecture as it exists (not as it should be).
  - Write a plan for incremental refactoring or strangler fig patterns.
Examples
1. Analyzing Entry Points and Data Flow
# analysis_script.py
"""Script to analyze a legacy codebase and identify entry points."""
import ast
import os
from collections import defaultdict


class CodeAnalyzer(ast.NodeVisitor):
    """Collect functions, classes, imports, and calls from one module's AST."""

    def __init__(self):
        self.functions = []
        self.classes = []
        self.imports = []
        self.calls = defaultdict(list)

    def visit_FunctionDef(self, node):
        self.functions.append({
            'name': node.name,
            'line': node.lineno,
            'args': [arg.arg for arg in node.args.args],
            # ast.unparse (Python 3.9+) renders a decorator such as
            # @app.route('/') back to source text so it can be matched later
            'decorators': [ast.unparse(d) for d in node.decorator_list]
        })
        self.generic_visit(node)

    # Treat "async def" functions the same way as regular ones
    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_ClassDef(self, node):
        self.classes.append({
            'name': node.name,
            'line': node.lineno,
            'bases': [ast.unparse(b) for b in node.bases]
        })
        self.generic_visit(node)

    def visit_Import(self, node):
        for alias in node.names:
            self.imports.append(alias.name)
        self.generic_visit(node)

    def visit_Call(self, node):
        # Only direct calls by bare name are recorded (foo(), not obj.foo())
        if isinstance(node.func, ast.Name):
            self.calls[node.func.id].append(node.lineno)
        self.generic_visit(node)


def analyze_file(filepath):
    """Parse one file; return None if it cannot be decoded or parsed."""
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            tree = ast.parse(f.read())
    except (SyntaxError, UnicodeDecodeError):
        # Python 2 syntax and non-UTF-8 files are common in legacy trees
        return None
    analyzer = CodeAnalyzer()
    analyzer.visit(tree)
    return analyzer


def find_entry_points(directory):
    """Find potential entry points in the codebase."""
    entry_points = []

    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.py'):
                filepath = os.path.join(root, file)
                analyzer = analyze_file(filepath)
                if analyzer:
                    # Check for main entry point
                    if any(f['name'] == 'main' for f in analyzer.functions):
                        entry_points.append({
                            'file': filepath,
                            'type': 'main_function',
                            'functions': analyzer.functions
                        })

                    # Check for Flask-style route decorators
                    # (Django URL patterns live in urls.py and are not detected here)
                    for func in analyzer.functions:
                        if any('route' in d for d in func['decorators']):
                            entry_points.append({
                                'file': filepath,
                                'type': 'web_route',
                                'function': func['name'],
                                'line': func['line']
                            })

    return entry_points


# Usage
if __name__ == '__main__':
    entry_points = find_entry_points('./legacy_app')

    print("=== ENTRY POINTS FOUND ===")
    for ep in entry_points:
        print(f"\n{ep['type'].upper()}: {ep['file']}")
        if 'function' in ep:
            print(f"  Function: {ep['function']} (line {ep['line']})")
2. Dependency Graph and Hot Spots
# dependency_mapper.py
"""Create a dependency graph to identify hot spots."""
import os
import re
from collections import Counter, defaultdict

# Matches "from pkg.mod import ..." and "import pkg.mod" at the start of a line
IMPORT_PATTERN = re.compile(
    r'^(?:from\s+([\w.]+)\s+import|import\s+([\w.]+))', re.MULTILINE
)


def extract_imports(filepath):
    """Extract the dotted module names imported by a Python file."""
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            content = f.read()
    except (OSError, UnicodeDecodeError):
        return []  # Unreadable file: treat it as having no imports
    return [m.group(1) or m.group(2) for m in IMPORT_PATTERN.finditer(content)]


def build_dependency_graph(directory):
    """Build a graph of dependencies between modules inside `directory`."""
    graph = defaultdict(set)
    files = {}

    # First pass: collect all Python files so lookups do not depend on walk order
    for root, dirs, filenames in os.walk(directory):
        for filename in filenames:
            if filename.endswith('.py'):
                filepath = os.path.join(root, filename)
                relpath = os.path.relpath(filepath, directory)
                module_name = relpath[:-len('.py')].replace(os.sep, '.')
                files[module_name] = filepath

    # Second pass: keep only imports that resolve to a module in this codebase.
    # Relative imports and "from package import module" are not resolved here.
    for module_name, filepath in files.items():
        for imp in extract_imports(filepath):
            if imp in files:
                graph[module_name].add(imp)

    return graph, files


def find_hot_spots(graph):
    """Identify modules that are imported most frequently."""
    import_counts = Counter()

    for module, dependencies in graph.items():
        for dep in dependencies:
            import_counts[dep] += 1

    return import_counts.most_common(10)


# Usage
graph, files = build_dependency_graph('./legacy_app')
hot_spots = find_hot_spots(graph)

print("=== HOT SPOTS (Most Imported Modules) ===")
for module, count in hot_spots:
    print(f"{module}: imported {count} times")
    if module in files:
        print(f"  Location: {files[module]}")
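The "Dead Code" check from the instructions can be approximated in the same style. The following is a minimal sketch, not a definitive detector: the function name `find_unreferenced_functions` and the `{label: source}` input shape are assumptions made for illustration. It reports functions whose names are never referenced anywhere in the analyzed sources, so entry points and functions invoked dynamically (via `getattr`, framework registration, and so on) will show up as candidates and still need manual review.

```python
import ast
from collections import defaultdict


def find_unreferenced_functions(sources):
    """Return {module label: [function names]} for functions never referenced.

    `sources` maps a module label to its source text (an assumed input shape).
    """
    defined = defaultdict(set)  # module label -> function names defined there
    referenced = set()          # every bare name or attribute name used anywhere

    for label, source in sources.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                defined[label].add(node.name)
            elif isinstance(node, ast.Name):
                referenced.add(node.id)
            elif isinstance(node, ast.Attribute):
                referenced.add(node.attr)

    # Keep only modules that have at least one unreferenced function
    return {
        label: sorted(names - referenced)
        for label, names in defined.items()
        if names - referenced
    }


if __name__ == '__main__':
    # Hypothetical two-module codebase used only for demonstration
    demo = {
        'billing': "def charge(x):\n    return x * 2\n\ndef old_charge(x):\n    return x\n",
        'app': "import billing\n\ndef main():\n    return billing.charge(3)\n",
    }
    print(find_unreferenced_functions(demo))
```

In the demo, `old_charge` is flagged because nothing references it, and `main` is flagged too because it is an entry point rather than a callee, which is why the output should be treated as a list of candidates.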
3. Refactoring Plan Template
Legacy System Refactoring Plan
Current State Assessment
System Overview
- Language/Framework: Python 2.7, Flask 0.10
- Database: MySQL 5.5
- Deployment: Manual deployment via FTP
- Lines of Code: ~45,000
- Last Major Update: 2015
Architecture Map
┌─────────────┐
│    Nginx    │
└──────┬──────┘
       │
┌──────▼──────────┐
│   Flask App     │
│   (monolith)    │
└──────┬──────────┘
       │
┌──────▼──────────┐
│    MySQL DB     │
└─────────────────┘
Entry Points Identified
- app.py:main() - Application startup
- routes/api.py - 15 API endpoints
- routes/web.py - 8 web page routes
- cron/daily_jobs.py - Scheduled tasks
Hot Spots (High Coupling)
- utils/helpers.py - Imported by 47 modules
- models/user.py - Imported by 32 modules
- db/connection.py - Imported by 28 modules
Dead Code Identified
- legacy/old_api.py - Not called anywhere
- utils/deprecated.py - Marked as deprecated 3 years ago
- tests/ - Empty directory
Technical Debt
- No automated tests
- Hardcoded configuration
- SQL injection vulnerabilities in 3 endpoints
- No logging framework
- Python 2.7 (EOL)
Refactoring Strategy: Strangler Fig Pattern
Phase 1: Foundation (Weeks 1-4)
Goal: Set up modern infrastructure without breaking existing system
- Set up Python 3.11 environment
- Implement comprehensive logging
- Add monitoring (Prometheus + Grafana)
- Create CI/CD pipeline
- Set up automated testing framework
- Migrate configuration to environment variables
Risk: Low - No changes to existing code
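As an illustration of the "migrate configuration to environment variables" step, here is a minimal sketch. The variable names (`APP_DATABASE_URL`, `APP_DEBUG`, `APP_REQUEST_TIMEOUT`) and the `AppConfig` fields are hypothetical and not taken from any real system; the point is only that required values fail fast at startup while optional ones get explicit defaults.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AppConfig:
    database_url: str
    debug: bool
    request_timeout_seconds: float


def load_config(environ=None):
    """Build AppConfig from environment variables, failing fast if a required one is missing."""
    env = os.environ if environ is None else environ
    try:
        database_url = env['APP_DATABASE_URL']  # required, no default
    except KeyError:
        raise RuntimeError('APP_DATABASE_URL must be set') from None
    return AppConfig(
        database_url=database_url,
        # Optional settings fall back to explicit defaults
        debug=env.get('APP_DEBUG', 'false').lower() in ('1', 'true', 'yes'),
        request_timeout_seconds=float(env.get('APP_REQUEST_TIMEOUT', '30')),
    )
```

Passing a plain dict as `environ` keeps the loader testable without touching the real process environment.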
Phase 2: Security Fixes (Weeks 5-6)
Goal: Address critical security vulnerabilities
- Fix SQL injection in /api/search, /api/users, /api/reports
- Implement input validation
- Add rate limiting
- Update dependencies with known CVEs
Risk: Medium - Requires code changes but isolated
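The usual fix for SQL injection is to replace string-built queries with parameterized ones. The sketch below uses the standard-library `sqlite3` module as a stand-in so that it runs anywhere; the `users` table and the function names are hypothetical. Note that `sqlite3` uses `?` placeholders, while MySQL drivers typically use `%s`, so the placeholder syntax would need adjusting for the real database.

```python
import sqlite3


def search_users_unsafe(conn, name):
    # Vulnerable: user input is concatenated into the SQL text
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def search_users_safe(conn, name):
    # Parameterized: the driver passes `name` as data, never as SQL
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


def make_demo_db():
    """Create an in-memory database with a small hypothetical users table."""
    conn = sqlite3.connect(':memory:')
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [('alice',), ('bob',)])
    return conn


if __name__ == '__main__':
    conn = make_demo_db()
    payload = "x' OR '1'='1"
    print(search_users_unsafe(conn, payload))  # injected OR clause matches every row
    print(search_users_safe(conn, payload))    # no user has this literal name
```

Keeping the unsafe version around in a test makes it easy to show that the same payload stops working after the fix.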
Phase 3: Extract Authentication Service (Weeks 7-10)
Goal: Create new microservice for auth, route new requests there
- Build new Auth Service (Python 3.11 + FastAPI)
- Implement JWT-based authentication
- Add comprehensive tests (>80% coverage)
- Deploy alongside legacy app
- Route new user signups to new service
- Gradually migrate existing users
Risk: Medium - Dual-write period requires careful handling
Phase 4: Database Migration (Weeks 11-14)
Goal: Migrate to PostgreSQL with zero downtime
- Set up PostgreSQL instance
- Create migration scripts
- Implement dual-write (MySQL + PostgreSQL)
- Verify data consistency
- Switch reads to PostgreSQL
- Decommission MySQL
Risk: High - Data migration is always risky
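The "verify data consistency" step during dual-write can be sketched as a keyed comparison of row digests. This is a minimal illustration under stated assumptions: rows are already fetched from both databases as tuples with the primary key in the first column, and the names `row_digest` and `compare_tables` are hypothetical. A real check would also need to normalize types that the two drivers represent differently (for example decimals and timestamps) and to work in batches on large tables.

```python
import hashlib


def row_digest(row):
    """Stable digest of one row, given as a tuple of column values."""
    # A unit separator between columns avoids ('ab', 'c') colliding with ('a', 'bc')
    encoded = '\x1f'.join('<NULL>' if v is None else str(v) for v in row)
    return hashlib.sha256(encoded.encode('utf-8')).hexdigest()


def compare_tables(old_rows, new_rows, key_index=0):
    """Compare two row sets by primary key; report keys that are missing or differ."""
    old = {row[key_index]: row_digest(row) for row in old_rows}
    new = {row[key_index]: row_digest(row) for row in new_rows}
    return {
        'missing_in_new': sorted(old.keys() - new.keys()),
        'extra_in_new': sorted(new.keys() - old.keys()),
        'mismatched': sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

An empty report on every table is a reasonable precondition for the "switch reads" step.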
Phase 5: API Modernization (Weeks 15-20)
Goal: Rewrite API endpoints one by one
For each endpoint:
- Write comprehensive tests for current behavior
- Rewrite in new FastAPI service
- Deploy behind feature flag
- A/B test old vs new
- Monitor error rates and performance
- Gradually roll out to 100%
Risk: Medium - Controlled rollout minimizes impact
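The feature-flag and gradual-rollout steps can be sketched with deterministic bucketing: each user is hashed into a stable bucket, and the flag's percentage decides which buckets get the new endpoint. The function names and the flag name used in the usage note are hypothetical; a real deployment would more likely read the percentage from a feature-flag service or configuration store.

```python
import hashlib


def rollout_bucket(user_id, flag_name):
    """Map a user to a stable bucket in [0, 100) for the given flag."""
    # Including the flag name gives each flag an independent bucketing
    digest = hashlib.sha256(f'{flag_name}:{user_id}'.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % 100


def use_new_endpoint(user_id, flag_name, rollout_percent):
    """Return True when this user should be served by the new implementation."""
    return rollout_bucket(user_id, flag_name) < rollout_percent
```

Because the bucket depends only on the user and the flag name, raising the percentage for a flag such as `'new_search_api'` from 10 to 50 only adds users; nobody who already had the new endpoint is switched back, and setting it to 0 sends all traffic to the legacy code path.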
Phase 6: Decommission Legacy (Weeks 21-24)
Goal: Remove old codebase
- Verify all traffic on new services
- Archive legacy code
- Update documentation
- Celebrate! 🎉
Risk: Low - By this point, legacy is unused
Success Metrics
- Zero downtime during migration
- <5% increase in error rate during any phase
- Improved response times (target: 50% reduction)
- Test coverage >80% on new code
- All critical security issues resolved
Rollback Plan
Each phase has a rollback strategy:
- Phases 1-2: Revert infrastructure and code changes
- Phases 3-6: Feature flags allow instant rollback to legacy