Python Profiling & Optimization
A structured, measurement-driven workflow for making Python code faster and leaner. The core principle: never optimize without measuring first, and never trust an optimization without measuring after.
Before you start
Read references/tools-cheatsheet.md for detailed command syntax for each profiling
tool. It covers installation, usage patterns, and output interpretation.
Phase 1: Understand the project
Before profiling anything, gather context:
- Python version: Check pyproject.toml for requires-python and target-version. This determines which tools and stdlib features are available.
- Package manager: Look for uv.lock, poetry.lock, Pipfile.lock, or requirements.txt to determine how dependencies are managed.
- Existing benchmarks: Search for pytest-benchmark, benchmark directories, or profiling scripts already in the project.
- Test suite: Understand how tests run so you can validate correctness after each optimization.
Phase 2: Establish a baseline
You cannot improve what you haven't measured. Before any optimization:
If the project has pytest-benchmark
Save a named baseline snapshot:
uv run pytest <benchmark_file> -m slow \
--benchmark-only --benchmark-disable-gc \
--benchmark-save=baseline
If the project lacks benchmarks
Create a minimal benchmark file targeting the code to optimize. Use pytest-benchmark
with pedantic() for stable, reproducible results:
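import pytest


@pytest.mark.slow()
class TestPerformance:
    def test_hot_path(self, benchmark):
        # Setup outside the measured region. create_object() is a placeholder
        # for whatever builds the object under test in your project.
        obj = create_object()
        # 10 rounds, each calling hot_method 1000 times.
        benchmark.pedantic(obj.hot_method, rounds=10, iterations=1000)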
Prefer pedantic() over the plain benchmark() call: it fixes the number of rounds
and iterations explicitly instead of letting pytest-benchmark calibrate them, so
successive runs execute the same workload and are easier to compare.
Quick ad-hoc baseline (no benchmark framework)
For quick exploration before setting up proper benchmarks:
import cProfile
cProfile.run('function_to_profile()', sort='cumulative')
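When the one-liner prints too much, the same stdlib modules can limit the report to the top entries. This is a sketch under stated assumptions: function_to_profile here is a hypothetical stand-in for your own code, and the limit of 10 rows is arbitrary.

```python
import cProfile
import io
import pstats


def function_to_profile():
    """Hypothetical stand-in for the code under investigation."""
    return sum(i * i for i in range(50_000))


profiler = cProfile.Profile()
profiler.enable()
result = function_to_profile()
profiler.disable()

# Sort by cumulative time and keep only the 10 most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Capturing the report in a string also makes it easy to save alongside the baseline numbers.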
Phase 3: Identify bottlenecks
Use the right tool for the job. Start broad, then narrow down.
Decision tree
Is the problem CPU-bound or memory-bound?

CPU-bound:
  Need a quick overview? .............. cProfile (stdlib, zero install)
  Need per-line granularity? .......... line_profiler (uv pip install)
  Can't modify code / need sampling? .. py-spy (uv pip install)
  Want CPU + memory together? ......... scalene (uv pip install)

Memory-bound:
  Quick stdlib check? ................. tracemalloc (stdlib, zero install)
  Need detailed allocations/flamegraph? memray (uv pip install)
  Want CPU + memory together? ......... scalene (uv pip install)
Tool installation
For tools not in the project's dependencies, install them into the project's virtual environment without adding them to the declared dependencies, so the project metadata stays clean. Always ask the user before installing:
# Temporary install (lost if venv is recreated)
uv pip install line-profiler
uv pip install py-spy
uv pip install memray
uv pip install scalene
# Or add as dev dependency (persists across venv recreations)
uv add --group dev line-profiler
Recommend uv pip install by default — profiling tools are typically used
temporarily during optimization work, not as permanent project dependencies.
Profiling workflow
- Start broad with cProfile: Identify which functions consume the most time. Look at cumulative time (cumtime) to find the call trees that matter.
- Narrow down with line_profiler: Once you know which function is hot, profile it line-by-line to find the exact bottleneck.
- For production or running processes: Use py-spy to attach to a running process without modifying code or restarting.
- For memory issues: Start with tracemalloc snapshots, then graduate to memray for flamegraphs and detailed allocation tracking.
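The tracemalloc starting point from the last step can be sketched as a before/after snapshot comparison. The build_table function is a hypothetical allocation-heavy workload used only for illustration; substitute the code you suspect.

```python
import tracemalloc


def build_table(n):
    """Hypothetical allocation-heavy function."""
    return [str(i) * 10 for i in range(n)]


tracemalloc.start()
before = tracemalloc.take_snapshot()
table = build_table(100_000)
after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Differences are grouped by source line, largest change first.
top_stats = after.compare_to(before, "lineno")
for stat in top_stats[:5]:
    print(stat)
```

Each printed line names a file and line number with the size and count difference, which is usually enough to decide whether a deeper memray session is worthwhile.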
Phase 4: Optimize (one change at a time)
Each optimization must be:
- A single, focused change — Don't bundle multiple optimizations together. If one of them causes a regression, you won't know which.
- Measured immediately — Run benchmarks right after the change.
- Validated for correctness — Run the full test suite. A faster wrong answer is worse than a slow correct one.
Common Python optimization patterns
Listed roughly by impact and safety (safest first):
- Algorithm/data structure: The highest-impact changes. Replace O(n) lookups with O(1) lookups in a dict or set. Use min()/max() instead of sorting when you only need an extreme value. Eliminate quadratic nested loops.
- Reduce allocations: Reuse objects instead of creating new ones in hot loops. Use __slots__ on frequently instantiated classes. Prefer tuples over lists for fixed-size sequences.
- Cache repeated work: functools.lru_cache or functools.cache (3.9+) for pure functions. Manual caching with dict for methods. __hash__ caching for objects used as dict keys.
- Avoid unnecessary copies: str.join() instead of += in loops. Generator expressions instead of list comprehensions when you only iterate once.
- Move work out of hot loops: Hoist attribute lookups (self.x → local variable) and method resolution out of the loop, and do constant computation once at import time.
- Use stdlib accelerators: collections.deque for queue operations, bisect for sorted insertion, itertools for iterator patterns.
- Leverage C extensions: re.compile() for repeated regex, struct.pack() for binary data, array.array for homogeneous numeric data.
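Three of these patterns are illustrated below as a minimal sketch. All function and variable names are hypothetical, and whether a given rewrite actually helps in your code still has to be confirmed by a benchmark.

```python
from functools import lru_cache

# 1. Data structure: membership tests against a set instead of a list.
allowed_list = list(range(10_000))
allowed_set = set(allowed_list)  # O(1) average lookup instead of O(n)


def count_allowed(values, allowed):
    """Count how many values are members of the allowed container."""
    return sum(1 for v in values if v in allowed)


# 2. Cache repeated work: memoize a pure function.
@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)


# 3. Avoid unnecessary copies: join once instead of += in a loop.
def render_slow(parts):
    out = ""
    for part in parts:
        out += part  # may copy the growing string on each iteration
    return out


def render_fast(parts):
    return "".join(parts)
```

In each pair the faster variant returns the same result as the slower one, which is the property the test suite should confirm after the change.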
After each optimization
# Measure
uv run pytest <benchmark_file> -m slow \
--benchmark-only --benchmark-disable-gc \
--benchmark-save=<optimization-label> \
--benchmark-compare=<baseline-number>
# Validate correctness
uv run pytest # full test suite
Log results
Keep a progress log documenting each optimization:
### Optimization N: <title>
| Benchmark | Before | After | Delta |
|-----------|--------|-------|-------|
| ... | ... | ... | ...% |
**Commit:** `<hash>`
**Description:** ...
**Tests pass:** yes/no
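The Delta column is the relative change from the Before value. A small helper can keep the arithmetic and formatting consistent across entries. This is a sketch only: the function names and the microsecond unit are assumptions, not part of pytest-benchmark.

```python
def percent_delta(before: float, after: float) -> float:
    """Relative change in percent; negative means the new timing is faster."""
    return (after - before) / before * 100.0


def log_row(name: str, before_us: float, after_us: float) -> str:
    """Format one row of the progress table, timings in microseconds."""
    delta = percent_delta(before_us, after_us)
    return f"| {name} | {before_us:.1f} us | {after_us:.1f} us | {delta:+.1f}% |"


print(log_row("test_hot_path", 200.0, 150.0))
```

For example, going from 200 to 150 microseconds is a change of (150 - 200) / 200 = -25%.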
Phase 5: Deep investigation
When the broad tools aren't enough, go deeper.
pytest-benchmark + cProfile integration
Get per-function breakdown within a specific benchmark:
uv run pytest <benchmark_file>::<TestClass>::<test_name> \
-m slow --benchmark-only --benchmark-disable-gc \
--benchmark-cprofile=cumtime --benchmark-cprofile-top=30
IMPORTANT: By default, --benchmark-cprofile profiles a single iteration
of the benchmark function, which produces near-zero times for fast code (everything
shows 0.0000). Use --benchmark-cprofile-loops=N to run the profiled code N times,
so that enough time accumulates per function for cProfile to report meaningful
cumulative times:
uv run pytest <benchmark_file>::<TestClass>::<test_name> \
-m slow --benchmark-only --benchmark-disable-gc \
--benchmark-cprofile=cumtime --benchmark-cprofile-top=30 \
--benchmark-cprofile-loops=1000
Choose N so the total profiled time is at least 0.5–1s — this gives enough
resolution to distinguish real hotspots from noise. For very fast functions
(~100µs), use --benchmark-cprofile-loops=5000 or more.
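The choice of N is simple arithmetic: divide the target profiled time by the per-call time from the benchmark. A minimal sketch, with a hypothetical helper name, and assuming the per-call time is taken from an unprofiled benchmark run (cProfile overhead will make the real profiled time somewhat longer):

```python
import math


def cprofile_loops(per_call_us: float, target_seconds: float = 0.5) -> int:
    """Smallest loop count whose total run time reaches target_seconds."""
    if per_call_us <= 0:
        raise ValueError("per_call_us must be positive")
    target_us = target_seconds * 1_000_000
    return max(1, math.ceil(target_us / per_call_us))


print(cprofile_loops(100))                      # 100 us per call, 0.5 s target
print(cprofile_loops(100, target_seconds=1.0))  # same call, 1 s target
```

A 100 µs function needs 5000 loops to reach 0.5 s and 10000 loops to reach 1 s, which matches the guidance above.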
Visual profiling
Generate .prof files and visualize with snakeviz or speedscope:
uv run pytest <benchmark_file>::<test_name> \
-m slow --benchmark-only --benchmark-disable-gc \
--benchmark-cprofile=cumtime --benchmark-cprofile-loops=1000 \
--benchmark-cprofile-dump=/tmp/bench
# Interactive flamegraph in browser
uv pip install snakeviz
uv run snakeviz /tmp/bench-<test_name>.prof
Memory flamegraphs with memray
uv pip install memray
uv run memray run -o /tmp/mem.bin script.py
uv run memray flamegraph /tmp/mem.bin -o /tmp/mem.html
open /tmp/mem.html # or xdg-open on Linux
Anti-patterns to avoid
- Premature optimization: Profile first. The bottleneck is almost never where you think it is.
- Micro-benchmarking in isolation: A function that's fast in isolation may be slow in context due to cache effects, GC pressure, or contention.
- Optimizing cold paths: Focus on code that runs frequently. A 10x speedup on code that runs once at startup is worth less than a 2x speedup on a hot loop.
- Breaking the API for speed: Prefer internal optimizations that don't change the public interface.
- Trusting a single measurement: Use pedantic() with multiple rounds. Compare means AND standard deviations. A 5% improvement with 20% stddev is noise.
- Bundling multiple changes: One optimization per commit. If you combine three changes and get a 15% speedup, you don't know which change helped (or if one actually regressed and the others compensated).
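The "compare means AND standard deviations" rule can be turned into a rough check. The following is a heuristic sketch, not a formal statistical test: the function name and the threshold k=2 are assumptions chosen for illustration.

```python
def is_beyond_noise(base_mean, base_stddev, new_mean, new_stddev, k=2.0):
    """Heuristic: the gap between means must exceed k times the combined stddev."""
    improvement = base_mean - new_mean
    combined_noise = (base_stddev ** 2 + new_stddev ** 2) ** 0.5
    return improvement > k * combined_noise


# A 5% improvement with 20% stddev on both runs: treated as noise.
print(is_beyond_noise(100.0, 20.0, 95.0, 20.0))
# A 20% improvement with 1% stddev on both runs: treated as real.
print(is_beyond_noise(100.0, 1.0, 80.0, 1.0))
```

A stricter or looser k may suit your benchmarks. The point is to make the decision rule explicit rather than judging by eye.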
Checklist
Before declaring an optimization complete:
- Baseline benchmark saved before any changes
- Each optimization is a separate, focused change
- Benchmark comparison shows measurable improvement (beyond noise/stddev)
- Full test suite passes
- Progress log updated with before/after numbers
- No public API changes (or changes are documented)