# Performance Profiling: perf, Flamegraphs, py-spy, pprof

## Introduction
Performance profiling identifies where your application spends its time — CPU, memory, I/O, or blocking. Without profiling, optimization is guesswork. This article covers four profiling approaches: perf for system-level Linux profiling, flamegraphs for visualization, py-spy for Python without code changes, and pprof for Go applications.
## perf (Linux Profiler)

The built-in Linux profiler for CPU, hardware events, and tracepoints:
```bash
# CPU profiling
perf record -F 99 -g ./myapp                   # Sample at 99 Hz with call graphs
perf record -F 99 -p PID -g -- sleep 30        # Profile running process for 30s
perf report --stdio                            # Text report
perf report -g graph                           # Call graph report

# Common events
perf stat ./myapp                              # Execution statistics
perf stat -e cache-misses ./myapp              # Cache miss analysis
perf stat -e branch-misses ./myapp             # Branch prediction
perf stat -e context-switches -p PID           # Context switch monitoring

# Hardware event sampling
perf record -e cycles -F 99 -a -g -- sleep 10  # System-wide CPU sampling

# Tracepoints
perf record -e sched:sched_switch -a -g        # Context switch tracing
perf record -e syscalls:sys_enter_write -a     # Write syscall tracing

# Top-like live view
perf top -p PID
perf top -e cache-misses

# Generate flamegraph data
perf script > out.perf
```
**Key metrics**: `cycles` for CPU time, `cache-misses` for memory bottleneck detection, `context-switches` for contention issues.
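The examples above sample at 99 Hz rather than a round 100 Hz so the profiler never runs in lockstep with timer-driven work. A small self-contained simulation (with invented numbers: a task that is busy for 1 ms out of every 10 ms) shows the aliasing effect:

```python
def hit_fraction(rate_hz, task_period_us=10_000, busy_us=1_000, duration_s=10):
    """Fraction of samples that land inside the task's 1 ms busy window."""
    step_us = round(1_000_000 / rate_hz)  # microseconds between samples
    n = duration_s * rate_hz              # total number of samples
    hits = sum(1 for k in range(n) if (k * step_us) % task_period_us < busy_us)
    return hits / n

# A 100 Hz profiler samples the 10 ms-period task at the same phase every
# time, so it wrongly reports the task as 100% busy; 99 Hz drifts across
# phases and recovers something close to the true 10% duty cycle.
print(hit_fraction(100))  # 1.0
print(hit_fraction(99))   # ~0.092, close to the true 10% duty cycle
```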
## Flamegraphs

Brendan Gregg's visualization for profiler output:
```bash
# Install FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph

# Generate flamegraph from perf data
perf script | ./FlameGraph/stackcollapse-perf.pl > before.folded
./FlameGraph/flamegraph.pl before.folded > flamegraph.svg

# Generate differential flamegraph (before/after)
# After optimization:
perf script | ./FlameGraph/stackcollapse-perf.pl > optimized.folded
./FlameGraph/difffolded.pl before.folded optimized.folded | ./FlameGraph/flamegraph.pl > diff.svg
```
**Reading flamegraphs**: The x-axis shows the stack profile population sorted alphabetically, not the passage of time. Each rectangle is a function; its width is the fraction of samples in which that function was on-CPU. The y-axis is stack depth. Look for wide rectangles at the tops of the stacks: those are the hot functions.
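The `.folded` files that `stackcollapse-perf.pl` produces and `flamegraph.pl` consumes are plain text: one semicolon-separated stack per line, followed by a sample count. A minimal sketch (a hypothetical helper, not part of the FlameGraph tools) that totals samples by leaf function:

```python
from collections import Counter

def hot_leaves(folded_lines):
    """Sum sample counts by leaf (top-of-stack) function."""
    totals = Counter()
    for line in folded_lines:
        stack, _, count = line.rpartition(" ")
        totals[stack.split(";")[-1]] += int(count)
    return totals.most_common()

sample = [
    "main;parse;read_file 40",
    "main;parse;tokenize 25",
    "main;render;read_file 35",
]
print(hot_leaves(sample))  # [('read_file', 75), ('tokenize', 25)]
```

This is the same aggregation a flamegraph performs visually: `read_file` appears under two different parents, but its total width (75 samples) is what marks it as hot.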
**For other languages**:
```bash
# JavaScript (Node.js)
node --perf-basic-prof app.js
perf script | ./FlameGraph/stackcollapse-perf.pl > out.folded

# Python with py-spy
py-spy record -o profile.svg --pid $PID

# Go with pprof
go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=30'
```
## py-spy

Sampling profiler for Python without modifying code:
```bash
# Installation
pip install py-spy

# Profile a running process, or launch and profile a script
py-spy record -o profile.svg --pid 12345
py-spy record -o profile.svg -- python myapp.py

# Top-like live view
py-spy top --pid 12345

# Dump current stack traces
py-spy dump --pid 12345

# Profile specific duration
py-spy record -o profile.svg --pid 12345 --duration 30

# Include subprocesses
py-spy record -o profile.svg --subprocesses -- python myapp.py

# Native frames
py-spy record --native -o profile.svg --pid 12345

# Save raw data for later analysis
py-spy record -o profile.raw --pid 12345 --format raw
```
**Key advantages**: No code changes required, works with running processes, safe for production (read-only), native code frame support.
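To try py-spy without a real workload, a throwaway script with an obvious hotspot works well (the function names here are invented for illustration). Run it, attach with `py-spy top --pid <pid>` from another terminal, and `slow_hash` should dominate the samples:

```python
import time

def slow_hash(data):
    # Deliberately quadratic: the hotspot py-spy should surface.
    h = 0
    for i in range(len(data)):
        for j in range(len(data)):
            h = (h * 31 + data[i] * data[j]) % 1_000_003
    return h

def main():
    data = list(range(300))
    deadline = time.time() + 2  # kept short for demonstration
    while time.time() < deadline:
        result = slow_hash(data)
    return result

if __name__ == "__main__":
    print(main())
```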
## pprof (Go)

Go's built-in profiling tool:
```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers
)

func main() {
	// Start pprof HTTP server
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Your application code...
}
```
```bash
# Collect profiles
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'  # CPU
go tool pprof http://localhost:6060/debug/pprof/heap                  # Memory
go tool pprof http://localhost:6060/debug/pprof/goroutine             # Goroutines
go tool pprof http://localhost:6060/debug/pprof/block                 # Blocking (needs runtime.SetBlockProfileRate)
go tool pprof http://localhost:6060/debug/pprof/mutex                 # Mutex contention (needs runtime.SetMutexProfileFraction)

# Interactive exploration
go tool pprof cpu.pprof
(pprof) top10         # Top 10 functions
(pprof) list myFunc   # Source with line-level timing
(pprof) web           # Open in browser (requires graphviz)
(pprof) pdf           # Generate PDF
(pprof) peek myFunc   # Caller/callee view

# Web interface
go tool pprof -http=:8080 cpu.pprof

# Allocations profiling
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/allocs

# Compare profiles
go tool pprof -http=:8080 -diff_base=before.pprof after.pprof
```
## Profiling Workflow
```bash
# 1. Identify the problem (slow response, high CPU, OOM)

# 2. Profile without optimization
perf record -F 99 -p $(pgrep myapp) -g -- sleep 30

# 3. Generate flamegraph (keep the folded file for the diff later)
perf script | stackcollapse-perf.pl > before.folded
flamegraph.pl before.folded > before.svg

# 4. Make the optimization

# 5. Profile again with the same parameters
perf record -F 99 -p $(pgrep myapp) -g -- sleep 30
perf script | stackcollapse-perf.pl > after.folded
flamegraph.pl after.folded > after.svg

# 6. Create differential flamegraph
./difffolded.pl before.folded after.folded | flamegraph.pl > diff.svg
```
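Conceptually, `difffolded.pl` just merges the two folded files and emits both counts per stack, so `flamegraph.pl` can color each frame by the delta. A hedged Python sketch of that merge (not the actual Perl implementation):

```python
def diff_folded(before_lines, after_lines):
    """Merge two folded-stack files into 'stack count_before count_after' lines."""
    def parse(lines):
        counts = {}
        for line in lines:
            stack, _, n = line.rpartition(" ")
            counts[stack] = counts.get(stack, 0) + int(n)
        return counts

    before, after = parse(before_lines), parse(after_lines)
    return [
        f"{stack} {before.get(stack, 0)} {after.get(stack, 0)}"
        for stack in sorted(before.keys() | after.keys())
    ]

print(diff_folded(["main;work 90", "main;log 10"],
                  ["main;work 40", "main;log 10"]))
# ['main;log 10 10', 'main;work 90 40']
```

Stacks whose count dropped (here `main;work`, 90 to 40) render as improvements; stacks present in only one file get a zero on the other side.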
## Comparison
| Tool | Language | Overhead | Best For |
|------|----------|----------|----------|
| perf | Any (system) | Low | CPU, cache misses, syscalls |
| FlameGraph | Any (post-processing) | None | Visual hotspot identification, before/after comparison |
| py-spy | Python | Very low | Production Python profiling |
| pprof | Go | Low | Go CPU, memory, goroutines |
## Recommendations
* **Initial investigation**: Use `perf top` to quickly identify CPU hotspots.
* **Detailed analysis**: Collect perf data and generate flamegraphs for visual hotspot identification.
* **Python profiling**: Use py-spy for production-safe sampling without code changes.
* **Go profiling**: Use pprof with its web interface for interactive exploration.
* **Comparison**: Use differential flamegraphs to verify optimization impact.
Profiling is an iterative process: identify hotspots, form a hypothesis, make a change, and re-profile to verify improvement. Flamegraphs make this loop faster by providing immediate visual feedback on where time is spent.