Performance Profiling: perf, Flamegraphs, py-spy, pprof

Introduction

Performance profiling identifies where your application spends its time — CPU, memory, I/O, or blocking. Without profiling, optimization is guesswork. This article covers four profiling approaches: perf for system-level Linux profiling, flamegraphs for visualization, py-spy for profiling Python without code changes, and pprof for Go applications.





perf (Linux Profiler)

The built-in Linux profiler for CPU, hardware events, and tracepoints:

# CPU profiling
perf record -F 99 -g ./myapp # Sample at 99Hz with call graphs
perf record -F 99 -p PID -g -- sleep 30 # Profile a running process for 30s
perf report --stdio # Text report
perf report -g graph # Call-graph report

# Common events
perf stat ./myapp # Execution statistics
perf stat -e cache-misses ./myapp # Cache-miss analysis
perf stat -e branch-misses ./myapp # Branch prediction
perf stat -e context-switches -p PID # Context-switch monitoring

# Hardware event sampling
perf record -e cycles -F 99 -a -g -- sleep 10 # System-wide CPU sampling for 10s

# Tracepoints
perf record -e sched:sched_switch -a -g # Context-switch tracing
perf record -e syscalls:sys_enter_write -a # Write-syscall tracing

# Top-like live view
perf top -p PID
perf top -e cache-misses

# Generate flamegraph data
perf script > out.perf

**Key metrics**: `cycles` for CPU time, `cache-misses` for memory bottleneck detection, `context-switches` for contention issues.
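A single `cache-misses` count is hard to judge on its own; it is most meaningful relative to `cache-references`. A minimal sketch of that ratio, using illustrative counter values rather than real perf output:

```python
# Sketch: derive a cache-miss rate from "perf stat"-style counters.
# The counter values below are illustrative, not from a real run.

def miss_rate(cache_references: int, cache_misses: int) -> float:
    """Fraction of cache references that missed."""
    if cache_references == 0:
        return 0.0
    return cache_misses / cache_references

# As reported by: perf stat -e cache-references,cache-misses ./myapp
refs, misses = 1_200_000, 90_000
print(f"cache-miss rate: {miss_rate(refs, misses):.1%}")  # cache-miss rate: 7.5%
```

A rate in the single digits is typical for cache-friendly code; a rate of tens of percent usually points at a memory-access pattern worth restructuring.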





Flamegraphs

Brendan Gregg's visualization for profiler output:

# Install FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph

# Generate a flamegraph from perf data
perf script | ./FlameGraph/stackcollapse-perf.pl > before.folded
./FlameGraph/flamegraph.pl before.folded > flamegraph.svg

# Generate a differential flamegraph (before/after)
# After optimizing, collect a second profile:
perf script | ./FlameGraph/stackcollapse-perf.pl > after.folded
./FlameGraph/difffolded.pl before.folded after.folded | ./FlameGraph/flamegraph.pl > diff.svg

**Reading flamegraphs**: The x-axis shows stack profile population (not time). Each rectangle is a function call; wider rectangles mean more CPU time. The y-axis is stack depth. Look for wide top rectangles — those are the hot functions.
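Rectangle width is nothing more than aggregated sample counts from the folded file. As a minimal sketch (with made-up stacks and function names), this parses the folded format, one `frame;frame;frame count` entry per line, and ranks leaf frames by samples:

```python
# Sketch: rank the "widest" leaf frames in a folded stack file, i.e. the
# data behind a flamegraph's top rectangles. Input lines look like:
#   main;handle_request;parse_json 245
from collections import Counter

def widest_frames(folded_lines, top=3):
    """Sum samples per leaf frame (top of stack) across all stacks."""
    leaf_samples = Counter()
    for line in folded_lines:
        stack, _, count = line.rpartition(" ")
        leaf_samples[stack.split(";")[-1]] += int(count)
    return leaf_samples.most_common(top)

# Illustrative folded data (hypothetical function names)
data = [
    "main;serve;parse_json 700",
    "main;serve;render 250",
    "main;gc 50",
]
print(widest_frames(data))  # parse_json dominates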





**For other languages**:

# JavaScript (Node.js)
node --perf-basic-prof app.js
perf script | ./FlameGraph/stackcollapse-perf.pl > out.folded

# Python with py-spy
py-spy record -o profile.svg --pid $PID

# Go with pprof
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"







py-spy

Sampling profiler for Python without modifying code:

# Installation
pip install py-spy

# Profile a running process, or launch and profile
py-spy record -o profile.svg --pid 12345
py-spy record -o profile.svg -- python myapp.py

# Top-like live view
py-spy top --pid 12345

# Dump current stack traces
py-spy dump --pid 12345

# Profile for a specific duration
py-spy record -o profile.svg --pid 12345 --duration 30

# Include subprocesses (py-spy flags go before the -- separator)
py-spy record --subprocesses -o profile.svg -- python myapp.py

# Include native (C extension) frames
py-spy record --native -o profile.svg --pid 12345

# Save raw data for later analysis
py-spy record -o profile.raw --format raw --pid 12345

**Key advantages**: No code changes required, works with running processes, safe for production (read-only), native code frame support.
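To see py-spy's output on something predictable, here is a small, deliberately CPU-bound script (the function names are hypothetical) you could point `py-spy record` at; `slow_checksum` should dominate the resulting flamegraph's width:

```python
# Sketch: a deliberately CPU-bound script to try py-spy against, e.g.
#   py-spy record -o profile.svg -- python hotspot_demo.py
# (hotspot_demo.py is a made-up filename for this sketch)

def slow_checksum(data: bytes) -> int:
    # Byte-by-byte loop: intentionally slow, shows up as a wide frame.
    total = 0
    for b in data:
        total = (total * 31 + b) % 1_000_003
    return total

def fast_path(data: bytes) -> int:
    # Cheap by comparison; appears as a narrow frame, if at all.
    return len(data)

def main():
    payload = b"x" * 10_000
    for _ in range(200):
        slow_checksum(payload)
        fast_path(payload)

if __name__ == "__main__":
    main()
```

Because py-spy samples stacks rather than tracing every call, a very cheap function like `fast_path` may not appear in the profile at all; that is expected for a sampling profiler.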





pprof (Go)

Go's built-in profiling tool:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"
)

func main() {
	// Start the pprof HTTP server in a background goroutine
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Your application code...
}








# Collect profiles
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30" # CPU
go tool pprof http://localhost:6060/debug/pprof/heap # Memory
go tool pprof http://localhost:6060/debug/pprof/goroutine # Goroutines
go tool pprof http://localhost:6060/debug/pprof/block # Blocking (requires runtime.SetBlockProfileRate)
go tool pprof http://localhost:6060/debug/pprof/mutex # Mutex contention (requires runtime.SetMutexProfileFraction)

# Interactive exploration
go tool pprof cpu.pprof
(pprof) top10 # Top 10 functions
(pprof) list myFunc # Source with line-level timing
(pprof) web # Open in browser (requires graphviz)
(pprof) pdf # Generate PDF
(pprof) peek myFunc # Caller/callee view

# Web interface
go tool pprof -http=:8080 cpu.pprof

# Allocation profiling
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/allocs

# Compare profiles
go tool pprof -http=:8080 -diff_base=before.pprof after.pprof







Profiling Workflow

# 1. Identify the problem (slow response, high CPU, OOM)

# 2. Profile before optimizing
perf record -F 99 -p $(pgrep myapp) -g -- sleep 30

# 3. Generate a flamegraph (keep the folded file for the diff later)
perf script | stackcollapse-perf.pl > before.folded
flamegraph.pl before.folded > before.svg

# 4. Make the optimization

# 5. Profile again with the same parameters
perf record -F 99 -p $(pgrep myapp) -g -- sleep 30
perf script | stackcollapse-perf.pl > after.folded
flamegraph.pl after.folded > after.svg

# 6. Create a differential flamegraph
./difffolded.pl before.folded after.folded | flamegraph.pl > diff.svg
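The diffing step itself is simple enough to sketch in a few lines of Python: parse each folded file into per-stack sample counts and subtract, which is the core of what `difffolded.pl` does. The stacks and counts below are illustrative:

```python
# Sketch: per-stack sample deltas between two folded profiles,
# the same idea as FlameGraph's difffolded.pl.

def parse_folded(lines):
    counts = {}
    for line in lines:
        stack, _, n = line.rpartition(" ")
        counts[stack] = counts.get(stack, 0) + int(n)
    return counts

def diff_folded(before_lines, after_lines):
    """Return {stack: after - before}; negative means samples removed."""
    before = parse_folded(before_lines)
    after = parse_folded(after_lines)
    return {s: after.get(s, 0) - before.get(s, 0)
            for s in before.keys() | after.keys()}

# Illustrative contents of before.folded / after.folded
before = ["main;serve;parse_json 700", "main;serve;render 250"]
after  = ["main;serve;parse_json 150", "main;serve;render 240"]
print(diff_folded(before, after))  # parse_json shrank by 550 samples
```

One caveat the real tool shares: the two profiles should be collected with the same sampling rate and duration, or the raw deltas are not comparable.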







Comparison

| Tool | Language | Overhead | Best For |
|------|----------|----------|----------|
| perf | Any (system) | Low | CPU, cache misses, syscalls |
| FlameGraph | Any (visualization/post-processing) | None | Visual hotspot identification, before/after diffs |
| py-spy | Python | Very low | Production Python profiling |
| pprof | Go | Low | Go CPU, memory, goroutines |





Recommendations




* **Initial investigation**: Use `perf top` to quickly identify CPU hotspots.

* **Detailed analysis**: Collect perf data and generate flamegraphs for visual hotspot identification.

* **Python profiling**: Use py-spy for production-safe sampling without code changes.

* **Go profiling**: Use pprof with its web interface for interactive exploration.

* **Comparison**: Use differential flamegraphs to verify optimization impact.




Profiling is an iterative process: identify hotspots, form a hypothesis, make a change, and re-profile to verify improvement. Flamegraphs make this loop faster by providing immediate visual feedback on where time is spent.