# Performance Profiling: perf, Flamegraphs, py-spy, pprof

## Introduction
Performance profiling identifies where your application spends its time — CPU, memory, I/O, or blocking. Without profiling, optimization is guesswork. This article covers four profiling approaches: perf for system-level Linux profiling, flamegraphs for visualization, py-spy for Python without code changes, and pprof for Go applications.
## perf (Linux Profiler)

The built-in Linux profiler for CPU, hardware events, and tracepoints:
```bash
# CPU profiling
perf record -F 99 -g ./myapp                   # Sample at 99 Hz with call graphs
perf record -F 99 -p PID -g -- sleep 30        # Profile running process for 30s
perf report --stdio                            # Text report
perf report -g graph                           # Call graph report

# Common events
perf stat ./myapp                              # Execution statistics
perf stat -e cache-misses ./myapp              # Cache miss analysis
perf stat -e branch-misses ./myapp             # Branch prediction
perf stat -e context-switches -p PID           # Context switch monitoring

# Hardware event sampling
perf record -e cycles -F 99 -a -g -- sleep 10  # System-wide CPU sampling

# Tracepoints
perf record -e sched:sched_switch -a -g        # Context switch tracing
perf record -e syscalls:sys_enter_write -a     # Write syscall tracing

# Top-like live view
perf top -p PID
perf top -e cache-misses

# Generate flamegraph data
perf script > out.perf
```
**Key metrics**: `cycles` for CPU time, `cache-misses` for memory bottleneck detection, `context-switches` for contention issues.
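The examples above sample at 99 Hz rather than a round 100 Hz so the profiler never runs in lockstep with timer-driven work. A small self-contained simulation (with invented numbers: a task that is busy for 1 ms out of every 10 ms) shows the aliasing effect:

```python
def hit_fraction(rate_hz, task_period_us=10_000, busy_us=1_000, duration_s=10):
    """Fraction of samples that land inside the task's 1 ms busy window."""
    step_us = round(1_000_000 / rate_hz)  # microseconds between samples
    n = duration_s * rate_hz              # total number of samples
    hits = sum(1 for k in range(n) if (k * step_us) % task_period_us < busy_us)
    return hits / n

# A 100 Hz profiler samples the 10 ms-period task at the same phase every
# time, so it wrongly reports the task as 100% busy; 99 Hz drifts across
# phases and recovers something close to the true 10% duty cycle.
print(hit_fraction(100))  # 1.0
print(hit_fraction(99))   # ~0.092, close to the true 10% duty cycle
```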
## Flamegraphs

Brendan Gregg's visualization for profiler output:
```bash
# Install FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph

# Generate flamegraph from perf data
perf script | ./FlameGraph/stackcollapse-perf.pl > before.folded
./FlameGraph/flamegraph.pl before.folded > flamegraph.svg

# Generate differential flamegraph (before/after)
# After optimization:
perf script | ./FlameGraph/stackcollapse-perf.pl > optimized.folded
./FlameGraph/difffolded.pl before.folded optimized.folded | ./FlameGraph/flamegraph.pl > diff.svg
```
**Reading flamegraphs**: The x-axis shows the stack profile population sorted alphabetically, not the passage of time. Each rectangle is a function; its width is the fraction of samples in which that function was on-CPU. The y-axis is stack depth. Look for wide rectangles at the tops of the stacks: those are the hot functions.
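The `.folded` files that `stackcollapse-perf.pl` produces and `flamegraph.pl` consumes are plain text: one semicolon-separated stack per line, followed by a sample count. A minimal sketch (a hypothetical helper, not part of the FlameGraph tools) that totals samples by leaf function:

```python
from collections import Counter

def hot_leaves(folded_lines):
    """Sum sample counts by leaf (top-of-stack) function."""
    totals = Counter()
    for line in folded_lines:
        stack, _, count = line.rpartition(" ")
        totals[stack.split(";")[-1]] += int(count)
    return totals.most_common()

sample = [
    "main;parse;read_file 40",
    "main;parse;tokenize 25",
    "main;render;read_file 35",
]
print(hot_leaves(sample))  # [('read_file', 75), ('tokenize', 25)]
```

This is the same aggregation a flamegraph performs visually: `read_file` appears under two different parents, but its total width (75 samples) is what marks it as hot.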
**For other languages**:
```bash
# JavaScript (Node.js)
node --perf-basic-prof app.js
perf script | ./FlameGraph/stackcollapse-perf.pl > out.folded

# Python with py-spy
py-spy record -o profile.svg --pid $PID

# Go with pprof
go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=30'
```
## py-spy

Sampling profiler for Python without modifying code:
```bash
# Installation
pip install py-spy

# Profile a running process, or launch and profile a script
py-spy record -o profile.svg --pid 12345
py-spy record -o profile.svg -- python myapp.py

# Top-like live view
py-spy top --pid 12345

# Dump current stack traces
py-spy dump --pid 12345

# Profile specific duration
py-spy record -o profile.svg --pid 12345 --duration 30

# Include subprocesses
py-spy record -o profile.svg --subprocesses -- python myapp.py

# Native frames
py-spy record --native -o profile.svg --pid 12345

# Save raw data for later analysis
py-spy record -o profile.raw --pid 12345 --format raw
```
**Key advantages**: No code changes required, works with running processes, safe for production (read-only), native code frame support.
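To try py-spy without a real workload, a throwaway script with an obvious hotspot works well (the function names here are invented for illustration). Run it, attach with `py-spy top --pid <pid>` from another terminal, and `slow_hash` should dominate the samples:

```python
import time

def slow_hash(data):
    # Deliberately quadratic: the hotspot py-spy should surface.
    h = 0
    for i in range(len(data)):
        for j in range(len(data)):
            h = (h * 31 + data[i] * data[j]) % 1_000_003
    return h

def main():
    data = list(range(300))
    deadline = time.time() + 2  # kept short for demonstration
    while time.time() < deadline:
        result = slow_hash(data)
    return result

if __name__ == "__main__":
    print(main())
```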
## pprof (Go)

Go's built-in profiling tool:
```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers
)

func main() {
	// Start pprof HTTP server
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Your application code...
}
```
```bash
# Collect profiles
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'  # CPU
go tool pprof http://localhost:6060/debug/pprof/heap                  # Memory
go tool pprof http://localhost:6060/debug/pprof/goroutine             # Goroutines
go tool pprof http://localhost:6060/debug/pprof/block                 # Blocking (needs runtime.SetBlockProfileRate)
go tool pprof http://localhost:6060/debug/pprof/mutex                 # Mutex contention (needs runtime.SetMutexProfileFraction)

# Interactive exploration
go tool pprof cpu.pprof
(pprof) top10         # Top 10 functions
(pprof) list myFunc   # Source with line-level timing
(pprof) web           # Open in browser (requires graphviz)
(pprof) pdf           # Generate PDF
(pprof) peek myFunc   # Caller/callee view

# Web interface
go tool pprof -http=:8080 cpu.pprof

# Allocations profiling
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/allocs

# Compare profiles
go tool pprof -http=:8080 -diff_base=before.pprof after.pprof
```
## Profiling Workflow
```bash
# 1. Identify the problem (slow response, high CPU, OOM)

# 2. Profile without optimization
perf record -F 99 -p $(pgrep myapp) -g -- sleep 30

# 3. Generate flamegraph (keep the folded file for the diff later)
perf script | stackcollapse-perf.pl > before.folded
flamegraph.pl before.folded > before.svg

# 4. Make the optimization

# 5. Profile again with the same parameters
perf record -F 99 -p $(pgrep myapp) -g -- sleep 30
perf script | stackcollapse-perf.pl > after.folded
flamegraph.pl after.folded > after.svg

# 6. Create differential flamegraph
./difffolded.pl before.folded after.folded | flamegraph.pl > diff.svg
```
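Conceptually, `difffolded.pl` just merges the two folded files and emits both counts per stack, so `flamegraph.pl` can color each frame by the delta. A hedged Python sketch of that merge (not the actual Perl implementation):

```python
def diff_folded(before_lines, after_lines):
    """Merge two folded-stack files into 'stack count_before count_after' lines."""
    def parse(lines):
        counts = {}
        for line in lines:
            stack, _, n = line.rpartition(" ")
            counts[stack] = counts.get(stack, 0) + int(n)
        return counts

    before, after = parse(before_lines), parse(after_lines)
    return [
        f"{stack} {before.get(stack, 0)} {after.get(stack, 0)}"
        for stack in sorted(before.keys() | after.keys())
    ]

print(diff_folded(["main;work 90", "main;log 10"],
                  ["main;work 40", "main;log 10"]))
# ['main;log 10 10', 'main;work 90 40']
```

Stacks whose count dropped (here `main;work`, 90 to 40) render as improvements; stacks present in only one file get a zero on the other side.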
## Comparison
| Tool | Language | Overhead | Best For |
|------|----------|----------|----------|
| perf | Any (system) | Low | CPU, cache misses, syscalls |
| FlameGraph | Any (post-processing) | None | Visual hotspot identification, before/after comparison |
| py-spy | Python | Very low | Production Python profiling |
| pprof | Go | Low | Go CPU, memory, goroutines |
## Recommendations
* **Initial investigation**: Use `perf top` to quickly identify CPU hotspots.
* **Detailed analysis**: Collect perf data and generate flamegraphs for visual hotspot identification.
* **Python profiling**: Use py-spy for production-safe sampling without code changes.
* **Go profiling**: Use pprof with its web interface for interactive exploration.
* **Comparison**: Use differential flamegraphs to verify optimization impact.
Profiling is an iterative process: identify hotspots, form a hypothesis, make a change, and re-profile to verify improvement. Flamegraphs make this loop faster by providing immediate visual feedback on where time is spent.