Slow Query Troubleshooting: Identification, Profiling, and Optimization


Slow queries are among the most common database performance problems. This article presents a systematic workflow: identify the slowest queries, profile them, then implement and verify optimizations.

Step 1: Identify Slow Queries

Using pg_stat_statements

Enable the extension and query for the worst performers:




```sql
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 20 queries by total execution time
SELECT queryid,
       LEFT(query, 100) AS query_preview,
       calls,
       ROUND(total_exec_time::numeric, 2) AS total_ms,
       ROUND(mean_exec_time::numeric, 2) AS avg_ms,
       ROUND(min_exec_time::numeric, 2) AS min_ms,
       ROUND(max_exec_time::numeric, 2) AS max_ms,
       ROUND(stddev_exec_time::numeric, 2) AS stddev_ms,
       rows,
       shared_blks_hit,
       shared_blks_read,
       shared_blks_dirtied,
       shared_blks_written
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_stat%'
ORDER BY total_exec_time DESC
LIMIT 20;
```





Focus on queries with:


* High `total_exec_time`: Consuming the most database time overall.

* High `mean_exec_time`: Individually slow queries.

* High `stddev_exec_time`: Performance varies wildly (plan instability).

* Low cache hit ratio: `shared_blks_hit / NULLIF(shared_blks_hit + shared_blks_read, 0) < 0.99` (cast to numeric to avoid integer division; see the query after this list).
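
Building on that last point, a quick way to surface the statements with the worst cache behavior is to compute the ratio directly. A minimal sketch; the 0.99 cutoff is the same illustrative threshold as above:

```sql
-- Statements reading the most blocks from disk relative to cache hits
SELECT queryid,
       LEFT(query, 80) AS query_preview,
       calls,
       shared_blks_hit,
       shared_blks_read,
       ROUND(shared_blks_hit::numeric
             / NULLIF(shared_blks_hit + shared_blks_read, 0), 4) AS hit_ratio
FROM pg_stat_statements
WHERE shared_blks_hit::numeric
      / NULLIF(shared_blks_hit + shared_blks_read, 0) < 0.99
ORDER BY shared_blks_read DESC
LIMIT 20;
```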


Using pg_stat_activity




```sql
-- Queries currently active for more than 5 seconds
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '5 seconds'
ORDER BY duration DESC;
```





External Tools




```bash
# pgBadger: parse PostgreSQL logs for slow queries
pgbadger /var/log/postgresql/postgresql.log -o report.html

# pgFincore: inspect how much of each relation sits in the OS page cache
# pgbouncer logs: pool-level insight (wait times, saturation)
```
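
pgBadger can only report what the server actually logs, so make sure slow statements are being written to the log first. A minimal sketch, assuming superuser access; the 500 ms threshold is illustrative:

```sql
-- Log every statement that runs longer than 500 ms
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();
```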





Step 2: Profile the Problem Query

Once identified, profile the slow query:




```sql
-- Reset statistics (this form clears all entries)
SELECT pg_stat_statements_reset();

-- Run the query once under EXPLAIN
EXPLAIN (ANALYZE, BUFFERS, TIMING)
SELECT u.email, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2026-01-01'
GROUP BY u.email
ORDER BY order_count DESC
LIMIT 100;
```





What to Look For in the Plan


1. **Seq Scan on a large table** when only a few rows are needed → Add an index.
2. **Large discrepancy between estimated rows and actual rows** → Run `ANALYZE`.
3. **Sort node with high memory or disk usage** → Increase `work_mem` or add an index.
4. **Nested Loop with many iterations** → May need a different join strategy.
5. **"Rows Removed by Filter" much larger than returned rows** → Add a partial or better-targeted index (see the sketch after this list).
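
Items 2 and 5 are quick to act on. A minimal sketch; the `status` column and the partial-index predicate are hypothetical, so adapt them to whatever filter your plan actually shows:

```sql
-- Item 2: refresh planner statistics when estimates are far off
ANALYZE orders;

-- Item 5: a partial index covers only the rows the hot filter keeps
-- (the status column and predicate are illustrative)
CREATE INDEX CONCURRENTLY idx_orders_pending_user
    ON orders (user_id)
    WHERE status = 'pending';
```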

Plan Analysis Example




```
Sort  (cost=1234.56..1289.01 rows=21780 width=42)
  Sort Key: (count(o.id)) DESC
  Sort Method: external merge  Disk: 1234kB
  ->  HashAggregate  (cost=456.78..789.01 rows=21780 width=42)
        ->  Hash Join  (cost=123.45..345.67 rows=22000 width=34)
              Hash Cond: (u.id = o.user_id)
              ->  Seq Scan on users u  (cost=0.00..123.45 rows=3000 width=26)
                    Filter: (created_at > '2026-01-01')
              ->  Hash  (cost=67.89..67.89 rows=4567 width=16)
                    ->  Seq Scan on orders o  (cost=0.00..67.89 rows=4567 width=16)
```





Problems identified:


* `Sort Method: external merge Disk`: Sort spilled to disk, `work_mem` too low.

* `Seq Scan on orders`: No index on `orders.user_id`.

* With an index on `orders.user_id`, the planner can replace the sequential scan on `orders` with an index (or index-only) scan, or switch to a nested loop join when the user filter is selective.


Step 3: Implement the Fix

Add Missing Index




```sql
-- The most common fix
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);
```
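
The other problem the plan showed was the sort spilling to disk. That is a memory setting rather than an index, and it can be scoped narrowly; the 64MB value and the `reporting_user` role below are illustrative:

```sql
-- Try more sort/hash memory in the current session, then re-run EXPLAIN ANALYZE
SET work_mem = '64MB';

-- If it helps, persist it only for the role that runs this workload
ALTER ROLE reporting_user SET work_mem = '64MB';
```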





Optimize the Query




```sql
-- Before: join plus GROUP BY touches every matching order row
SELECT u.email, COUNT(o.id) AS order_count
FROM users u LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2026-01-01'
GROUP BY u.email
ORDER BY order_count DESC;

-- After: a correlated scalar subquery counts per user and avoids the GROUP BY
-- (use EXISTS instead if you only need to check that an order exists)
SELECT u.email,
       (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_count
FROM users u
WHERE u.created_at > '2026-01-01'
ORDER BY order_count DESC;
```





Rewrite Complex Joins




```sql
-- Before: slow correlated subqueries
SELECT p.*,
       (SELECT AVG(rating) FROM reviews WHERE product_id = p.id) AS avg_rating,
       (SELECT COUNT(*) FROM orders WHERE product_id = p.id) AS order_count
FROM products p
WHERE p.category = 'electronics';

-- After: use LATERAL joins
SELECT p.*, r.avg_rating, o.order_count
FROM products p
LEFT JOIN LATERAL (
    SELECT AVG(rating) AS avg_rating
    FROM reviews WHERE product_id = p.id
) r ON true
LEFT JOIN LATERAL (
    SELECT COUNT(*) AS order_count
    FROM orders WHERE product_id = p.id
) o ON true
WHERE p.category = 'electronics';
```
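
Either version still probes `reviews` and `orders` once per product, so those lookups should be index-backed. A sketch under that assumption; the index names are illustrative:

```sql
-- Let each per-product lookup use an index (or index-only) scan
CREATE INDEX CONCURRENTLY idx_reviews_product_rating ON reviews (product_id, rating);
CREATE INDEX CONCURRENTLY idx_orders_product_id ON orders (product_id);
```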





Step 4: Verify the Improvement




```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.email, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2026-01-01'
GROUP BY u.email
ORDER BY order_count DESC
LIMIT 100;

-- Compare: total cost, actual time, buffers, sort method
```





Before and After Comparison

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Execution time | 12.4 seconds | 0.3 seconds | 97% |
| Buffers | 84,000 shared hit | 2,500 shared hit | 97% |
| Sort method | external merge (disk) | quicksort (memory) | In-memory |
| Scan type | Seq Scan on orders | Index Scan | Added index |

Step 5: Prevent Regression

Capture Baseline




```sql
-- Inspect current statistics before snapshotting them
SELECT queryid, query, mean_exec_time, calls, rows
FROM pg_stat_statements
ORDER BY queryid;

-- Store a snapshot in a monitoring table for trend analysis
CREATE TABLE query_performance_baseline AS
SELECT now() AS captured_at, *
FROM pg_stat_statements;
```
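
Later snapshots can be appended to the same table on whatever schedule your monitoring already uses (cron, pg_cron, or an agent):

```sql
-- Append a fresh snapshot; run this periodically
INSERT INTO query_performance_baseline
SELECT now() AS captured_at, *
FROM pg_stat_statements;
```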





Set Up Alerts




```sql
-- Create a function that flags queries running at least twice as slow as their latest baseline
CREATE OR REPLACE FUNCTION check_query_regression()
RETURNS TABLE(query TEXT, avg_time_ms NUMERIC, calls BIGINT) AS $$
    SELECT LEFT(pss.query, 200),
           ROUND(pss.mean_exec_time::numeric, 2),
           pss.calls
    FROM pg_stat_statements pss
    WHERE pss.mean_exec_time > 2 * (
            SELECT qpb.mean_exec_time
            FROM query_performance_baseline qpb
            WHERE qpb.queryid = pss.queryid
            ORDER BY qpb.captured_at DESC
            LIMIT 1
          )
      AND pss.calls > 100
    ORDER BY pss.mean_exec_time DESC
    LIMIT 10;
$$ LANGUAGE sql;
```
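
A monitoring job can then call the function and alert whenever it returns rows:

```sql
-- Any rows returned are candidate regressions worth investigating
SELECT * FROM check_query_regression();
```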





Optimization Workflow Summary


1. **Identify**: Find the top slow queries (`pg_stat_statements`).
2. **Profile**: Run `EXPLAIN (ANALYZE, BUFFERS, TIMING)`.
3. **Diagnose**: Identify the bottleneck node in the plan.
4. **Fix**: Add an index, rewrite the query, update statistics, or tune parameters.
5. **Verify**: Re-run `EXPLAIN ANALYZE` to confirm the improvement.
6. **Monitor**: Track query performance over time to catch regressions.

Most slow queries are fixed by adding the right index. When that is not enough, analyze the plan for unexpected join strategies, inefficient scans, or plan instability. A systematic approach prevents guessing and ensures each optimization has a measurable impact.