Full-Text Search in PostgreSQL: tsvector, tsquery, GIN Indexes
Full-Text Search in PostgreSQL: tsvector, tsquery, GIN Indexes
PostgreSQL's full-text search (FTS) provides built-in text search capabilities without external dependencies. While not as feature-rich as Elasticsearch or Meilisearch, it handles a large class of search needs efficiently.
The FTS Pipeline
Full-text search in PostgreSQL follows this flow:
* **Parsing**: Break text into tokens (lexemes).
2\. **Normalization**: Convert tokens to a standard form via a dictionary (stemming, stop words). 3\. **Indexing**: Store the normalized tokens in a `tsvector`. 4\. **Querying**: Match a `tsquery` against the `tsvector` using a GIN index.
tsvector and tsquery
tsvector
A `tsvector` is a sorted list of distinct lexemes with positional information:
SELECT to_tsvector('english',
'The quick brown fox jumps over the lazy dog');
-- 'brown':3 'dog':9 'fox':4 'jump':5 'lazy':8 'quick':2
Notice "jumps" became "jump" (stemming), and "the", "over" are removed (stop words).
tsquery
A `tsquery` represents a search query:
SELECT to_tsquery('english', 'fox & dog');
-- 'fox' & 'dog'
SELECT to_tsquery('english', 'jump | run');
-- 'jump' | 'run'
SELECT to_tsquery('english', '!cat');
-- !'cat'
SELECT to_tsquery('english', 'quick <-> brown');
-- 'quick' <-> 'brown' (adjacency: quick followed by brown)
Basic Search
SELECT title, body
FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'database & performance');
Creating a Search Column
For production use, store the `tsvector` in a generated column to avoid repeated conversion:
ALTER TABLE articles
ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (
to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
) STORED;
CREATE INDEX idx_articles_search ON articles USING GIN (search_vector);
Now queries become:
SELECT title, body
FROM articles
WHERE search_vector @@ to_tsquery('english', 'postgresql & indexing');
Search Ranking
PostgreSQL offers ranking functions to sort results by relevance:
SELECT title,
ts_rank(search_vector, query) AS rank
FROM articles,
to_tsquery('english', 'postgresql & performance') AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT 20;
`ts_rank` considers term frequency (TF) and inverse document frequency (IDF). A variant `ts_rank_cd` uses cover density, which rewards terms that appear close together:
SELECT title,
ts_rank_cd(search_vector, query) AS rank
FROM articles, to_tsquery('english', 'database & indexing') AS query
WHERE search_vector @@ query
ORDER BY rank DESC;
Highlighting
SELECT title,
ts_headline('english', body, query,
'StartSel=, StopSel=') AS highlighted
FROM articles, to_tsquery('english', 'performance & tuning') AS query
WHERE search_vector @@ query;
Multi-Column and Weighted Search
Assign different weights to columns for ranking:
ALTER TABLE articles ADD COLUMN search_weighted tsvector
GENERATED ALWAYS AS (
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(body, '')), 'B')
) STORED;
CREATE INDEX idx_articles_weighted ON articles USING GIN (search_weighted);
The weight function `ts_rank` can use these weights:
SELECT title,
ts_rank('{0.1, 0.2, 0.4, 1.0}', search_weighted, query) AS rank
FROM articles, to_tsquery('english', 'indexing') AS query
WHERE search_weighted @@ query
ORDER BY rank DESC;
Dictionaries and Custom Configurations
PostgreSQL supports multiple text search dictionaries:
* `english_stem`: Snowball stemmer for English.
* `simple`: Lowercase conversion only.
* `unaccent`: Remove diacritics (requires extension).
* `thesaurus`: Synonym expansion.
Create a custom configuration:
CREATE TEXT SEARCH CONFIGURATION my_search (COPY = english);
CREATE TEXT SEARCH DICTIONARY my_thesaurus (
TEMPLATE = thesaurus,
DICTFILE = 'my_thesaurus.ths'
);
ALTER TEXT SEARCH CONFIGURATION my_search
ALTER MAPPING FOR asciiword
WITH my_thesaurus, english_stem;
SELECT to_tsvector('my_search', 'The database is performing well');
PostgreSQL vs Elasticsearch
| Aspect | PostgreSQL FTS | Elasticsearch | |--------|---------------|---------------| | Setup | Built-in, no extra service | Separate cluster, JVM | | Index freshness | Real-time within transaction | Near-real-time (refresh interval) | | Ranking | TF-IDF based | BM25, learning-to-rank | | Faceted search | GROUP BY with tsvector | Dedicated aggregation API | | Scale | Single node + replicas | Distributed by design | | Fuzzy search | pg_trgm extension | Built-in fuzzy matching | | Language support | 20+ languages via Snowball | 40+ language analyzers | | Query syntax | @@ tsquery operator | JSON-based Query DSL |
Performance Tuning
-- Analyze GIN index usage
SELECT relname, seq_scan, idx_scan
FROM pg_stat_all_indexes
WHERE indexrelname = 'idx_articles_search';
-- Track search index size
SELECT pg_size_pretty(pg_indexes_size('articles'));
For large datasets (millions of documents), consider:
* Partial indexes for recent documents if search is time-sensitive:
CREATE INDEX idx_recent_search ON articles USING GIN (search_vector)
WHERE published_at > now() - interval '30 days';
* Partitioning the table by date and creating GIN indexes per partition.
* Using `pg_trgm` for prefix/similarity matching as a complement to FTS:
CREATE INDEX idx_articles_title_trgm ON articles USING GIN (title gin_trgm_ops);
SELECT * FROM articles WHERE title % 'databas'; -- similarity search
PostgreSQL full-text search is adequate for the majority of applications: documentation sites, blog search, e-commerce product search, and knowledge bases. Consider dedicated search engines only when you need fuzzy matching, faceted aggregation across millions of documents, or distributed search at scale.