Full-Text Search in PostgreSQL: tsvector, tsquery, GIN Indexes

PostgreSQL's full-text search (FTS) provides built-in text search capabilities without external dependencies. While not as feature-rich as Elasticsearch or Meilisearch, it handles a large class of search needs efficiently.

The FTS Pipeline

Full-text search in PostgreSQL follows this flow:

* **Parsing**: Break text into tokens (lexemes).

2\. **Normalization**: Convert tokens to a standard form via a dictionary (stemming, stop words). 3\. **Indexing**: Store the normalized tokens in a `tsvector`. 4\. **Querying**: Match a `tsquery` against the `tsvector` using a GIN index.

tsvector and tsquery

tsvector

A `tsvector` is a sorted list of distinct lexemes with positional information:

SELECT to_tsvector('english',

'The quick brown fox jumps over the lazy dog');

-- 'brown':3 'dog':9 'fox':4 'jump':5 'lazy':8 'quick':2

Notice "jumps" became "jump" (stemming), and "the", "over" are removed (stop words).

tsquery

A `tsquery` represents a search query:

SELECT to_tsquery('english', 'fox & dog');

-- 'fox' & 'dog'

SELECT to_tsquery('english', 'jump | run');

-- 'jump' | 'run'

SELECT to_tsquery('english', '!cat');

-- !'cat'

SELECT to_tsquery('english', 'quick <-> brown');

-- 'quick' <-> 'brown' (adjacency: quick followed by brown)

Basic Search

SELECT title, body

FROM articles

WHERE to_tsvector('english', body) @@ to_tsquery('english', 'database & performance');

Creating a Search Column

For production use, store the `tsvector` in a generated column to avoid repeated conversion:

ALTER TABLE articles

ADD COLUMN search_vector tsvector

GENERATED ALWAYS AS (

to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))

) STORED;

CREATE INDEX idx_articles_search ON articles USING GIN (search_vector);

Now queries become:

SELECT title, body

FROM articles

WHERE search_vector @@ to_tsquery('english', 'postgresql & indexing');

Search Ranking

PostgreSQL offers ranking functions to sort results by relevance:

SELECT title,

ts_rank(search_vector, query) AS rank

FROM articles,

to_tsquery('english', 'postgresql & performance') AS query

WHERE search_vector @@ query

ORDER BY rank DESC

LIMIT 20;

`ts_rank` considers term frequency (TF) and inverse document frequency (IDF). A variant `ts_rank_cd` uses cover density, which rewards terms that appear close together:

SELECT title,

ts_rank_cd(search_vector, query) AS rank

FROM articles, to_tsquery('english', 'database & indexing') AS query

WHERE search_vector @@ query

ORDER BY rank DESC;

Highlighting

SELECT title,

ts_headline('english', body, query,

'StartSel=, StopSel=') AS highlighted

FROM articles, to_tsquery('english', 'performance & tuning') AS query

WHERE search_vector @@ query;

Multi-Column and Weighted Search

Assign different weights to columns for ranking:

ALTER TABLE articles ADD COLUMN search_weighted tsvector

GENERATED ALWAYS AS (

setweight(to_tsvector('english', coalesce(title, '')), 'A') ||

setweight(to_tsvector('english', coalesce(body, '')), 'B')

) STORED;

CREATE INDEX idx_articles_weighted ON articles USING GIN (search_weighted);

The weight function `ts_rank` can use these weights:

SELECT title,

ts_rank('{0.1, 0.2, 0.4, 1.0}', search_weighted, query) AS rank

FROM articles, to_tsquery('english', 'indexing') AS query

WHERE search_weighted @@ query

ORDER BY rank DESC;

Dictionaries and Custom Configurations

PostgreSQL supports multiple text search dictionaries:

* `english_stem`: Snowball stemmer for English.

* `simple`: Lowercase conversion only.

* `unaccent`: Remove diacritics (requires extension).

* `thesaurus`: Synonym expansion.

Create a custom configuration:

CREATE TEXT SEARCH CONFIGURATION my_search (COPY = english);

CREATE TEXT SEARCH DICTIONARY my_thesaurus (

TEMPLATE = thesaurus,

DICTFILE = 'my_thesaurus.ths'

);

ALTER TEXT SEARCH CONFIGURATION my_search

ALTER MAPPING FOR asciiword

WITH my_thesaurus, english_stem;

SELECT to_tsvector('my_search', 'The database is performing well');

PostgreSQL vs Elasticsearch

| Aspect | PostgreSQL FTS | Elasticsearch | |--------|---------------|---------------| | Setup | Built-in, no extra service | Separate cluster, JVM | | Index freshness | Real-time within transaction | Near-real-time (refresh interval) | | Ranking | TF-IDF based | BM25, learning-to-rank | | Faceted search | GROUP BY with tsvector | Dedicated aggregation API | | Scale | Single node + replicas | Distributed by design | | Fuzzy search | pg_trgm extension | Built-in fuzzy matching | | Language support | 20+ languages via Snowball | 40+ language analyzers | | Query syntax | @@ tsquery operator | JSON-based Query DSL |

Performance Tuning

-- Analyze GIN index usage

SELECT relname, seq_scan, idx_scan

FROM pg_stat_all_indexes

WHERE indexrelname = 'idx_articles_search';

-- Track search index size

SELECT pg_size_pretty(pg_indexes_size('articles'));

For large datasets (millions of documents), consider:

* Partial indexes for recent documents if search is time-sensitive:

CREATE INDEX idx_recent_search ON articles USING GIN (search_vector)

WHERE published_at > now() - interval '30 days';

* Partitioning the table by date and creating GIN indexes per partition.

* Using `pg_trgm` for prefix/similarity matching as a complement to FTS:

CREATE INDEX idx_articles_title_trgm ON articles USING GIN (title gin_trgm_ops);

SELECT * FROM articles WHERE title % 'databas'; -- similarity search

PostgreSQL full-text search is adequate for the majority of applications: documentation sites, blog search, e-commerce product search, and knowledge bases. Consider dedicated search engines only when you need fuzzy matching, faceted aggregation across millions of documents, or distributed search at scale.

Full-Text Search in PostgreSQL: tsvector, tsquery, GIN Indexes

Full-Text Search in PostgreSQL: tsvector, tsquery, GIN Indexes

Related Articles