Ways to speed up ts_rank

Shane Hathaway <shane@xxxxxxxxxxxxxxx> · Tue, 09 Oct 2012 15:38:52 -0600

Hello,

First let me say thanks for a fantastic database system.  The hard work 
this community has put into Postgres really shows.

A client of mine has been using Postgres quite effectively for a while 
now, but has recently hit a performance issue with text index queries, 
specifically when using ts_rank or ts_rank_cd functions.

The database has a text index of around 200,000 documents. 
Investigation revealed that text queries are slow only when using 
ts_rank or ts_rank_cd.  Without a ts_rank function, any query is 
answered within 200ms or so; with ts_rank function, queries take up to 
30 seconds.  Deeper investigation using gprof showed that the problem is 
probably not ts_rank or ts_rank_cd, but the fact that those functions 
retrieve thousands of TOASTed tsvectors.

I estimate that the total size of the text index is 900 MB uncompressed 
(including the TOAST data).  It is a table that stores only a docid, a 
tsvector, and a couple other small fields.  Some queries cause ts_rank 
to retrieve 700 MB of that 900 MB.  That means each time a user types a 
simple query, Postgres retrieves and decompresses almost 700 MB of TOAST 
data.  That's the size of a CD.  No wonder it sometimes takes so long!

I tried using SET STORAGE to optimize access to the TOAST tuples, but 
that only made things worse.  Even with EXTERNAL storage, Postgres still 
has to copy the TOAST data in RAM and that takes time.  I also 
recompiled Postgres with a larger TOAST chunk size.  That did improve 
performance by about 20%, but I'm looking for a much bigger win than that.

The tsvectors are large because the documents are of variable quality. 
We can not rely on authors to provide good metadata; we really need the 
text index to just index everything in the documents.  I've seen how 
ts_rank can do its job with millions of documents in milliseconds, but 
that only works when the text index stores a small amount of data per 
document.  Short metadata just won't work for my client.  Also, my 
client can not simply stop using ts_rank.

I've thought of some possible solutions to speed up ts_rank:

1. Cache the whole tsvectors in RAM.  This would be different from the 
buffer cache, which apparently only knows how to cache the chunks of a 
TOASTed tsvector.  I have thought of 3 ways to do this:

    A. Write a C extension that maintains a cache of tsvectors in 
shared memory.  I think I would use each document's hash as the cache key.

    B. Add to Postgres the ability to cache any decompressed TOAST 
value.  This would probably involve expanding the SET STORAGE clause of 
ALTER TABLE so that adminstrators can configure which TOAST values are 
worth caching.

    C. Alter the buffer cache to be more aware of TOAST; make it cache 
whole TOAST values rather than chunks.  This would be invasive and might 
be a bad idea overall.

2. Maintain 2 tsvectors per row: one for querying, another for ranking. 
 The tsvector used for ranking would be trimmed to prevent TOASTing. 
This would speed up queries, but it may impact the quality of ranking.

3. Write a C extension that stores the tsvectors in memory-mapped files 
on disk, allowing us to at least take advantage of kernel-level caching.

What do you think?  Do one of these solutions pop out to you as a good 
idea, or have I overlooked some simpler solution?

Shane

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance