Jeff Davis wrote: > On Thu, 2009-10-22 at 18:28 +0200, Jesper Krogh wrote: >> I somehow would expect the index-search to take advantage of the MCV's >> informations in the statistics that sort of translate it into a search >> and post-filtering (as PG's queryplanner usually does at the SQL-level). > > MCVs are full values that are found in columns or indexes -- you aren't > likely to have two entire documents that are exactly equal, so MCVs are > useless in your example. According to my testing, this is not the case and if it was the case, the queryplanner most likely wouldn't be able to plan this query correct: select id from ftstable where tsvectorcol @@ to_tsquery('commonterm') order by id limit 10; (into a index-scan on ID and select id from ftstable where tsvectorcol @@ to_tsquery('rareterm'); into a bitmap index scan on the tsvectorcol and a subsequent sort. This is indeed information on individual terms from the statistics that enable this. > I believe that stop words are a more common way of accomplishing what > you want to do, but they are slightly more limited: they won't be > checked at any level, and so are best used for truly common words like > "and". From your example, I assume that you still want the word checked, > but it's not selective enough to be usefully checked by the index. the terms are really common non-stop-words. > In effect, what you want are words that aren't searched (or stored) in > the index, but are included in the tsvector (so the RECHECK still > works). That sounds like it would solve your problem and it would reduce > index size, improve update performance, etc. I don't know how difficult > it would be to implement, but it sounds reasonable to me. > > The only disadvantage is that it's more metadata to manage -- all of the > existing data like dictionaries and stop words, plus this new "common > words". Also, it would mean that almost every match requires RECHECK. It > would be interesting to know how common a word needs to be before it's > better to leave it out of the index. That sounds like it could require an index rebuild if the distribution changes? That would be another plan to pursue, but the MCV is allready there : ftstest=# select * from ftsbody; id | body | ftsbody_body_fts ----+----------------------------------------------+------------------------------------------------- 1 | the cat is not a rat uniqueterm1 uniqueterm2 | 'cat':2 'rat':6 'uniqueterm1':7 'uniqueterm2':8 2 | elephant uniqueterm1 uniqueterm2 | 'eleph':1 'uniqueterm1':2 'uniqueterm2':3 3 | cannon uniqueterm1 uniqueterm2 | 'cannon':1 'uniqueterm1':2 'uniqueterm2':3 (3 rows) ftstest=# select most_common_vals, most_common_freqs from pg_stats where tablename = 'ftsbody' and attname = 'ftsbody_body_fts'; most_common_vals | most_common_freqs ---------------------------+------------------- {uniqueterm1,uniqueterm2} | {1,1,1,1} (1 row) And the query-planner uses this information. -- Jesper. -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance