Re: Improve Seq scan performance

OK, I see your problem. Try this:

Read this: http://www.postgresql.org/docs/current/static/pgtrgm.html
Then locate the pg_trgm.sql file and \i it from psql.
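For example, from psql (the path is just an illustration; it depends on your installation):

test=> \i /usr/share/postgresql/contrib/pg_trgm.sql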

CREATE TABLE dict( s TEXT );

I loaded the English-German dictionary into a test table. I didn't parse it, so it's just a bunch of 418552 strings, English and German mixed.
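Loading can be done with COPY, something like this (the file name is hypothetical, and this assumes one entry per line with no tab characters, tab being COPY's default delimiter; use \copy from psql for a client-side file):

COPY dict FROM '/tmp/dict_en_de.txt';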

test=> EXPLAIN ANALYZE SELECT * FROM dict WHERE s LIKE '%tation%';
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Seq Scan on dict  (cost=0.00..7445.90 rows=133 width=13) (actual time=0.102..217.155 rows=802 loops=1)
   Filter: (s ~~ '%tation%'::text)
 Total runtime: 217.451 ms
(3 rows)

Time: 217.846 ms

Since this data does not change very often, let's use a GIN index.

CREATE INDEX trgm_idx ON dict USING gin (s gin_trgm_ops);

With trigrams we can search by similarity. So, we can issue this:

EXPLAIN ANALYZE SELECT s, similarity(s, 'tation') AS sml FROM dict WHERE s % 'tation' ORDER BY sml DESC, s;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1114.44..1115.49 rows=419 width=13) (actual time=190.778..190.980 rows=500 loops=1)
   Sort Key: (similarity(s, 'tation'::text)), s
   Sort Method:  quicksort  Memory: 37kB
   ->  Bitmap Heap Scan on dict  (cost=35.80..1096.19 rows=419 width=13) (actual time=113.486..188.825 rows=500 loops=1)
         Filter: (s % 'tation'::text)
         ->  Bitmap Index Scan on trgm_idx  (cost=0.00..35.69 rows=419 width=0) (actual time=112.011..112.011 rows=15891 loops=1)
               Index Cond: (s % 'tation'::text)
 Total runtime: 191.189 ms

It is not much faster than the seq scan here, but it can give you useful results, tolerate spelling errors, etc. Perhaps it's also faster when the data is not cached.
Sample of results:

 taxation                |      0.6
 station                 |      0.5
 tabulation              |      0.5
 taction                 |      0.5
 Taxation {f}            |      0.5
 Taxation {f}            |      0.5
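Note that the % operator matches when the trigram similarity is above pg_trgm's threshold, 0.3 by default. You can inspect and adjust it with the functions shipped with the module:

SELECT show_limit();    -- current similarity threshold (0.3 by default)
SELECT set_limit(0.5);  -- only keep closer matches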

If you do not want to correct for spelling errors, you can do it like this:

EXPLAIN ANALYZE SELECT s FROM dict WHERE s LIKE '%tation%' AND s % 'tation';
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on dict  (cost=35.70..1096.09 rows=1 width=13) (actual time=66.583..80.980 rows=306 loops=1)
   Filter: ((s ~~ '%tation%'::text) AND (s % 'tation'::text))
   ->  Bitmap Index Scan on trgm_idx  (cost=0.00..35.69 rows=419 width=0) (actual time=65.799..65.799 rows=15891 loops=1)
         Index Cond: (s % 'tation'::text)
 Total runtime: 81.140 ms
(5 rows)

Time: 81.652 ms

In this case the trigram index is used to narrow down the search, and the LIKE filter then keeps only the exact substring matches.

Careful though: it might not always match. For instance, if you search "rat" you won't find "consideration", even though LIKE '%rat%' matches it, because such a short search string shares too few trigrams with a long word to pass the similarity threshold.
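You can check this with the similarity() function directly; "rat" shares only the trigram 'rat' with "consideration", so the score stays far below the default 0.3 threshold:

SELECT similarity('consideration', 'rat');
-- roughly 0.06 here, so (s % 'rat') is false and the row is filtered out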

Anyway, I would suggest changing your strategy.

You could try preloading everything into an in-memory array of strings; this would be much faster. You could also build a list of the unique words in your dictionary, which contains lots of multi-word expressions. Then, when the user enters a query, get the words that contain the entered text, and use a full-text index to search your dictionary for those words.
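Here is a minimal sketch of the word-list idea, assuming the entries are whitespace-separated and you are on 8.4 or later for regexp_split_to_table (on older versions, split the strings in your application instead):

CREATE TABLE words AS
    SELECT DISTINCT lower(w) AS word
    FROM (SELECT regexp_split_to_table(s, E'\\s+') AS w FROM dict) AS sub
    WHERE length(w) > 0;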

> I tested first with only some words, and later with '%a%', '%b%', etc. When I
> re-query the table with a term that was already used (e.g. 1. '%a%' - slow,
> 2. '%b%' - slow, then '%a%' again - fast), it is faster than the old method.

When the user enters a very short string like 'a' or 'is', I don't think it is relevant to display all entries that contain it, because that could be most of your dictionary. Instead, why not display all unique words which start with this string? Far fewer results, faster, and probably more useful too. Then the user can select a longer word and search with that.
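For example, with the hypothetical words table from the sketch above (text_pattern_ops makes the LIKE prefix search indexable regardless of locale):

CREATE INDEX words_prefix_idx ON words (word text_pattern_ops);

SELECT word FROM words WHERE word LIKE 'is%' ORDER BY word LIMIT 100;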

Also, pagination is overrated. If there are 50 pages of results, the user will never click on them anyway. They are more likely to refine their query instead. So, just display the first 100 results and be done with it ;)

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
