btw, 9.4 should be wiser in case of rare+common terms, thanks to the GIN fast
scan feature.

On Sun, Apr 20, 2014 at 8:30 AM, Laurence Parry <greenreaper@xxxxxxxxxxx> wrote:
> As mentioned here and elsewhere (most recently in "How can I get the query
> planner to use a bitmap index scan instead of an index scan?" - 8 Mar
> 2014), estimation of the relative cost of text search operations using
> GIN-indexed columns sometimes goes awry, particularly when there will be a
> large number of matches.
>
> The planner may choose to use a sequential or unrelated index scan with @@
> as a filter, especially when incorporated as a subquery, incurring
> significant cost (even without considering de-TOASTing). Pre-tsvectorizing
> the column offers only a slight mitigation and can cause regressions (if
> nothing else, it adds another large column).
>
> What worked for me (and I'm hoping for others, though YMMV) was adding
> 'OFFSET 0' to the subquery involving the indexed column, e.g.
>
> ...
> (SELECT sk1.submission_id
>  FROM submission_keywords sk1, keywords k1
>  WHERE sk1.keyword_id = k1.keyword_id
>    AND to_tsvector('english_nostop', k1.keyword) @@
>        to_tsquery('english_nostop', 'tails')
>  OFFSET 0)
> ...
>
> The result is a bitmap scan:
>
> ------------------------------------------------------------------------------------------
>  Nested Loop
>    (cost=8.73..4740.29 rows=21348 width=4)
>    (actual time=0.621..13.661 rows=20097 loops=1)
>    ->  Bitmap Heap Scan on keywords k1
>          (cost=8.30..1028.72 rows=755 width=4)
>          (actual time=0.603..2.276 rows=752 loops=1)
>          Recheck Cond: (to_tsvector('english_nostop'::regconfig, keyword) @@ '''tail'''::tsquery)
>          ->  Bitmap Index Scan on keyword_to_tsvector_keywords
>                (cost=0.00..8.11 rows=755 width=0)
>                (actual time=0.496..0.496 rows=756 loops=1)
>                Index Cond: (to_tsvector('english_nostop'::regconfig, keyword) @@ '''tail'''::tsquery)
>    ->  Index Only Scan using keyword_id_submission_id_submission_keywords on submission_keywords sk1
>          (cost=0.43..3.47 rows=145 width=8)
>          (actual time=0.005..0.010 rows=27 loops=752)
>          Index Cond: (keyword_id = k1.keyword_id)
>          Heap Fetches: 99
>  Total runtime: 14.809 ms
>
> Without this, the test was moved to a filter inside a nested loop, with
> disastrous results:
>
>  ->  Hash Semi Join
>        (cost=23.37..23.51 rows=1 width=8)
>        (actual time=0.090..0.090 rows=0 loops=594670)
>        Hash Cond: (s1.submission_id = sk1.submission_id)
>        ->  Index Only Scan using submissions_pkey on submissions s1
>              (cost=0.42..0.56 rows=1 width=4)
>              (actual time=0.007..0.007 rows=1 loops=17352)
>              Index Cond: (submission_id = s.submission_id)
>              Heap Fetches: 8372
>        ->  Hash
>              (cost=22.94..22.94 rows=1 width=4)
>              (actual time=0.086..0.086 rows=0 loops=594670)
>              Buckets: 1024  Batches: 1  Memory Usage: 0kB
>              ->  Nested Loop
>                    (cost=0.85..22.94 rows=1 width=4)
>                    (actual time=0.083..0.085 rows=0 loops=594670)
>                    ->  Index Only Scan using file_keyword on submission_keywords sk1
>                          (cost=0.43..0.80 rows=13 width=8)
>                          (actual time=0.006..0.008 rows=9 loops=594670)
>                          Index Cond: (submission_id = s.submission_id)
>                          Heap Fetches: 21324
>                    ->  Index Scan using keywords_pkey on keywords k1
>                          (cost=0.42..1.69 rows=1 width=4)
>                          (actual time=0.008..0.008 rows=0 loops=5329219)
>                          Index Cond: (keyword_id = sk1.keyword_id)
>                          Filter: (to_tsvector('english_nostop'::regconfig, keyword) @@ '''tail'''::tsquery)
>  Total runtime: 55194.034 ms
>
> [There are other lines, but the 50-second total is shown above.]
>
> Yes, that's a ~3000x speedup! Not all search terms benefit so much, but we
> get a lot of searches for the most common terms, and scans just get worse
> the more terms you add.
>
> I got the idea from Seamus Abshere:
> http://seamusabshere.github.io/2013/03/29/hinting-postgres-and-mysql-with-offset-and-limit/
>
> I've heard it said that "any Postgres DBA worth his salt" knows this trick,
> as well as the use of "WITH" to create a common table expression. Alas, many
> of us are still learning... I beat my head against this for a week, and it
> has affected our site for far longer. This kind of issue makes people think
> they need to replace PostgreSQL with a dedicated search solution in order to
> scale, which is a shame.
>
> I know hinting has a bad rep, but this is a localized fix, and what has been
> said before leads me to believe that estimating the cost of such situations
> is a hard nut to crack - one which is not on anyone's plate right now.
>
> Incidentally, documentation section 7.6 "LIMIT and OFFSET" states that
> "OFFSET 0 is the same as omitting the OFFSET clause", which is clearly not
> the case here. I appreciate that this is an implementation detail which
> might change, but it's an important one that I think deserves mentioning.
>
> Hope this helps,
> --
> Laurence "GreenReaper" Parry
> greenreaper.co.uk - wikifur.com - flayrah.com - inkbunny.net
> "Eternity lies ahead of us, and behind. Have you drunk your fill?"
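For anyone wanting to try the fence on their own schema: a minimal sketch of
the 'OFFSET 0' trick, using hypothetical names (a table "docs" with a "body"
column and a GIN index on its tsvector) rather than the schema in the quoted
query. Wrapping the text-search condition in a subquery terminated by
OFFSET 0 keeps the planner from flattening it into the outer query, so the
@@ test is satisfied by the GIN index rather than re-evaluated as a per-row
filter:

```sql
-- Hypothetical schema: docs(id, body), with
--   CREATE INDEX docs_body_tsv ON docs
--       USING gin (to_tsvector('english', body));

SELECT d.id
FROM docs d
WHERE d.id IN (
    SELECT id
    FROM docs
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'tails')
    OFFSET 0  -- fence: the subquery is planned on its own, so the
              -- @@ condition is resolved via the GIN index first
);
```

Note that on PostgreSQL releases before 12, wrapping the same subquery in a
WITH clause acts as a similar fence; from 12 on, CTEs are inlined by default
and need "AS MATERIALIZED" to retain that behaviour.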
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance