Workaround: Planner preference for tsquery filter vs. GIN index in fast text search

As mentioned here and elsewhere (most recently in "How can I get the query planner to use a bitmap index scap instead of an index scan ?" - 8 Mar 2014), estimation of the relative cost of text search operations using GIN-indexed columns sometimes goes awry, particularly when there will be a large number of matches.

The planner may choose to use a sequential or unrelated index scan with @@ as a filter, especially when incorporated as a subquery, incurring significant cost (even without considering de-TOASTing). Pre-tsvectorizing the column offers only a slight mitigation and can cause regressions (if nothing else, it adds another large column).
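For reference, the expression index that the good plan below ends up using looks something like this. I'm reconstructing it from the index name in the plan output, so treat it as a sketch rather than the exact DDL ('english_nostop' is a custom text search configuration):

CREATE INDEX keyword_to_tsvector_keywords
    ON keywords
 USING gin (to_tsvector('english_nostop', keyword));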

What worked for me (and I'm hoping for others, though YMMV) was adding 'OFFSET 0' to the subquery involving the indexed column, e.g.

...
(SELECT sk1.submission_id
   FROM submission_keywords sk1, keywords k1
  WHERE sk1.keyword_id = k1.keyword_id
    AND to_tsvector('english_nostop', k1.keyword)
        @@ to_tsquery('english_nostop', 'tails')
 OFFSET 0)
...
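In context, the overall query shape is roughly as follows. The outer SELECT here is a simplified stand-in for our real query, so everything outside the subquery is illustrative:

SELECT s.submission_id
  FROM submissions s
 WHERE s.submission_id IN
       (SELECT sk1.submission_id
          FROM submission_keywords sk1, keywords k1
         WHERE sk1.keyword_id = k1.keyword_id
           AND to_tsvector('english_nostop', k1.keyword)
               @@ to_tsquery('english_nostop', 'tails')
        OFFSET 0);  -- the fence: the subquery is now planned on its own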

The result is a bitmap scan:
------------------------------------------------------------------------------------------
Nested Loop
    (cost=8.73..4740.29 rows=21348 width=4)
    (actual time=0.621..13.661 rows=20097 loops=1)
  ->  Bitmap Heap Scan on keywords k1
        (cost=8.30..1028.72 rows=755 width=4)
        (actual time=0.603..2.276 rows=752 loops=1)
        Recheck Cond:
            (to_tsvector('english_nostop'::regconfig, keyword) @@ '''tail'''::tsquery)
        ->  Bitmap Index Scan on keyword_to_tsvector_keywords
              (cost=0.00..8.11 rows=755 width=0)
              (actual time=0.496..0.496 rows=756 loops=1)
              Index Cond:
                  (to_tsvector('english_nostop'::regconfig, keyword) @@ '''tail'''::tsquery)
  ->  Index Only Scan using keyword_id_submission_id_submission_keywords
          on submission_keywords sk1
        (cost=0.43..3.47 rows=145 width=8)
        (actual time=0.005..0.010 rows=27 loops=752)
        Index Cond: (keyword_id = k1.keyword_id)
        Heap Fetches: 99
Total runtime: 14.809 ms

Without it, the @@ test was instead applied as a filter inside a nested loop, with disastrous results:
->  Hash Semi Join
      (cost=23.37..23.51 rows=1 width=8)
      (actual time=0.090..0.090 rows=0 loops=594670)
      Hash Cond: (s1.submission_id = sk1.submission_id)
      ->  Index Only Scan using submissions_pkey on submissions s1
            (cost=0.42..0.56 rows=1 width=4)
            (actual time=0.007..0.007 rows=1 loops=17352)
            Index Cond: (submission_id = s.submission_id)
            Heap Fetches: 8372
      ->  Hash
            (cost=22.94..22.94 rows=1 width=4)
            (actual time=0.086..0.086 rows=0 loops=594670)
            Buckets: 1024  Batches: 1  Memory Usage: 0kB
            ->  Nested Loop
                  (cost=0.85..22.94 rows=1 width=4)
                  (actual time=0.083..0.085 rows=0 loops=594670)
                  ->  Index Only Scan using file_keyword on submission_keywords sk1
                        (cost=0.43..0.80 rows=13 width=8)
                        (actual time=0.006..0.008 rows=9 loops=594670)
                        Index Cond: (submission_id = s.submission_id)
                        Heap Fetches: 21324
                  ->  Index Scan using keywords_pkey on keywords k1
                        (cost=0.42..1.69 rows=1 width=4)
                        (actual time=0.008..0.008 rows=0 loops=5329219)
                        Index Cond: (keyword_id = sk1.keyword_id)
                        Filter:
                            (to_tsvector('english_nostop'::regconfig, keyword) @@ '''tail'''::tsquery)
Total runtime: 55194.034 ms  [other plan lines omitted, but the 50+ seconds are spent above]

Yes, that's a ~3000x speedup! Not all search terms benefit so much, but we get a lot of searches for the most common terms, and the filter-based plans only get worse the more terms you add.

I got the idea from Seamus Abshere:
http://seamusabshere.github.io/2013/03/29/hinting-postgres-and-mysql-with-offset-and-limit/

I've heard it said that "any Postgres DBA worth his salt" knows this trick, as well as the use of "WITH" to create a common table expression (sketched below). Alas, many of us are still learning... I beat my head against this for a week, and it had affected our site for far longer. This kind of issue makes people think they need to replace PostgreSQL with a dedicated search solution in order to scale, which is a shame.
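For completeness, the CTE form of the same fence looks like this. In the releases I've used, a WITH subquery is always materialized, so it isolates the text search from the outer plan just as OFFSET 0 does (again, the outer query here is illustrative):

WITH matched AS (
    SELECT sk1.submission_id
      FROM submission_keywords sk1, keywords k1
     WHERE sk1.keyword_id = k1.keyword_id
       AND to_tsvector('english_nostop', k1.keyword)
           @@ to_tsquery('english_nostop', 'tails')
)
SELECT s.submission_id
  FROM submissions s
 WHERE s.submission_id IN (SELECT submission_id FROM matched);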

I know hinting has a bad rep, but this is a localized fix, and what has been said before leads me to believe that estimating the cost of such situations is a hard nut to crack - one which is not on anyone's plate right now.

Incidentally, documentation section 7.6, "LIMIT and OFFSET", states that "OFFSET 0 is the same as omitting the OFFSET clause", which is clearly not the case here, at least as far as the plan is concerned. I appreciate that this is an implementation detail which might change, but it's an important one that I think deserves mentioning.

Hope this helps,
--
Laurence "GreenReaper" Parry
greenreaper.co.uk - wikifur.com - flayrah.com - inkbunny.net
"Eternity lies ahead of us, and behind. Have you drunk your fill?"

