Workaround: Planner preference for tsquery filter vs. GIN index in fast text search

As mentioned here and elsewhere (most recently in "How can I get the query planner to use a bitmap index scap instead of an index scan ?" - 8 Mar 2014), estimation of the relative cost of text search operations using GIN-indexed columns sometimes goes awry, particularly when there will be a large number of matches.

The planner may choose to use a sequential or unrelated index scan with @@ as a filter, especially when incorporated as a subquery, incurring significant cost (even without considering de-TOASTing). Pre-tsvectorizing the column offers only a slight mitigation and can cause regressions (if nothing else, it adds another large column).
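For reference, the expression index that the good plan below ends up using looks something like this. I'm reconstructing it from the index name in the plan output, so treat it as a sketch rather than the exact DDL ('english_nostop' is a custom text search configuration):

CREATE INDEX keyword_to_tsvector_keywords
    ON keywords
 USING gin (to_tsvector('english_nostop', keyword));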

What worked for me (and I'm hoping for others, though YMMV) was adding 'OFFSET 0' to the subquery involving the indexed column, e.g.

...
(SELECT sk1.submission_id
   FROM submission_keywords sk1, keywords k1
  WHERE sk1.keyword_id = k1.keyword_id
    AND to_tsvector('english_nostop', k1.keyword)
        @@ to_tsquery('english_nostop', 'tails')
 OFFSET 0)
...
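In context, the overall query shape is roughly as follows. The outer SELECT here is a simplified stand-in for our real query, so everything outside the subquery is illustrative:

SELECT s.submission_id
  FROM submissions s
 WHERE s.submission_id IN
       (SELECT sk1.submission_id
          FROM submission_keywords sk1, keywords k1
         WHERE sk1.keyword_id = k1.keyword_id
           AND to_tsvector('english_nostop', k1.keyword)
               @@ to_tsquery('english_nostop', 'tails')
        OFFSET 0);  -- the fence: the subquery is now planned on its own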

The result is a bitmap scan:
------------------------------------------------------------------------------------------
Nested Loop
    (cost=8.73..4740.29 rows=21348 width=4)
    (actual time=0.621..13.661 rows=20097 loops=1)
  ->  Bitmap Heap Scan on keywords k1
        (cost=8.30..1028.72 rows=755 width=4)
        (actual time=0.603..2.276 rows=752 loops=1)
        Recheck Cond:
            (to_tsvector('english_nostop'::regconfig, keyword) @@ '''tail'''::tsquery)
        ->  Bitmap Index Scan on keyword_to_tsvector_keywords
              (cost=0.00..8.11 rows=755 width=0)
              (actual time=0.496..0.496 rows=756 loops=1)
              Index Cond:
                  (to_tsvector('english_nostop'::regconfig, keyword) @@ '''tail'''::tsquery)
  ->  Index Only Scan using keyword_id_submission_id_submission_keywords
          on submission_keywords sk1
        (cost=0.43..3.47 rows=145 width=8)
        (actual time=0.005..0.010 rows=27 loops=752)
        Index Cond: (keyword_id = k1.keyword_id)
        Heap Fetches: 99
Total runtime: 14.809 ms

Without it, the @@ test was instead applied as a filter inside a nested loop, with disastrous results:
->  Hash Semi Join
      (cost=23.37..23.51 rows=1 width=8)
      (actual time=0.090..0.090 rows=0 loops=594670)
      Hash Cond: (s1.submission_id = sk1.submission_id)
      ->  Index Only Scan using submissions_pkey on submissions s1
            (cost=0.42..0.56 rows=1 width=4)
            (actual time=0.007..0.007 rows=1 loops=17352)
            Index Cond: (submission_id = s.submission_id)
            Heap Fetches: 8372
      ->  Hash
            (cost=22.94..22.94 rows=1 width=4)
            (actual time=0.086..0.086 rows=0 loops=594670)
            Buckets: 1024  Batches: 1  Memory Usage: 0kB
            ->  Nested Loop
                  (cost=0.85..22.94 rows=1 width=4)
                  (actual time=0.083..0.085 rows=0 loops=594670)
                  ->  Index Only Scan using file_keyword on submission_keywords sk1
                        (cost=0.43..0.80 rows=13 width=8)
                        (actual time=0.006..0.008 rows=9 loops=594670)
                        Index Cond: (submission_id = s.submission_id)
                        Heap Fetches: 21324
                  ->  Index Scan using keywords_pkey on keywords k1
                        (cost=0.42..1.69 rows=1 width=4)
                        (actual time=0.008..0.008 rows=0 loops=5329219)
                        Index Cond: (keyword_id = sk1.keyword_id)
                        Filter:
                            (to_tsvector('english_nostop'::regconfig, keyword) @@ '''tail'''::tsquery)
Total runtime: 55194.034 ms  [other plan lines omitted, but the 50+ seconds are spent above]

Yes, that's a ~3000x speedup! Not all search terms benefit so much, but we get a lot of searches for the most common terms, and the filter-based plans only get worse the more terms you add.

I got the idea from Seamus Abshere:
http://seamusabshere.github.io/2013/03/29/hinting-postgres-and-mysql-with-offset-and-limit/

I've heard it said that "any Postgres DBA worth his salt" knows this trick, as well as the use of "WITH" to create a common table expression (sketched below). Alas, many of us are still learning... I beat my head against this for a week, and it had affected our site for far longer. This kind of issue makes people think they need to replace PostgreSQL with a dedicated search solution in order to scale, which is a shame.
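For completeness, the CTE form of the same fence looks like this. In the releases I've used, a WITH subquery is always materialized, so it isolates the text search from the outer plan just as OFFSET 0 does (again, the outer query here is illustrative):

WITH matched AS (
    SELECT sk1.submission_id
      FROM submission_keywords sk1, keywords k1
     WHERE sk1.keyword_id = k1.keyword_id
       AND to_tsvector('english_nostop', k1.keyword)
           @@ to_tsquery('english_nostop', 'tails')
)
SELECT s.submission_id
  FROM submissions s
 WHERE s.submission_id IN (SELECT submission_id FROM matched);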

I know hinting has a bad rep, but this is a localized fix, and what has been said before leads me to believe that estimating the cost of such situations is a hard nut to crack - one which is not on anyone's plate right now.

Incidentally, documentation section 7.6, "LIMIT and OFFSET", states that "OFFSET 0 is the same as omitting the OFFSET clause", which is clearly not the case here, at least as far as the plan is concerned. I appreciate that this is an implementation detail which might change, but it's an important one that I think deserves mentioning.

Hope this helps,
--
Laurence "GreenReaper" Parry
greenreaper.co.uk - wikifur.com - flayrah.com - inkbunny.net
"Eternity lies ahead of us, and behind. Have you drunk your fill?"

