btw, 9.4 should be wiser in case of rare+common terms, thanks to the GIN fast
scan feature.

On Sun, Apr 20, 2014 at 8:30 AM, Laurence Parry <greenreaper@xxxxxxxxxxx> wrote:
> As mentioned here and elsewhere (most recently in "How can I get the query
> planner to use a bitmap index scan instead of an index scan?" - 8 Mar
> 2014), estimation of the relative cost of text search operations using
> GIN-indexed columns sometimes goes awry, particularly when there will be a
> large number of matches.
>
> The planner may choose to use a sequential or unrelated index scan with @@
> as a filter, especially when incorporated as a subquery, incurring
> significant cost (even without considering de-TOASTing). Pre-tsvectorizing
> the column offers only a slight mitigation and can cause regressions (if
> nothing else, it adds another large column).
>
> What worked for me (and I'm hoping for others, though YMMV) was adding
> 'OFFSET 0' to the subquery involving the indexed column, e.g.
>
> ...
> (SELECT sk1.submission_id
>  FROM submission_keywords sk1, keywords k1
>  WHERE sk1.keyword_id = k1.keyword_id
>    AND to_tsvector('english_nostop', k1.keyword) @@
>        to_tsquery('english_nostop', 'tails')
>  OFFSET 0)
> ...
>
> The result is a bitmap scan:
>
> ------------------------------------------------------------------------------------------
>  Nested Loop
>    (cost=8.73..4740.29 rows=21348 width=4)
>    (actual time=0.621..13.661 rows=20097 loops=1)
>    ->  Bitmap Heap Scan on keywords k1
>          (cost=8.30..1028.72 rows=755 width=4)
>          (actual time=0.603..2.276 rows=752 loops=1)
>          Recheck Cond: (to_tsvector('english_nostop'::regconfig, keyword) @@ '''tail'''::tsquery)
>          ->  Bitmap Index Scan on keyword_to_tsvector_keywords
>                (cost=0.00..8.11 rows=755 width=0)
>                (actual time=0.496..0.496 rows=756 loops=1)
>                Index Cond: (to_tsvector('english_nostop'::regconfig, keyword) @@ '''tail'''::tsquery)
>    ->  Index Only Scan using keyword_id_submission_id_submission_keywords on submission_keywords sk1
>          (cost=0.43..3.47 rows=145 width=8)
>          (actual time=0.005..0.010 rows=27 loops=752)
>          Index Cond: (keyword_id = k1.keyword_id)
>          Heap Fetches: 99
>  Total runtime: 14.809 ms
>
> Without this, the test was moved to a filter inside a nested loop, with
> disastrous results:
>
>  ->  Hash Semi Join
>        (cost=23.37..23.51 rows=1 width=8)
>        (actual time=0.090..0.090 rows=0 loops=594670)
>        Hash Cond: (s1.submission_id = sk1.submission_id)
>        ->  Index Only Scan using submissions_pkey on submissions s1
>              (cost=0.42..0.56 rows=1 width=4)
>              (actual time=0.007..0.007 rows=1 loops=17352)
>              Index Cond: (submission_id = s.submission_id)
>              Heap Fetches: 8372
>        ->  Hash
>              (cost=22.94..22.94 rows=1 width=4)
>              (actual time=0.086..0.086 rows=0 loops=594670)
>              Buckets: 1024  Batches: 1  Memory Usage: 0kB
>              ->  Nested Loop
>                    (cost=0.85..22.94 rows=1 width=4)
>                    (actual time=0.083..0.085 rows=0 loops=594670)
>                    ->  Index Only Scan using file_keyword on submission_keywords sk1
>                          (cost=0.43..0.80 rows=13 width=8)
>                          (actual time=0.006..0.008 rows=9 loops=594670)
>                          Index Cond: (submission_id = s.submission_id)
>                          Heap Fetches: 21324
>                    ->  Index Scan using keywords_pkey on keywords k1
>                          (cost=0.42..1.69 rows=1 width=4)
>                          (actual time=0.008..0.008 rows=0 loops=5329219)
>                          Index Cond: (keyword_id = sk1.keyword_id)
>                          Filter: (to_tsvector('english_nostop'::regconfig, keyword) @@ '''tail'''::tsquery)
>  Total runtime: 55194.034 ms
>
> [There are other lines, but the 50-second total is shown above.]
>
> Yes, that's a ~3000x speedup! Not all search terms benefit so much, but we
> get a lot of searches for the most common terms, and scans just get worse
> the more terms you add.
>
> I got the idea from Seamus Abshere:
> http://seamusabshere.github.io/2013/03/29/hinting-postgres-and-mysql-with-offset-and-limit/
>
> I've heard it said that "any Postgres DBA worth his salt" knows this trick,
> as well as the use of "WITH" to create a common table expression. Alas, many
> of us are still learning... I beat my head against this for a week, and it
> has affected our site for far longer. This kind of issue makes people think
> they need to replace PostgreSQL with a dedicated search solution in order to
> scale, which is a shame.
>
> I know hinting has a bad rep, but this is a localized fix, and what has been
> said before leads me to believe that estimating the cost of such situations
> is a hard nut to crack - one which is not on anyone's plate right now.
>
> Incidentally, documentation section 7.6 "LIMIT and OFFSET" states that
> "OFFSET 0 is the same as omitting the OFFSET clause", which is clearly not
> the case here. I appreciate that this is an implementation detail which
> might change, but it's an important one that I think deserves mentioning.
>
> Hope this helps,
> --
> Laurence "GreenReaper" Parry
> greenreaper.co.uk - wikifur.com - flayrah.com - inkbunny.net
> "Eternity lies ahead of us, and behind. Have you drunk your fill?"
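For anyone wanting to try the fence on their own schema: a minimal sketch of
the 'OFFSET 0' trick, using hypothetical names (a table "docs" with a "body"
column and a GIN index on its tsvector) rather than the schema in the quoted
query. Wrapping the text-search condition in a subquery terminated by
OFFSET 0 keeps the planner from flattening it into the outer query, so the
@@ test is satisfied by the GIN index rather than re-evaluated as a per-row
filter:

```sql
-- Hypothetical schema: docs(id, body), with
--   CREATE INDEX docs_body_tsv ON docs
--       USING gin (to_tsvector('english', body));

SELECT d.id
FROM docs d
WHERE d.id IN (
    SELECT id
    FROM docs
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'tails')
    OFFSET 0  -- fence: the subquery is planned on its own, so the
              -- @@ condition is resolved via the GIN index first
);
```

Note that on PostgreSQL releases before 12, wrapping the same subquery in a
WITH clause acts as a similar fence; from 12 on, CTEs are inlined by default
and need "AS MATERIALIZED" to retain that behaviour.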
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance