working around JSONB's lack of stats?

Josh Berkus <josh@xxxxxxxxxxxx> · Mon, 26 Jan 2015 23:06:09 -0800

Folks,

Currently, JSONB fields don't have statistics, and estimate a flat 1%
selectivity.  This can result in poor query plans, and I'm wondering if
anyone has a suggested workaround for this short of hacking a new
selectivity function.  For example, take the common case of using JSONB
to hold a list of "tags" for tagging documents:

 Table "public.doc_tags_json"
 Column |  Type   | Modifiers
--------+---------+-----------
 doc_id | integer |
 tags   | jsonb   |
Indexes:
    "doc_tags_json_doc_id_idx" UNIQUE, btree (doc_id)
    "doc_tags_json_tags_idx" gin (tags)

This query:

select doc_id
from doc_tags_json
where tags @> '[ "math", "physics" ]'
order by doc_id desc limit 25;

Uses this plan:

     QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..709.79 rows=25 width=4) (actual time=24.529..340.499
rows=25 loops=1)
   ->  Index Scan Backward using doc_tags_json_doc_id_idx on
doc_tags_json  (cost=0.43..283740.95 rows=10000 width=4) (actual
time=24.528..340.483 rows=25 loops=1)
         Filter: (tags @> '["math", "physics"]'::jsonb)
         Rows Removed by Filter: 1011878
 Planning time: 0.090 ms
 Execution time: 340.528 ms

It does this because it expects @> '["math", "physics"]' to match 10,000
rows, which means that it expects to only scan 25,000 entries in the
doc_id index to return the top 25.  However, the matching condition is
much rarer than it thinks, so it's actually far faster to use the index
on the JSONB column:

drop index doc_tags_json_doc_id_idx;

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10517.08..10517.14 rows=25 width=4) (actual
time=7.594..7.602 rows=25 loops=1)
   ->  Sort  (cost=10517.08..10542.08 rows=10000 width=4) (actual
time=7.593..7.596 rows=25 loops=1)
         Sort Key: doc_id
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Bitmap Heap Scan on doc_tags_json  (cost=92.90..10234.89
rows=10000 width=4) (actual time=6.733..7.475 rows=257 loops=1)
               Recheck Cond: (tags @> '["math", "physics"]'::jsonb)
               Heap Blocks: exact=256
               ->  Bitmap Index Scan on doc_tags_json_tags_idx
(cost=0.00..90.40 rows=10000 width=0) (actual time=6.695..6.695 rows=257
loops=1)
                     Index Cond: (tags @> '["math", "physics"]'::jsonb)
 Planning time: 0.093 ms
 Execution time: 7.632 ms

On a normal column, I'd raise n_distinct to reflect the higher
selecivity of the search terms.  However, since @> uses contsel,
n_distinct is ignored.  Anyone know a clever workaround I don't
currently see?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance