Hi,
I've been playing with jsonb for storing and querying data from the Mailgun Events API (http://documentation.mailgun.com/api-events.html#examples). I already have a system that parses the JSON to CSV and loads it into standard tables, but what better way to spend the holidays than nerding out on new Postgres tech :-)? I'm using the official postgres-supplied 9.4 packages for Ubuntu 14.04.
I have been testing two layouts: inserting the raw API-call JSON as one row per page of 100 events ({"items":[{"item1"....],"paging":...}), and extracting the individual events from the "items" array and inserting one row per event ({"item1":...}, {"item2":...}). I went per-event to try to benefit from GIN indices, but much to my surprise, adding indices to the per-event table actually drastically *reduces* performance!
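(For reference, one way to do the per-event extraction inside Postgres is with jsonb_array_elements; "raw_pages" here is just a hypothetical staging table holding one raw API response per row, not something I necessarily have:

mgevents=# create table raw_pages (page jsonb);
mgevents=# insert into myevents (event)
           select jsonb_array_elements(page -> 'items')
           from raw_pages;

I actually did the splitting outside the database, but the result is equivalent.)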
mgevents=# create table myevents (event jsonb);
mgevents=# create index idx_myevents on myevents using gin (event jsonb_path_ops);
mgevents=# create index idx_myevents_no_path on myevents using gin (event);
mgevents=# select count(*) from myevents;
count
---------
3715600
(1 row)
mgevents=# explain (analyze, buffers) select count(*) from myevents where event ->> 'event' = 'delivered';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=279884.45..279884.46 rows=1 width=0) (actual time=14626.213..14626.216 rows=1 loops=1)
Buffers: shared read=448208
-> Seq Scan on myevents (cost=0.00..279838.00 rows=18578 width=0) (actual time=0.719..11432.283 rows=1417152 loops=1)
Filter: ((event ->> 'event'::text) = 'delivered'::text)
Rows Removed by Filter: 2298448
Buffers: shared read=448208
Planning time: 0.074 ms
Execution time: 14626.955 ms
(8 rows)
mgevents=# explain (analyze, buffers) select count(*) from myevents where event @> '{"event": "delivered"}';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3624.38..3624.39 rows=1 width=0) (actual time=80984.095..80984.098 rows=1 loops=1)
Buffers: shared hit=1 read=298589
-> Bitmap Heap Scan on myevents (cost=35.80..3615.09 rows=3716 width=0) (actual time=351.510..77669.907 rows=1417152 loops=1)
Recheck Cond: (event @> '{"event": "delivered"}'::jsonb)
Heap Blocks: exact=298362
Buffers: shared hit=1 read=298589
-> Bitmap Index Scan on idx_myevents (cost=0.00..34.87 rows=3716 width=0) (actual time=257.219..257.219 rows=1417152 loops=1)
Index Cond: (event @> '{"event": "delivered"}'::jsonb)
Buffers: shared hit=1 read=227
Planning time: 3.197 ms
Execution time: 80986.340 ms
(11 rows)
mgevents=# drop index idx_myevents;
mgevents=# explain (analyze, buffers) select count(*) from myevents where event @> '{"event": "delivered"}';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3631.38..3631.39 rows=1 width=0) (actual time=81898.250..81898.252 rows=1 loops=1)
Buffers: shared hit=383 read=299133
-> Bitmap Heap Scan on myevents (cost=42.80..3622.09 rows=3716 width=0) (actual time=534.816..78526.944 rows=1417152 loops=1)
Recheck Cond: (event @> '{"event": "delivered"}'::jsonb)
Heap Blocks: exact=298362
Buffers: shared hit=383 read=299133
-> Bitmap Index Scan on idx_myevents_no_path (cost=0.00..41.87 rows=3716 width=0) (actual time=453.412..453.412 rows=1417152 loops=1)
Index Cond: (event @> '{"event": "delivered"}'::jsonb)
Buffers: shared hit=383 read=771
Planning time: 2.322 ms
Execution time: 81898.578 ms
(11 rows)
mgevents=# drop index idx_myevents_no_path;
mgevents=# explain (analyze, buffers) select count(*) from myevents where event @> '{"event": "delivered"}';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=270558.29..270558.30 rows=1 width=0) (actual time=19834.530..19834.532 rows=1 loops=1)
Buffers: shared hit=130935 read=317273
-> Seq Scan on myevents (cost=0.00..270549.00 rows=3716 width=0) (actual time=4.650..16764.726 rows=1417152 loops=1)
Filter: (event @> '{"event": "delivered"}'::jsonb)
Rows Removed by Filter: 2298448
Buffers: shared hit=130935 read=317273
Planning time: 0.238 ms
Execution time: 19834.605 ms
(8 rows)
So it doesn't even appear to be a problem with the operator (->> vs. @>), but rather with the planner choosing the index over a plain sequential scan. As I'm completely new to jsonb and GIN there are probably lots of better ways to do this and to configure the server (a VM on my laptop...), but it seems surprising that adding an index should ever drastically reduce SELECT performance! I also tried an index on just the event key, with and without jsonb_path_ops:
mgevents=# create index idx_myevents_event_type_path on myevents using gin ((event -> 'event') jsonb_path_ops);
But queries just end up doing a sequential scan anyway, so I'm probably doing something wrong.
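(For completeness: if I'm reading the docs right, a jsonb_path_ops index on the expression (event -> 'event') can only be matched by a containment query on that same expression, e.g.

mgevents=# select count(*) from myevents where event -> 'event' @> '"delivered"';

And since my filter is really just an equality test on a single scalar key, I assume a plain btree expression index would also work and be much smaller:

mgevents=# create index idx_myevents_event_type on myevents ((event ->> 'event'));
mgevents=# select count(*) from myevents where event ->> 'event' = 'delivered';

I haven't benchmarked either of these on this dataset yet.)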
Any pointers? Thanks!
Anton