Hello everyone,

I am currently evaluating PostgreSQL for storing and querying jsonb documents combined with tsvector full-text search. Let's consider this setup:

create table docs (id serial primary key, meta jsonb);
-- generate 10M entries, cf. appendix
create index docs_meta_idx ON docs using gin (meta jsonb_path_ops);
create index docs_name_idx ON docs using gin (to_tsvector('english', meta->>'name'));
create index docs_address_idx ON docs using gin (to_tsvector('english', meta->>'address'));

Testing with some smaller datasets, it works great functionality-wise. However, after increasing to 10M rows, things tend to slow down (using PostgreSQL 9.5):

explain analyze select id from docs where meta @> '{"age": 20}';
  Planning time: 0.121 ms
  Execution time: 4873.507 ms

explain analyze select id from docs where meta @> '{"age": 20}';
  Planning time: 0.122 ms
  Execution time: 206.289 ms

explain analyze select id from docs where meta @> '{"age": 30}';
  Planning time: 0.109 ms
  Execution time: 7496.886 ms

explain analyze select id from docs where meta @> '{"age": 30}';
  Planning time: 0.114 ms
  Execution time: 1169.649 ms

explain analyze select id from docs where to_tsvector('english', meta->>'name') @@ to_tsquery('english', 'john');
  Planning time: 0.179 ms
  Execution time: 10109.375 ms

explain analyze select id from docs where to_tsvector('english', meta->>'name') @@ to_tsquery('english', 'john');
  Planning time: 0.188 ms
  Execution time: 238.854 ms

As you can see, the first run of each query is slow and only the repeated run is fast. Running "select pg_prewarm('docs');", or pg_prewarm on any of the indexes, doesn't help either. After a

  systemctl stop postgresql.service && sync && echo 3 > /proc/sys/vm/drop_caches && systemctl start postgresql.service

the age=20, age=30 and name=john queries are slow again.

Is there a way to speed up these first runs, or to warm things up permanently?

Regards,
Sven


Appendix I: example JSON document:

{"age": 20, "name": "Michelle Hernandez", "birth": "1991-08-16", "address": "94753 Tina Bridge Suite 318\\nEmilyport, MT 75302"}

Appendix II: the Python script used to generate the fake JSON data. Needs "pip install faker".

>>> python fake_json.py > test.json   # generates 2M entries; takes some time
>>> cat test.json | psql -c 'copy docs (meta) from stdin'
>>> cat test.json | psql -c 'copy docs (meta) from stdin'
>>> cat test.json | psql -c 'copy docs (meta) from stdin'
>>> cat test.json | psql -c 'copy docs (meta) from stdin'
>>> cat test.json | psql -c 'copy docs (meta) from stdin'

-- fake_json.py --
import json

import faker

fake = faker.Faker()

# Escape the JSON-encoded newlines ("\n" -> "\\n") so that COPY's text
# format does not turn them back into raw newline characters.
for i in range(2 * 10**6):
    print(json.dumps({"name": fake.name(), "birth": fake.date(),
                      "address": fake.address(),
                      "age": fake.random_int(0, 100)}).replace('\\n', '\\\\n'))
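
P.S. For completeness, the warm-up sequence I tried was roughly the following (a minimal sketch; pg_prewarm is a contrib extension and has to be created once per database, and the size queries at the end are only included to show how the table and index sizes can be compared against shared_buffers / RAM):

create extension if not exists pg_prewarm;

-- load the heap and the three GIN indexes into shared buffers
select pg_prewarm('docs');
select pg_prewarm('docs_meta_idx');
select pg_prewarm('docs_name_idx');
select pg_prewarm('docs_address_idx');

-- check how large the relations actually are, to see whether they
-- can fit into shared_buffers / RAM in the first place
select pg_size_pretty(pg_total_relation_size('docs'));
select pg_size_pretty(pg_relation_size('docs_meta_idx'));
select pg_size_pretty(pg_relation_size('docs_name_idx'));
select pg_size_pretty(pg_relation_size('docs_address_idx'));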