> On 08-07-2021, at 04:26, Joel Frid <joel.frid@xxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On my production environment (PostgreSQL 13.3), one of my queries runs very slow, about 2 minutes.
> I noticed that it does not use the execution plan I anticipated it would.
>
> The query is:
>
> SELECT t.*
> FROM test t
> WHERE t."existe" IS true
> and t."json_data" @> '{"book":{"title":"In Search of Lost Time"}}'
> ORDER BY t."id" DESC
> LIMIT 100 OFFSET 0
>
> I know PostgreSQL is not very good at performing well with pagination and offsets but all my queries must end with "LIMIT 100 OFFSET 0", "LIMIT 100 OFFSET 1", ...

I don't think any database performs well with LIMIT+OFFSET. Using OFFSET requires scanning and discarding all rows up to the row at position OFFSET; only then does the scan continue for LIMIT rows. In general, the greater the value of OFFSET, the slower the query will perform. I'd recommend using cursors for pagination instead (I know that may not be possible for you; I just wanted to explain it, as it could be useful).

> If I display the actual execution plan, I get this:
>
> Limit  (cost=0.43..1164.55 rows=100 width=632) (actual time=7884.056..121297.756 rows=1 loops=1)
>   Buffers: shared hit=5311835 read=585741 dirtied=32
>   ->  Index Scan Backward using test_pk on test  (cost=0.43..141104.29 rows=12121 width=632) (actual time=7884.053..121297.734 rows=1 loops=1)
>         Filter: ((existe IS TRUE) AND (json_data @> '{"book": {"title": "In Search of Lost Time"}}'::jsonb))
>         Rows Removed by Filter: 1215681
>         Buffers: shared hit=5311835 read=585741 dirtied=32
> Planning:
>   Buffers: shared hit=1
> Planning Time: 0.283 ms
> Execution Time: 121297.878 ms
>
> The query runs very slow from LIMIT 1 to 1147.
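As a side note on the cursor-style pagination I mentioned: the usual "keyset" variant carries the last id of the previous page into the next query instead of an OFFSET, so the scan can start right where the previous page ended. A sketch against the table from this thread (the :last_seen_id parameter is hypothetical, it stands for the id of the last row your application received):

```sql
-- First page: plain LIMIT, no OFFSET.
SELECT t.*
FROM test t
WHERE t."existe" IS true
  AND t."json_data" @> '{"book":{"title":"In Search of Lost Time"}}'
ORDER BY t."id" DESC
LIMIT 100;

-- Subsequent pages: continue below the smallest id already returned,
-- instead of discarding rows with OFFSET.
SELECT t.*
FROM test t
WHERE t."existe" IS true
  AND t."json_data" @> '{"book":{"title":"In Search of Lost Time"}}'
  AND t."id" < :last_seen_id  -- last id seen on the previous page
ORDER BY t."id" DESC
LIMIT 100;
```

The cost of each page then no longer grows with the page number, at the price of only being able to step page by page rather than jump to an arbitrary offset.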
> If I change the LIMIT value to 1148, this query runs quite fast (0.190 ms) with a nice execution plan:
>
> SELECT t.*
> FROM test t
> WHERE t."existe" IS true
> and t."json_data" @> '{"book":{"title":"In Search of Lost Time"}}'
> ORDER BY t."id" DESC
> LIMIT 1148 OFFSET 0
>
> Limit  (cost=13220.53..13223.40 rows=1148 width=632) (actual time=0.138..0.140 rows=1 loops=1)
>   Buffers: shared hit=17
>   ->  Sort  (cost=13220.53..13250.84 rows=12121 width=632) (actual time=0.137..0.138 rows=1 loops=1)
>         Sort Key: id DESC
>         Sort Method: quicksort  Memory: 27kB
>         Buffers: shared hit=17
>         ->  Bitmap Heap Scan on test  (cost=119.73..12543.88 rows=12121 width=632) (actual time=0.125..0.127 rows=1 loops=1)
>               Recheck Cond: (json_data @> '{"book": {"title": "In Search of Lost Time"}}'::jsonb)
>               Filter: (existe IS TRUE)
>               Heap Blocks: exact=1
>               Buffers: shared hit=17
>               ->  Bitmap Index Scan on test_json_data_idx  (cost=0.00..116.70 rows=12187 width=0) (actual time=0.112..0.113 rows=1 loops=1)
>                     Index Cond: (json_data @> '{"book": {"title": "In Search of Lost Time"}}'::jsonb)
>                     Buffers: shared hit=16
> Planning:
>   Buffers: shared hit=1
> Planning Time: 0.296 ms
> Execution Time: 0.190 ms
>
> Would you have any suggestions why Postgres chooses such a bad query plan?

I can guess a bit about this. A quirk of the statistics collector is that it doesn't gather detailed enough statistics to properly estimate the number of rows that will match the "@>" operator, as that is quite hard to do. In general it over-estimates how many rows will match. As you can see, both plans estimate that 12121 rows will match the "@>" clause, even though only 1 actually does. This means the planner estimates that the cost of using test_json_data_idx and then running a top-k heapsort (to get the latest 100 rows ordered by id) is far greater than iterating over the test_pk index, where the first 100 matching rows would be exactly the ones you need, already sorted.
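If you want to see that misestimate in isolation, you can EXPLAIN just the containment predicate (same table and JSON value as in the thread). Since, as described above, the statistics cannot track how many rows contain this particular value, the "rows=" estimate will stay far from the actual row count:

```sql
-- Compare the planner's "rows=" estimate against the
-- "actual ... rows=" count reported for the same node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM test
WHERE json_data @> '{"book":{"title":"In Search of Lost Time"}}';
```

Re-running this after an ANALYZE will show whether fresher statistics move the estimate at all; for "@>" on jsonb they usually don't move it much.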
In practice, iterating the test_pk index has to read over a lot of rows that don't match the "@>" operator, because the actual number of matching rows is far smaller than estimated. Once you increase the LIMIT value enough, the planner's estimated costs for the two indexes become similar, and past that threshold it switches plans.

So, some suggestions to improve the execution of that query:

Option A: Use a common table expression to "force" the usage of test_json_data_idx. Note that since PostgreSQL 12 CTEs are inlined by default, so on your 13.3 server you need AS MATERIALIZED for the CTE to act as an optimization fence:

WITH json_matching_rows AS MATERIALIZED (
    SELECT t.*
    FROM test t
    WHERE t."json_data" @> '{"book":{"title":"In Search of Lost Time"}}'
)
SELECT t.*
FROM json_matching_rows t
WHERE t."existe" IS true
ORDER BY t."id" DESC
LIMIT 100 OFFSET 0;

Option B: Use the pg_hint_plan extension to hint the usage of test_json_data_idx.

Option C: Create a functional index on the book title and tweak the query to use it. This can also be a composite index (so the values are already sorted by id) and a partial one (to only include rows where "existe" is true):

CREATE INDEX test_json_data_composite_idx
ON test USING BTREE ((json_data->'book'->>'title'), id DESC)
WHERE (existe);

SELECT t.*
FROM test t
WHERE t."existe" IS true
  AND t.json_data->'book'->>'title' = 'In Search of Lost Time'
ORDER BY t."id" DESC
LIMIT 100 OFFSET 0;

Be aware that updates touching a column used in an index expression or in a partial index's WHERE predicate cannot be HOT updates.

I hope this reply helps you.

Best regards,
Manuel Weitzman