Hi,

I recently stumbled across an interesting query performance question over at StackOverflow [1], which caught my attention, so I started to investigate the issue further. One of the things consuming most of the time was an Index Only Scan executed millions of times, and on top of it sat a Nested Loop which finally reduced the rows but also took a lot of time to do so.

Explain plan: https://explain.depesz.com/s/4GYT

My initial test using a ~200M row table [2] revealed a completely different plan with a no-brainer performance of ~10.7 milliseconds vs. the original 6.25 minutes (10x more rows cannot explain that difference).

As the query included a LIMIT, I also started to play around with OFFSET, since I know that at least OFFSET + LIMIT (sorted) rows have to be read, and one of my optimization attempts produced an interesting result: to eliminate the JOIN and the massive amount of loops, I generate a huge UNION ALL query on-the-fly using a function [3] (see the sketch below). Although it performs pretty poorly with no OFFSET, it makes quite a difference with higher ones.

Here are the times, "exec_a" with the plans as generated by PostgreSQL 11.2 and "exec_d" with my UNION ALL version (all times include planning and execution):

 c_offset |    exec_a |    exec_d
----------+-----------+-----------
        0 |    10.694 |   746.892
    10000 |   175.858 |   653.218
   100000 |  1632.205 |   791.913
  1000000 | 11244.091 |  2274.160
  5000000 | 11567.438 |  9428.352
 10000000 | 13442.229 | 17026.783

Complete plans for all executions here:

exec_a: https://explain.depesz.com/s/Ck1
exec_d: https://explain.depesz.com/s/ZoUu

A retest after upgrading to PostgreSQL 11.3 and adding another 200M rows revealed yet another set of numbers:

 c_offset |    exec_a | exec_a_x2 |    exec_d | exec_d_x2
----------+-----------+-----------+-----------+-----------
        0 |    10.694 |    16.616 |   746.892 |   630.440
    10000 |   175.858 |   182.922 |   653.218 |   646.173
   100000 |  1632.205 |  1682.033 |   791.913 |   782.874
  1000000 | 11244.091 | 24781.706 |  2274.160 |  2306.577
  5000000 | 11567.438 | 24798.120 |  9428.352 |  8886.781
 10000000 | 13442.229 | 27315.650 | 17026.783 | 16808.223

One major difference in the "exec_a" plans is that starting with an OFFSET of 1000000, the planner switches from a "Merge Append" + "Nested Loop" to a "Parallel Append" + "Hash Join" + "Sort" + "Gather Merge", whereas the plans for "exec_d" always remain single-threaded.

My question now is: why can't the optimizer generate a plan that, in this case, does 114 loops of "events" scans instead of a million loops on "subscription_signal"? There even is an index spanning both relevant columns here (see [2]), and it is used extensively in my UNION variant (as intended and expected).

I also observed that while the Parallel Append is going to be faster eventually due to better I/O scalability (at least on my system, which uses a separate SSD for the log and different index/data tablespaces), it saturates a lot of CPU cores, causes a lot more I/O in general, and includes the bottleneck of per-worker disk sorts. From the perspective of system resources this is not really helpful, and it doesn't seem to bring much benefit in my case anyway, as the Parallel Append saves just ~10-20% (for OFFSET 1000000) vs. a standard Append (with parallel index/seq scans of the partitions).
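For reference, the on-the-fly UNION ALL query mentioned above has roughly the following shape (a simplified sketch only; the real generator is the function in [3], and names such as "sub_id", "ev_time" and "payload" are placeholders for the actual schema in [2]):

    SELECT u.*
      FROM (
                    SELECT ev_time, payload FROM events WHERE sub_id = 101
          UNION ALL SELECT ev_time, payload FROM events WHERE sub_id = 102
          -- ...one branch per matching "subscription_signal" row
          --    (114 in this case), each one served directly by the
          --    two-column index instead of a join
      ) u
     ORDER BY u.ev_time
     LIMIT 10
    OFFSET 1000000;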
Using a single-threaded approach (to preserve resources for concurrent queries; max_parallel_workers_per_gather = 0), the UNION ALL approach is superior starting at an offset of 100000:

 c_offset |    exec_a |    exec_d
----------+-----------+-----------
        0 |    18.028 |   292.762
    10000 |   188.548 |   308.824
   100000 |  1710.029 |   455.101
  1000000 | 81325.527 |  1993.886
  5000000 | 84206.901 |  8638.194
 10000000 | 84846.488 | 16814.890

One thing that really disturbs me here is the decision of the optimizer to generate an Append + Hash starting with offset 1000000 instead of simply continuing with a Merge Append, which pushes down the limits and returns just 10M intermediate rows, whereas the Append does not (yet) and produces 270M intermediate rows. With enable_hashjoin turned off to force a Merge Append, I get these numbers:

 c_offset |    exec_a |   exec_a_m
----------+-----------+------------
  1000000 | 81325.527 |  16517.566

...but it then degrades further because the planner switches to an Append again (no way to test the Merge Append performance here, I guess):

  5000000 | 84206.901 | 107161.533
 10000000 | 84846.488 | 109368.087

Is there anything I can do about it (apart from my generated huge UNION) to speed things up?

Please note that I'm using the Timescale extension just as a simple way of managing the partitions and indexes, and that I intentionally set the time column to one different from the query filter column so it cannot optimize things away under the hood.

Looking forward to any pointers here.

Cheers,

	Ancoron

Refs:
[1] https://stackoverflow.com/questions/55470713
[2] https://paste.ofcode.org/szj7f7fCSYk7jQNdd5Wvbx
[3] https://paste.ofcode.org/ibZ8fNmNFDrsyxa3NktdWB
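P.S.: In case anyone wants to reproduce the experiments above: all plan variations were obtained with nothing but standard planner GUCs at the session level, roughly like this (the query itself being the one from [1]):

    -- single-threaded runs: no parallel workers for this session
    SET max_parallel_workers_per_gather = 0;

    -- force the planner away from the Append + Hash Join plan
    SET enable_hashjoin = off;

    EXPLAIN (ANALYZE, BUFFERS) SELECT ...;  -- the query from [1]

    RESET enable_hashjoin;
    RESET max_parallel_workers_per_gather;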