Hi,

I recently stumbled across an interesting query performance question over at StackOverflow [1], which caught my attention, so I started to investigate the issue further. One of the things consuming most of the time was an Index Only Scan executed millions of times, and on top of it sat a Nested Loop which finally reduced the rows but also took a lot of time to do so.

Explain plan: https://explain.depesz.com/s/4GYT

My initial test using a ~200M row table [2] revealed a completely different plan with a no-brainer performance of ~10.7 milliseconds vs. the original 6.25 minutes (10x more rows cannot explain that difference).

As the query included a LIMIT, I also started to play around with OFFSET, since I know that at least OFFSET + LIMIT (sorted) rows have to be read, and one of my optimization attempts produced an interesting result: to eliminate the JOIN and the massive amount of loops, I generate a huge UNION ALL query on-the-fly using a function [3] (see the sketch below). Although it performs pretty poorly with no OFFSET, it makes quite a difference with higher ones.

Here are the times, "exec_a" with the plans as generated by PostgreSQL 11.2 and "exec_d" with my UNION ALL version (all times include planning and execution):

 c_offset |    exec_a |    exec_d
----------+-----------+-----------
        0 |    10.694 |   746.892
    10000 |   175.858 |   653.218
   100000 |  1632.205 |   791.913
  1000000 | 11244.091 |  2274.160
  5000000 | 11567.438 |  9428.352
 10000000 | 13442.229 | 17026.783

Complete plans for all executions here:

exec_a: https://explain.depesz.com/s/Ck1
exec_d: https://explain.depesz.com/s/ZoUu

A retest after upgrading to PostgreSQL 11.3 and adding another 200M rows revealed yet another set of numbers:

 c_offset |    exec_a | exec_a_x2 |    exec_d | exec_d_x2
----------+-----------+-----------+-----------+-----------
        0 |    10.694 |    16.616 |   746.892 |   630.440
    10000 |   175.858 |   182.922 |   653.218 |   646.173
   100000 |  1632.205 |  1682.033 |   791.913 |   782.874
  1000000 | 11244.091 | 24781.706 |  2274.160 |  2306.577
  5000000 | 11567.438 | 24798.120 |  9428.352 |  8886.781
 10000000 | 13442.229 | 27315.650 | 17026.783 | 16808.223

One major difference in the "exec_a" plans is that starting with an OFFSET of 1000000, the planner switches from a "Merge Append" + "Nested Loop" to a "Parallel Append" + "Hash Join" + "Sort" + "Gather Merge", whereas the plans for "exec_d" always remain single-threaded.

My question now is: why can't the optimizer generate a plan that, in this case, does 114 loops of "events" scans instead of a million loops on "subscription_signal"? There even is an index spanning both relevant columns here (see [2]), and it is used extensively in my UNION variant (as intended and expected).

I also observed that while the Parallel Append is going to be faster eventually due to better I/O scalability (at least on my system, which uses a separate SSD for the log and different index/data tablespaces), it saturates a lot of CPU cores, causes a lot more I/O in general, and includes the bottleneck of per-worker disk sorts. From the perspective of system resources this is not really helpful, and it doesn't seem to bring much benefit in my case anyway, as the Parallel Append saves just ~10-20% (for OFFSET 1000000) vs. a standard Append (with parallel index/seq scans of the partitions).
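For reference, the on-the-fly UNION ALL query mentioned above has roughly the following shape (a simplified sketch only; the real generator is the function in [3], and names such as "sub_id", "ev_time" and "payload" are placeholders for the actual schema in [2]):

    SELECT u.*
      FROM (
                    SELECT ev_time, payload FROM events WHERE sub_id = 101
          UNION ALL SELECT ev_time, payload FROM events WHERE sub_id = 102
          -- ...one branch per matching "subscription_signal" row
          --    (114 in this case), each one served directly by the
          --    two-column index instead of a join
      ) u
     ORDER BY u.ev_time
     LIMIT 10
    OFFSET 1000000;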
Using a single-threaded approach (to preserve resources for concurrent queries; max_parallel_workers_per_gather = 0), the UNION ALL approach is superior starting at an offset of 100000:

 c_offset |    exec_a |    exec_d
----------+-----------+-----------
        0 |    18.028 |   292.762
    10000 |   188.548 |   308.824
   100000 |  1710.029 |   455.101
  1000000 | 81325.527 |  1993.886
  5000000 | 84206.901 |  8638.194
 10000000 | 84846.488 | 16814.890

One thing that really disturbs me here is the decision of the optimizer to generate an Append + Hash starting with offset 1000000 instead of simply continuing with a Merge Append, which pushes down the limits and returns just 10M intermediate rows, whereas the Append does not (yet) and produces 270M intermediate rows. With enable_hashjoin turned off to force a Merge Append, I get these numbers:

 c_offset |    exec_a |   exec_a_m
----------+-----------+------------
  1000000 | 81325.527 |  16517.566

...but it then degrades further because the planner switches to an Append again (no way to test the Merge Append performance here, I guess):

  5000000 | 84206.901 | 107161.533
 10000000 | 84846.488 | 109368.087

Is there anything I can do about it (apart from my generated huge UNION) to speed things up?

Please note that I'm using the Timescale extension just as a simple way of managing the partitions and indexes, and that I intentionally set the time column to one different from the query filter column so it cannot optimize things away under the hood.

Looking forward to any pointers here.

Cheers,

	Ancoron

Refs:
[1] https://stackoverflow.com/questions/55470713
[2] https://paste.ofcode.org/szj7f7fCSYk7jQNdd5Wvbx
[3] https://paste.ofcode.org/ibZ8fNmNFDrsyxa3NktdWB
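P.S.: In case anyone wants to reproduce the experiments above: all plan variations were obtained with nothing but standard planner GUCs at the session level, roughly like this (the query itself being the one from [1]):

    -- single-threaded runs: no parallel workers for this session
    SET max_parallel_workers_per_gather = 0;

    -- force the planner away from the Append + Hash Join plan
    SET enable_hashjoin = off;

    EXPLAIN (ANALYZE, BUFFERS) SELECT ...;  -- the query from [1]

    RESET enable_hashjoin;
    RESET max_parallel_workers_per_gather;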