On Wed, 12 May 2021 at 06:33, Vijaykumar Jain <vijaykumarjain.github@xxxxxxxxx> wrote:
>
> ok i think i just may be there is very less data , hence no index scan, no pruning.
>
> when i try to force seq_scan off,
>
> postgres=# set enable_seqscan TO off;
> SET
> postgres=# explain analyze select * from tprt where tprt.col1 in (select tbl1.col1 from tbl1 where tbl1.col2 in (1, 2) );
>                                                              QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------
>  Merge Semi Join  (cost=0.88..20.98 rows=2 width=4) (actual time=0.031..0.042 rows=2 loops=1)
>    Merge Cond: (tprt.col1 = tbl1.col1)
>    ->  Append  (cost=0.75..28.89 rows=7 width=4) (actual time=0.024..0.032 rows=3 loops=1)
>          ->  Index Only Scan using tprt1_idx on tprt_1  (cost=0.13..8.16 rows=2 width=4) (actual time=0.023..0.024 rows=2 loops=1)
>                Heap Fetches: 0
>          ->  Index Only Scan using tprt2_idx on tprt_2  (cost=0.12..4.14 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=1)
>                Heap Fetches: 0
>          ->  Index Only Scan using tprt3_idx on tprt_3  (cost=0.12..4.14 rows=1 width=4) (never executed)
>                Heap Fetches: 0
>          ->  Index Only Scan using tprt4_idx on tprt_4  (cost=0.12..4.14 rows=1 width=4) (never executed)
>                Heap Fetches: 0
>          ->  Index Only Scan using tprt5_idx on tprt_5  (cost=0.12..4.14 rows=1 width=4) (never executed)
>                Heap Fetches: 0
>          ->  Index Only Scan using tprt6_idx on tprt_6  (cost=0.12..4.14 rows=1 width=4) (never executed)
>                Heap Fetches: 0
>    ->  Index Scan using tbl1_col1_idx on tbl1  (cost=0.13..12.16 rows=2 width=4) (actual time=0.006..0.007 rows=1 loops=1)
>          Filter: (col2 = ANY ('{1,2}'::integer[]))
>          Rows Removed by Filter: 1
>  Planning Time: 0.244 ms
>  Execution Time: 0.067 ms
> (20 rows)

Unfortunately, no run-time pruning occurred in the above plan. The reason you're seeing (never executed) is that the index scan on tbl1_col1_idx ran out of rows before the Append read tuples from all of its children. If there were rows in tbl1 that had join partners in tprt_6, then the Append would have had to churn through all of the partitions to get to the matching row in tprt_6.

The fact that the above plan uses Append is what made skipping those later children possible. MergeAppend would have had to read a tuple from each child node to find the next highest one. Append can be used in this case because the schema design means tprt_1 will always have lower col1 values than tprt_2. Using Append and ensuring the children are in the correct order is a sort of "ordered partition scan".

As for what you're trying to make work: right now, run-time pruning for Append and MergeAppend only happens when there are parameters that change during execution, so the Append/MergeAppend would have to be below either a subquery scan or a nested loop join. At the moment the planner only creates parameterized paths for indexes. To make what you want actually work, we'd likely need to add some concept of parameterized Seq Scans. We'd also need to cost those in a way that takes run-time pruning into account, otherwise they'd just appear as expensive as normal seq scans. It's a bit difficult to decide how many rows to estimate in those cases: if your partitions are unevenly sized, which size do you choose? The sum of the tuples divided by $something, perhaps, but I'm not sure what $something would be.

The current situation is that with your example query you won't get any run-time pruning if the join method used is a Hash or Merge Join.
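As a side note on reading that plan: here's a minimal sketch, not from your setup, of what run-time pruning looks like when it does happen. It just assumes the same tprt table, range partitioned on col1, with an arbitrary parameter value. With an external parameter and a generic plan, pruning happens at executor startup, and EXPLAIN then reports a "Subplans Removed" line under the Append, which is the tell-tale sign, rather than child scans that simply were never reached.

  -- a sketch only: force a generic plan so $1 stays a parameter, then let
  -- executor-startup pruning remove the non-matching partitions.
  -- Look for "Subplans Removed: N" under the Append node in the output.
  SET plan_cache_mode = force_generic_plan;
  PREPARE q (int) AS SELECT * FROM tprt WHERE col1 = $1;
  EXPLAIN (ANALYZE, COSTS OFF) EXECUTE q (42);
  DEALLOCATE q;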
Hash Joins and Merge Joins are not parameterized, and therefore can't be run-time pruned. It might be possible to have a feature that adds run-time pruning to Hash Join: we'd need to run the pruning code each time we put a value into the hash table, and then somehow communicate the matching partitions to the outer side of the join. I'm not really sure we have the executor infrastructure to pass that sort of information between nodes; however, it does not seem impossible. It does sound like a problem that would be hard to cost during planning. Running the partition pruning code once for each row going into the hash table might be a bit costly, and if it prunes nothing then it was a pretty poor investment, so the planner would need to enable it only when the statistics indicate it appears worthwhile.

I don't see how it would be possible to run-time prune during a Merge Join. We've no way to know which partitions to prune unless we read the entire other side of the join first, and if you do that, you've just blown all the advantages of merge join.

I think, for now, the only sure way to get run-time pruning working for this case is to run two separate queries so that the second one can perform plan-time pruning. You might also have some luck by disabling hash and merge joins and having an index on your partition key. You might further increase the chances of the planner picking a nested loop plan with the partitioned table on the inner side if you add DISTINCT to the subquery. That way the planner does not have to consider the aggregate step it would otherwise need in order to swap the join order and put the partitioned table on the inner side; without first making the subquery unique, the semi-joined table cannot go on the outer side, as duplicate records would cause wrong results. (Rough sketches of both approaches follow at the end of this mail.)

Even if you take all those steps, it's still not great: since we have no parameterized seq scans, you'd get a less efficient plan, as you'd have to scan all the data in all partitions using an index scan, which is not very cache efficient. Even with all of that, you're still at the mercy of the planner: it might not put the partitioned table on the inner side of the nested loop, or it might use a normal nested loop rather than a parameterized one. I think if you try to make this work by forcing the planner's hand, you'll just feel pain when the planner one day has a change of heart and decides to swap the join order on you.

David
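For concreteness, here's a rough sketch of the two workarounds mentioned above, assuming the tbl1/tprt schema from earlier in the thread. The values in the second statement are made up purely for illustration, and the SET lines use the ordinary enable_* planner settings.

  -- Workaround 1: split into two queries so the second can use plan-time
  -- pruning.  Suppose the first query happens to return 10 and 20 (made-up
  -- values); writing them as literals lets the planner prune partitions.
  SELECT col1 FROM tbl1 WHERE col2 IN (1, 2);
  SELECT * FROM tprt WHERE col1 IN (10, 20);

  -- Workaround 2: nudge the planner towards a parameterized nested loop
  -- with the partitioned table on the inner side.  The DISTINCT saves the
  -- planner from adding its own unique step in order to swap the join order.
  SET enable_hashjoin = off;
  SET enable_mergejoin = off;
  EXPLAIN ANALYZE
  SELECT *
  FROM tprt
  WHERE tprt.col1 IN (SELECT DISTINCT tbl1.col1
                      FROM tbl1
                      WHERE tbl1.col2 IN (1, 2));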