On Wed, 12 May 2021 at 06:33, Vijaykumar Jain <vijaykumarjain.github@xxxxxxxxx> wrote:
>
> ok i think i just may be there is very less data , hence no index scan, no pruning.
>
> when i try to force seq_scan off,
>
> postgres=# set enable_seqscan TO off;
> SET
> postgres=# explain analyze select * from tprt where tprt.col1 in (select tbl1.col1 from tbl1 where tbl1.col2 in (1, 2) );
>                                                              QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------
>  Merge Semi Join  (cost=0.88..20.98 rows=2 width=4) (actual time=0.031..0.042 rows=2 loops=1)
>    Merge Cond: (tprt.col1 = tbl1.col1)
>    ->  Append  (cost=0.75..28.89 rows=7 width=4) (actual time=0.024..0.032 rows=3 loops=1)
>          ->  Index Only Scan using tprt1_idx on tprt_1  (cost=0.13..8.16 rows=2 width=4) (actual time=0.023..0.024 rows=2 loops=1)
>                Heap Fetches: 0
>          ->  Index Only Scan using tprt2_idx on tprt_2  (cost=0.12..4.14 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=1)
>                Heap Fetches: 0
>          ->  Index Only Scan using tprt3_idx on tprt_3  (cost=0.12..4.14 rows=1 width=4) (never executed)
>                Heap Fetches: 0
>          ->  Index Only Scan using tprt4_idx on tprt_4  (cost=0.12..4.14 rows=1 width=4) (never executed)
>                Heap Fetches: 0
>          ->  Index Only Scan using tprt5_idx on tprt_5  (cost=0.12..4.14 rows=1 width=4) (never executed)
>                Heap Fetches: 0
>          ->  Index Only Scan using tprt6_idx on tprt_6  (cost=0.12..4.14 rows=1 width=4) (never executed)
>                Heap Fetches: 0
>    ->  Index Scan using tbl1_col1_idx on tbl1  (cost=0.13..12.16 rows=2 width=4) (actual time=0.006..0.007 rows=1 loops=1)
>          Filter: (col2 = ANY ('{1,2}'::integer[]))
>          Rows Removed by Filter: 1
>  Planning Time: 0.244 ms
>  Execution Time: 0.067 ms
> (20 rows)

Unfortunately, no run-time pruning occurred in the above plan. The reason you're seeing (never executed) is that the index scan on tbl1_col1_idx ran out of rows before the Append read tuples from all of its children. If there were rows in tbl1 that had join partners in tprt_6, then the Append would have had to churn through all of the partitions to get to the matching row in tprt_6.

The fact that the above plan uses Append is what made skipping those later children possible. MergeAppend would have had to read a tuple from each child node to find the next highest one. Append can be used in this case because the schema design means tprt_1 will always have lower col1 values than tprt_2. Using Append and ensuring the children are in the correct order is a sort of "ordered partition scan".

As for what you're trying to make work: right now, run-time pruning for Append and MergeAppend only happens when there are parameters that change during execution, so the Append/MergeAppend would have to be below either a subquery scan or a nested loop join. At the moment the planner only creates parameterized paths for indexes. To make what you want actually work, we'd likely need to add some concept of parameterized Seq Scans. We'd also need to cost those in a way that takes run-time pruning into account, otherwise they'd just appear as expensive as normal seq scans. It's a bit difficult to decide how many rows to estimate in those cases: if your partitions are unevenly sized, which size do you choose? The sum of the tuples divided by $something, perhaps, but I'm not sure what $something would be.

The current situation is that with your example query you won't get any run-time pruning if the join method used is a Hash or Merge Join.
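As a side note on reading that plan: here's a minimal sketch, not from your setup, of what run-time pruning looks like when it does happen. It just assumes the same tprt table, range partitioned on col1, with an arbitrary parameter value. With an external parameter and a generic plan, pruning happens at executor startup, and EXPLAIN then reports a "Subplans Removed" line under the Append, which is the tell-tale sign, rather than child scans that simply were never reached.

  -- a sketch only: force a generic plan so $1 stays a parameter, then let
  -- executor-startup pruning remove the non-matching partitions.
  -- Look for "Subplans Removed: N" under the Append node in the output.
  SET plan_cache_mode = force_generic_plan;
  PREPARE q (int) AS SELECT * FROM tprt WHERE col1 = $1;
  EXPLAIN (ANALYZE, COSTS OFF) EXECUTE q (42);
  DEALLOCATE q;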
Hash Joins and Merge Joins are not parameterized, and therefore can't be run-time pruned. It might be possible to have a feature that adds run-time pruning to Hash Join: we'd need to run the pruning code each time we put a value into the hash table, and then somehow communicate the matching partitions to the outer side of the join. I'm not really sure we have the executor infrastructure to pass that sort of information between nodes; however, it does not seem impossible. It does sound like a problem that would be hard to cost during planning. Running the partition pruning code once for each row going into the hash table might be a bit costly, and if it prunes nothing then it was a pretty poor investment, so the planner would need to enable it only when the statistics indicate it appears worthwhile.

I don't see how it would be possible to run-time prune during a Merge Join. We've no way to know which partitions to prune unless we read the entire other side of the join first, and if you do that, you've just blown all the advantages of merge join.

I think, for now, the only sure way to get run-time pruning working for this case is to run two separate queries so that the second one can perform plan-time pruning. You might also have some luck by disabling hash and merge joins and having an index on your partition key. You might further increase the chances of the planner picking a nested loop plan with the partitioned table on the inner side if you add DISTINCT to the subquery. That way the planner does not have to consider the aggregate step it would otherwise need in order to swap the join order and put the partitioned table on the inner side; without first making the subquery unique, the semi-joined table cannot go on the outer side, as duplicate records would cause wrong results. (Rough sketches of both approaches follow at the end of this mail.)

Even if you take all those steps, it's still not great: since we have no parameterized seq scans, you'd get a less efficient plan, as you'd have to scan all the data in all partitions using an index scan, which is not very cache efficient. Even with all of that, you're still at the mercy of the planner: it might not put the partitioned table on the inner side of the nested loop, or it might use a normal nested loop rather than a parameterized one. I think if you try to make this work by forcing the planner's hand, you'll just feel pain when the planner one day has a change of heart and decides to swap the join order on you.

David
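For concreteness, here's a rough sketch of the two workarounds mentioned above, assuming the tbl1/tprt schema from earlier in the thread. The values in the second statement are made up purely for illustration, and the SET lines use the ordinary enable_* planner settings.

  -- Workaround 1: split into two queries so the second can use plan-time
  -- pruning.  Suppose the first query happens to return 10 and 20 (made-up
  -- values); writing them as literals lets the planner prune partitions.
  SELECT col1 FROM tbl1 WHERE col2 IN (1, 2);
  SELECT * FROM tprt WHERE col1 IN (10, 20);

  -- Workaround 2: nudge the planner towards a parameterized nested loop
  -- with the partitioned table on the inner side.  The DISTINCT saves the
  -- planner from adding its own unique step in order to swap the join order.
  SET enable_hashjoin = off;
  SET enable_mergejoin = off;
  EXPLAIN ANALYZE
  SELECT *
  FROM tprt
  WHERE tprt.col1 IN (SELECT DISTINCT tbl1.col1
                      FROM tbl1
                      WHERE tbl1.col2 IN (1, 2));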