FDW join vs full join push down

Marc Olivé <marc.olive@xxxxxxxx> · Tue, 7 Sep 2021 16:02:03 +0200

Hello all,
We have some PG servers which we merge into a "coordinator" node using FDW and partitioned tables, we partition them by a synthetic "shard_id" field.
There are around 30 tables coordinated this way, with all foreign servers having the same schema structure.

We have some performance issues when joining foreign tables, always done by the same "shard_id", where the major bottleneck is how rows from joined tables are fetched. explain(verbose) shows:

Remote SQL: SELECT entity_id, execution_id, shard_id FROM entity_execution WHERE ((shard_id = 5)) AND (($1::bigint = entity_id))

This way, PG is doing a lot of round trips between the coordinator and the foreign nodes, fetching a single row every time, and we have a very high latency between the coordinator and the nodes.

As the joins are done on the same node, it could send the whole query and fetch all results in a single round trip.

The FDW are configured with 'use_remote_estimate' to true and we have the parameters enable_partition_pruning, enable_partitionwise_aggregate and enable_partitionwise_join activated.
The tables involved can have from a million rows to more than 1000 millions, but the queries usually return a few thousand rows.

A full sample plan and it's query: https://explain.depesz.com/s/TbJy
explain(verbose)
select *
from nlp.note_entity_label nel
join nlp.note_entity ne on ne.note_entity_id = nel.note_entity_id and ne.shard_id = nel.shard_id
join nlp.note_entity_execution nex on nex.note_entity_id = ne.note_entity_id and nex.shard_id = nel.shard_id
where
    nel.label_id = 192
    and nel.shard_id = 5

The row estimates are quite off the true ones, even though we have run 'analyze' on the remote nodes before, and 'use_remote_estimate' is on.
The above query ends in about 6 minutes.

The interesting part is that if we change the 'join' by 'full joins', with some extra filter, the plan is the one we believe is the optimal one, and indeed the query ends in 1 second: https://explain.depesz.com/s/b3As

explain(verbose)
with ents as(
    select nel.note_entity_id nelid, ne.note_entity_id neid, nex.note_entity_id nexid, *
    from nlp.note_entity_label nel
    full join nlp.note_entity ne on ne.note_entity_id = nel.note_entity_id and ne.shard_id = nel.shard_id
    full join nlp.note_entity_execution nex on nex.note_entity_id = ne.note_entity_id and nex.shard_id = nel.shard_id 
    where
        nel.label_id = 192
        and nel.shard_id = 5
)
select *
from ents
where nelid is not null
    and neid is not null
    and nexid is not null
;
Here we can see that the whole query is sent to the fdw and it finishes in a reasonable time.

So, the question is if we can do something to make the fdw send the whole query to the remote nodes when the involved joins use the same partition, or why isn't PG sending it when we use 'inner join'.
We have tried tweaking the "fdw_tuple_cost" , increasing and lowering it to unreasonable values
10, 1000, 100000 and 1000000 without the desired result.

Thanks,