FDW, too long to run explain

Vijaykumar Jain <vjain@xxxxxxxxxxxxx> · Mon, 4 Feb 2019 12:44:14 +0530

Hi,
with pg v10.1

I have a setup enabled as below.
7 shards ( 1RW,  2 RO )
they all are fronted by FDW talking to each other.

we use writes directly to shards, and reads via FDW from all shards (RO)
our DB size is ~ 500GB each shard, and tables are huge too.
1 table ~ 200GB, 1 ~55GB, 1 ~40GB and a lot of small tables, but large indices on large table.

the sharding was done based on a key to enable shard isolation at app layer using a fact table.
select id,shard from fact_table;

server resources are,
32GB mem, 8 vcpu, 500GB SSD.

the FDW connect to each other shard via FDW fronted by haproxy -> pgbouncer -> postgresql.
Hope this is good enough background :)

now we have some long running queries via FDW that take minutes and get killed explain runs as idle in transaction on remote servers. (we set use_remote_estimate = true )
when the query is run on individual shards directly, it runs pretty quickly, 
but when run via FDW, it takes very long.
i even altered fetch_sie to 10000, so that in case some filters do not get pushed, those can be applied on the FDW quickly.

but i am lost at the understanding of why explain runs for ever via FDW.
we have a view on remote servers. we import public schema from remote servers, into coordinator custom schema, and then union all

select * from (
select * from sh01.view1
union all
select * from sh01.view1
...
) t where t.foo = 'bar' limit 10;

now the explain for 
select * from sh01.view1  keeps running for minutes sometimes,

then fetch too keeps running for minutes, although the total rows are < 10000 maybe.
idle in transaction | FETCH 10000 FROM c1

we have very aggressive  settings for autovacuum and auto analyze.

autovacuum_naptime = '15s'
autovacuum_vacuum_scale_factor = '0.001'
autovacuum_analyze_scale_factor = '0.005'
log_autovacuum_min_duration = '0'
maintenance_work_mem = '2GB'
autovacuum_vacuum_cost_limit = '5000'
autovacuum_vacuum_cost_delay = '5ms'

other questions:
also, what is the cost of fetch_size?
we have in our settings => use_remote_estimate=true,fetch_size=10000

I mean given we have a query

select * from foobar limit 10000; via FDW
limit 10000 does not get pushed.
so it seems all rows some to FDW node and then limit is applied?

i currently do not have the queries, but i have a screenshot for long running explain via FDW.
also since the whole query does not show up in pg_stat_statement, i am not sure, that would be of great help since predicate although applied, do not show up in pg_stat_activity.

I know, there can be more info i can provide, but  if anyone has experienced this, pls let me know.

BTW, i know citus is an option, but can we keep that option aside.

we see better ways to handle this in future, by sharding on ids and further partitioning of tables and parallel execution of FDW queries, but we need to know if this is a known issue of pg10 or i am doing something wrong which will bite in pg11 too.

Appreciate your help, always.

Regards,Vijay