Search Postgresql Archives

Logical replication fails when adding multiple replicas

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello list,


We’re having some issues with Postgresql’s logical replication. Specifically trying to add several replicas at once. Essentially we can add replicas one at a time, but when we try and add two or more together some of the table subscriptions (as described in pg_subscription_rel) fail to get to ‘STATE_READY’ (‘r’). 

Here’s a bit more detail.


How the problem manifests/what we observe

When we try to add several replicas at the same time most of them have at least some tables/subscriptions (in the pg_subscription_rel) that get stuck in STATE_FINISHEDCOPY and never reach STATE_READY. The tables have the right number of rows, but the logs show postgres is in a constant cycle of errors like: 


2023-03-18 16:00:19.974 UTC [4207] LOG:  logical replication table synchronization worker for subscription "polling_stations_01c7ee00f17021f94", table "uk_geo_utils_onspd" has started

2023-03-18 16:00:19.999 UTC [4207] ERROR:  could not start WAL streaming: ERROR:  replication slot "pg_37982_sync_37918_7210774007126708177" does not exist


This is more likely to happen with larger tables, but can happen with any we are replicating. 


An example of the logs we’re seeing on the server at the same time are: 


2023-03-18 16:00:20 UTC:172.31.14.73(49884):postgres@polling_stations:[5952]:LOG: 00000: received replication command: START_REPLICATION SLOT "pg_37982_sync_37702_7210774007126708177" LOGICAL 187/E00020E8 (proto_version '2', streaming 'on', publication_names '"alltables"', binary 'true')

2023-03-18 16:00:20 UTC:172.31.14.73(49884):postgres@polling_stations:[5952]:LOCATION: exec_replication_command, walsender.c:1683

2023-03-18 16:00:20 UTC:172.31.14.73(49884):postgres@polling_stations:[5952]:STATEMENT: START_REPLICATION SLOT "pg_37982_sync_37702_7210774007126708177" LOGICAL 187/E00020E8 (proto_version '2', streaming 'on', publication_names '"alltables"', binary 'true')

2023-03-18 16:00:20 UTC:172.31.14.73(49884):postgres@polling_stations:[5952]:ERROR: 42704: replication slot "pg_37982_sync_37702_7210774007126708177" does not exist

2023-03-18 16:00:20 UTC:172.31.14.73(49884):postgres@polling_stations:[5952]:LOCATION: ReplicationSlotAcquire, slot.c:402


When we add a single replica at a time, the replica starts up without any problems. 


What we’ve tried


Essentially we have tried fiddling with lots of different conf settings. I’ve included these below showing the setting, the value we’re using, and a brief sentence about why. 


What we think is going on

We don’t know - hence the question :). However it is something resulting from concurrent replication, and as such we assume it’s to do with how we have configured the primary, or how Postgres handles logical replication, rather than an issue with how we’re configuring the subscriber.


Some general context and observations


I’d be really grateful for any pointers, or even similar experiences as we’ve been banging our heads against this for a few days now! 


Many thanks

Will


Publisher settings


Max_replication_slots: 100

We don’t expect to see more than this, and haven’t ever seen anything more than ~30 when scaling up to 10 instances (2 table sync workers per instance and one slot made as part of the subscription)


Max_connections: 100

Not expecting more than this to the RDS, have never seen problems with too many connections


Work_mem: 512

Make sure we don’t run out of this. System memory isn’t filling up at all


Shared_buffers: {DBInstanceClassMemory/20480}

40% of memory



Max_logical_replication_workers: 200,

Max_replication_slots: 400,

Max_worker_processes: 200

Just a higher number than we expect to get to make sure this isn’t what we’re hitting


Max_slot_wal_keep_size: 5000

Satey to prevent the RDS from filling up with unsynced WALs. We’ve never hit this limit, mainly because we don’t write a lot at all, and instances clean up their slots on shutdown.


Max_wal_size: 1028
(default postgres value, not sure what the RDS default is)


Rds.logical_replication: 1

Enables replication in RDS. RDS only setting


Shared_preload_libraries

Enables replication in RDS


Checkpoint_timeout: 1200

20 minutes. Set to see if a timeout was happening during the COPY phase (this takes less than 20 minutes)


Deadlock_timeout: 30000

Raised from 1 second to 30 to see if deadlock checking was slowing down or making any race conditions worse.



Log_connections: 1, 

log_disconnections: 1, 

log_duration: 1, 

Log_executor_stats: 1,

Log_lock_waits: 1, 

Log_replication_commands: 1

Log_error_verbosity: verbose, 

RDS debugging



Max_locks_per_transaction: 128

Doubled to see if this was being hit for some reason


 

Max_slot_wal_keep_size: 40000,

Max_wal_senders: 400,

Max_wal_size: 512,

Wal_keep_size: 8192,

Wal_sender_timeout: 12000

Trying to change various things with WAL settings, not really knowing why 🙁



Tcp_keepalives_idle: 60

Thinking the network connection might be getting killed when syncing large tables. Unlikely as it takes about the same time to sync a single instance and that works fine.


Subscriber/Instance settings

Wal_receiver_create_temp_slot: 1

Primary_slot_name: same as subscription

Wal_receiver_timeout: 1200000

(plus more as per https://github.com/DemocracyClub/UK-Polling-Stations/blob/master/deploy/files/scripts/db_replication_runner.sh)







[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux