Hi friends,
We have a relatively large database (~5TB) receiving very frequent INSERTs/UPDATEs, and we’d like to create a logical replica of it. To put things in perspective, we often generate up to 10 WAL segments (16MB each) per second, i.e. up to ~160MB of WAL per second (wal_level=replica).
We successfully set up logical replication in our smaller staging DB, but the scale of production is making things difficult.
---
Here are some approaches we considered:
1. `CREATE SUBSCRIPTION … WITH ( copy_data=true )`
This would be ideal, except we’re afraid of accumulating too much WAL on the publishing server while waiting for ~5TB to copy over. We could theoretically increase the size of the drive to something like 10TB, but that’s not currently an option for us.
There’s also the issue of performance. It’s unclear what impact `copy_data=true` would have on our DB read/write throughput, and we’re hesitant to try it.
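For concreteness, the commands we have in mind look roughly like this (publication, subscription, and connection details are placeholders, not our real setup):
```
-- On the publisher (names and connection string are placeholders):
CREATE PUBLICATION big_db_pub FOR ALL TABLES;

-- On the subscriber: copy_data = true does the initial ~5TB table copy
-- and then streams changes via the slot created for the subscription.
CREATE SUBSCRIPTION big_db_sub
    CONNECTION 'host=publisher.internal dbname=big_db user=replication'
    PUBLICATION big_db_pub
    WITH (copy_data = true);
```
Our worry is that the slot created by `CREATE SUBSCRIPTION` pins WAL on the publisher for the whole duration of that initial copy.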
2. Start from a base backup and catch up with `restore_command`:
`SELECT pg_replication_slot_advance('<slot_name>', pg_current_wal_lsn())` could work nicely with a matching `recovery_target_lsn` in `recovery.conf` (rough sketch after this item).
We already use `archive_command` with `wal-g` to ship WAL and base backups to GCS. Unfortunately, by the time a base backup has been downloaded onto the replica, it’s already at least 6 hours old. And replay via `restore_command` runs at roughly the same speed as `archive_command`, so the replica stays ~5-8 hours behind for as long as it’s in standby.
The postgres process on the replica isn’t limited by CPU, memory, network, or disk speed. It doesn’t seem to be working at full capacity, which suggests a configuration issue.
Please look at our configurations below and let us know if there are any options we could tweak in `postgresql.conf` to speed up the WAL ingestion for `restore_command`.
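To make option 2 concrete, here is a rough sketch of the sequence we have in mind. The slot/publication/subscription names and the LSN are placeholders, and the ordering is our best understanding rather than something we have tested at this scale:
```
-- 1. On the publisher: create the logical slot before taking the base
--    backup, so all WAL from this point on is retained by the slot.
SELECT pg_create_logical_replication_slot('big_db_slot', 'pgoutput');

-- 2. Take (or identify) a base backup that starts after the slot
--    exists, then pick a catch-up point past the end of that backup:
SELECT pg_current_wal_lsn();   -- e.g. '2A5B/1C3D4E5F'

-- 3. Restore the base backup on the replica and recover it with
--    recovery_target_lsn = '2A5B/1C3D4E5F' in recovery.conf.

-- 4. On the publisher: advance the slot to the same LSN so logical
--    decoding starts exactly where physical recovery stopped.
SELECT pg_replication_slot_advance('big_db_slot', '2A5B/1C3D4E5F');

-- 5. On the replica: attach a subscription to the existing slot and
--    publication (from the sketch above), skipping the table copy.
CREATE SUBSCRIPTION big_db_sub
    CONNECTION 'host=publisher.internal dbname=big_db user=replication'
    PUBLICATION big_db_pub
    WITH (copy_data = false, create_slot = false, slot_name = 'big_db_slot');
```
The appeal here is that the ~5TB bulk transfer happens via the base backup instead of through the logical copy.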
3. Start from a base backup and catch up via the `pg_wal` directory:
We could write a script that fetches WAL segments from GCS and drops them directly into the `pg_wal` directory. This is essentially what `restore_command` does anyway, but we might be able to squeeze some extra efficiency out of a custom script. Maybe.
---
Our server has 32 cores and 120GB of memory, with the following selected configs:
```
shared_buffers = 30GB
huge_pages = try
maintenance_work_mem = 1GB
max_files_per_process = 2000
effective_io_concurrency = 100
max_worker_processes = 32
max_parallel_maintenance_workers = 2
max_parallel_workers_per_gather = 0
parallel_leader_participation = on
max_parallel_workers = 32
wal_level = archive
wal_buffers = 16MB
checkpoint_timeout = 30min
max_wal_size = 100GB
min_wal_size = 1GB
checkpoint_completion_target = 0.9
max_wal_senders = 20
wal_keep_segments = 10
effective_cache_size = 90GB
```
Thanks for all the help!
ts