Data consolidation: logical replication design considerations

I'd be grateful for some comments on the advisability of using a large
number of concurrent logical replication publications/subscriptions.
Below I've set out the current environment and a suggested design.
Apologies for the length of this email.

We presently have many hundreds of small databases in a cluster, kept
in separate databases for historical reasons, for security, and for the
ability to scale horizontally in future by using more than one cluster.
The cluster is presently less than half a terabyte in size and runs
very comfortably on Linux on a 96GB RAM, 32-core Intel E5-2620 server
with NVMe disks in a RAID10 configuration.

The individual databases cover a handful of discrete services (although
they have some common data structures) and are of different sizes
depending on client needs. The largest client database is currently
about 7.5GB in size.

We presently use streaming replication, both locally and remotely, to
replicate the cluster, which sits behind PgBouncer. Some settings:

    max_connections: 500
    shared_buffers: 20GB
    work_mem: 15MB

Due to changing client and operational requirements, we need to
aggregate some common data from client databases belonging to the same
organisation into single read-only databases for reporting purposes.
This new requirement is in addition to keeping the client databases in
operation as they are now. Logical replication comes to mind as a
potential fit, specifically the use case of "Consolidating multiple
databases into a single one" mentioned at
https://www.postgresql.org/docs/current/logical-replication.html

Some tests suggest that we can meet the replica identity requirements
for published tables and aggregate the data safely.
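
For illustration, the kind of adjustment we tested looks like the
following (the table and index names here are invented):

    -- Tables published for UPDATE and DELETE need a replica identity;
    -- the primary key is used by default. A table without a primary
    -- key can use a suitable unique index (non-partial, non-deferrable,
    -- on NOT NULL columns) instead:
    ALTER TABLE client_schema.readings
        REPLICA IDENTITY USING INDEX readings_device_id_ts_key;

    -- A table with no usable key can fall back to FULL, at the cost of
    -- logging the whole old row on the publisher:
    ALTER TABLE client_schema.audit_log REPLICA IDENTITY FULL;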

At an overview level, this consolidation might require setting up
logical replication publications in, say, 500 client databases,
aggregating in close to real time into roughly 50 target (aggregation)
databases acting as subscribers. That averages roughly 10 client
databases per aggregation database, though some aggregation databases
would have roughly 50 clients.
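
In outline, the wiring for each client database would be something like
this (all names and connection details are invented; publication names
only need to be unique within a database, so the same name can be
reused in every client database):

    -- in each client database on the first (publishing) instance:
    CREATE PUBLICATION pub_client_tables FOR ALL TABLES;

    -- in the matching aggregation database on the second (subscribing)
    -- instance, one subscription per client database:
    CREATE SUBSCRIPTION sub_client_0001
        CONNECTION 'host=localhost port=5433 dbname=client_0001 user=repl'
        PUBLICATION pub_client_tables;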

I would be grateful for comments on the following design proposals:

* to avoid overloading the existing primary host with many replication
  slots, it would be wise to implement aggregation on another host 

* the new aggregation host should receive streaming replication data
  from the primary on a first PostgreSQL instance, which will also hold
  the logical replication publications in each relevant client
  database. As noted above, there may be ~500 publications

* the new aggregation host would have a second PostgreSQL instance
  serving the aggregation databases, each acting as a logical
  replication subscriber. As noted above, there would be ~50 target
  databases, each with an average of ~10 subscriptions to the first
  PostgreSQL instance.

* that only one subscription per client database is needed (after
  initial synchronisation) to synchronise all tables in that client
  database's schema

* that publications and subscriptions are brought online on a
  per-client-database basis, to reduce the number of replication slots
  required for initial synchronisation (the docs aren't clear about how
  many temporary replication slots may be needed "for the initial data
  synchronisation of pre-existing table data"; see
  https://www.postgresql.org/docs/current/logical-replication-subscription.html);
  a sketch of the slot arithmetic and staged bring-up follows this list

* that a server similar to the one noted above can handle the two
  PostgreSQL instances described, together with ~250 concurrent client
  connections to the second instance to serve client reporting needs.
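
To make the numbers concrete, the settings and staging I have in mind
look roughly like this; the figures are untested first guesses for the
~500 subscription case:

    # first (publishing) instance: one durable slot and one walsender
    # per subscription, plus headroom for the temporary slots used
    # during each client's initial table synchronisation
    max_replication_slots = 520
    max_wal_senders = 520

    # second (subscribing) instance: one apply worker per enabled
    # subscription plus table sync workers, all drawn from the
    # max_worker_processes pool; the subscriber side also consumes
    # max_replication_slots entries for replication origin tracking
    max_logical_replication_workers = 520
    max_worker_processes = 540
    max_sync_workers_per_subscription = 2
    max_replication_slots = 520

Each client would then be staged in individually, along these lines:

    -- create each subscription disabled, then enable it once the
    -- previous client's initial synchronisation has finished, so that
    -- only a few temporary sync slots exist at any one time
    CREATE SUBSCRIPTION sub_client_0001
        CONNECTION 'host=localhost port=5433 dbname=client_0001 user=repl'
        PUBLICATION pub_client_tables
        WITH (enabled = false);

    ALTER SUBSCRIPTION sub_client_0001 ENABLE;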

Thoughts gratefully received,
Rory
