On 06.11.23 20:26, Matthias Leisi wrote:
Dear all,
I’m reasonably experienced with Postgres with simple (single, „rebuild
and restore“) requirements, but would need some hints on what to look
for in a more complex situation - deploying Postgres as the backend for
a (virtual) appliance.
This appliance can scale horizontally from a single to dozens of VMs
(theoretically more, but most installations are small-ish). It is
feasible to configure VMs for particular purposes (eg „you are [also] a
DB node“), but basically all instances will/should be able to perform
their main tasks besides (also) being a DB node. As the VMs may be
installed in very different environments, network-based solutions are
less feasible and we would prefer a DB-level solution. We assume that
for most cases, primary/stand-by configurations would be sufficient in
terms of availability / latency / throughput.
We must also assume that there is no person who would be able to touch
things if an error occurs. Data consistency and (as much as possible)
automated recovery from error situations („VM down“, „network lost“, …)
are therefor more important than „n nines". We can assume that the VMs
can talk to each other over TCP (eg using SSH tunnels, direct Postgres
connection, or some other suitable protocol). Scripting „around“ the
database is available to initialize instances and for similar tasks.
Would Postgres’ own log-shipping (file-based + streaming replication,
possibly with remote_write) be sufficient for such a set of requirements?
What aspects would you consider important for such a scenario?
The replication that ships with Postgres gives you one writeable primary
server and a number of standbys, but it has no tools to automatically
discover or recover from failure. From
https://www.postgresql.org/docs/current/warm-standby-failover.html:
---
PostgreSQL does not provide the system software required to identify a
failure on the primary and notify the standby database server. Many such
tools exist and are well integrated with the operating system facilities
required for successful failover, such as IP address migration.
Once failover to the standby occurs, there is only a single server in
operation. This is known as a degenerate state. The former standby is
now the primary, but the former primary is down and might stay down. To
return to normal operation, a standby server must be recreated, either
on the former primary system when it comes up, or on a third, possibly
new, system. The pg_rewind utility can be used to speed up this process
on large clusters. Once complete, the primary and standby can be
considered to have switched roles. Some people choose to use a third
server to provide backup for the new primary until the new standby
server is recreated, though clearly this complicates the system
configuration and operational processes.
---
So you need to add additional parts from the ecosystem to detect
failure, handle failover, potentially move an IP address with the
Primary etc. Popular tools are repmgr, Patroni, CloudNativePG, BDR,
pacemaker+corosync and endless others. They will address many of your
requirements, but still some work and understanding is required to make
them fully unattended, as well as have them expand dynamically with new
replicas if an appliance is added.
Postgres is an amazing product and I like to use it for almost
everything, but in this scenario we are of course making our live hard
with ACID compliance. In NoSQL/"eventual consistency" land, there are
products that are a lot friendlier to a setup like this - stuff like
Cassandra, etcd, CouchDb comes to mind. I'd compare the pros and cons of
such alternatives, the big con of course being a lot less consistency
and durability guarantees - the question is whether the application
needs it.
But enough about non-Postgres topics on this list :) To go with
Postgres, on a hunch I'd try Patroni first: it does a lot of the
advanced failover stuff, has a great track record, and supposedly runs
on BSD: https://openports.pl/path/databases/patroni
Cheers
Christian
--
Christian Ramseyer, netnea ag
Network Management. Security. OpenSource.
https://www.netnea.com
Phone: +41 79 644 77 64