On Fri, 5 Jan 2018 13:07:10 -0600
Azimuddin Mohammed <azimeiu@xxxxxxxxx> wrote:

> Hello,
> I am little confused with how HA works in postgres. Reading the article
> which state as below "*If the primary server fails and the standby server
> becomes the new primary, and then the old primary restarts, you must have a
> mechanism for informing the old primary that it is no longer the primary.
> This is sometimes known as STONITH (Shoot The Other Node In The Head),
> which is necessary to avoid situations where both systems think they are
> the primary, which will lead to confusion and ultimately data loss.*
>
> *Many failover systems use just two systems, the primary and the standby,
> connected by some kind of heartbeat mechanism to continually verify the
> connectivity between the two and the viability of the primary. It is also
> possible to use a third system (called a witness server) to prevent some
> cases of inappropriate failover, but the additional complexity might not be
> worthwhile unless it is set up with sufficient care and rigorous testing.*
> *PostgreSQL does not provide the system software required to identify a
> failure on the primary and notify the standby database server. Many such
> tools exist and are well integrated with the operating system facilities
> required for successful failover, such as IP address migration."*
>
> Can someone explain how the HA failback will take place

Failback requires either rebuilding the old master as a standby (rsync,
pg_basebackup, PITR restore, ...) or using pg_rewind to rewind the old master
to a point where it can catch up with the new master.

Some tools try to automate failback using pg_rewind (Patroni, repmgr), but I
have no experience with them.

> and what open source tools we can use to make sure once the primary server
> which failed over to slave will mark itself as slave.

There are many open source tools for building HA around PostgreSQL: repmgr,
Patroni (based on etcd or ZooKeeper), PAF (based on Pacemaker), etc.

You will have to spend a lot of time testing them extensively, understanding
them, picking one, and documenting your cluster.

Regards,
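
For illustration only, here is a rough sketch of what the pg_rewind route for failback might look like when scripted. The data directory path, the host name and the helper function name are hypothetical placeholders, not values from this thread. It only builds and prints the command rather than running it. Note that pg_rewind needs the old master to be shut down cleanly, and the cluster must have been running with wal_log_hints enabled or with data checksums.

```python
import shlex


def build_pg_rewind_cmd(target_pgdata, source_conninfo, dry_run=True):
    """Build the argv for rewinding the old master against the new one.

    target_pgdata:   data directory of the old master (must be stopped)
    source_conninfo: libpq connection string pointing at the new master
    dry_run:         when True, pg_rewind reports what it would do only
    """
    cmd = [
        "pg_rewind",
        "--target-pgdata", target_pgdata,
        "--source-server", source_conninfo,
        "--progress",
    ]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


# Hypothetical values, adjust to your own cluster.
cmd = build_pg_rewind_cmd(
    "/var/lib/pgsql/data",
    "host=new-master.example port=5432 user=postgres dbname=postgres",
)

# Print a shell-safe command line for review before running it by hand.
print(shlex.join(cmd))
```

After a successful rewind, the old master still has to be configured as a standby of the new master before it is started again. How that is done depends on the PostgreSQL version, so check the documentation for the release you run.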