Postgresql HA questions

Israel Brewster <israel@xxxxxxxxxxxxxx> · Wed, 23 Sep 2015 08:36:36 -0800

I have a couple of questions regarding setting up and administering a high availability Postgresql "cluster". At the moment the setup I have consists of two servers set up with streaming replication using the named replication slots feature of Postgresql 9.4 (currently running 9.4.1). Let me know if more details are needed about the current setup So, on to my questions:
- Is there any way to do zero-downtime updates to Postgres? At first I was thinking I could just update the secondary, fail over to it, update the primary, and fail back, but then I realized that this would be much more complicated than my first thought, since I would have to either a) find a way to point all my apps that use the database to the secondary while updating, or b) re-ip  the machines. Add to that the difficulty with the fail-back, and it seems like it would be a simpler solution to simply take down the primary server (and all my companies apps) for the (assumedly short) duration of the update.

With my application servers, I have a system set up using corosync and pacemaker that allows for seamless fail-over between the two machines, with the IP address and all services moving smoothly between the two at will. Ideally, I would have a similar setup with my database servers, so the applications never even know that there was a switch. Is this possible with Postgresql at all? Does it make a difference that at least one app has an "always on" connection to the DB Server?

- When using named replication slots, I know the primary server will hold on to as many WAL files as the secondary needs to be able to recover to a consistent state, which runs the risk of filling up the hard drive with WAL files if the secondary is off-line for an extended period of time. As such, I have a system set up to monitor the number of WAL files on both the primary and secondary machines. Watching that, I have on occasion seen the number hit as high as 300 some odd files on the secondary, although it always comes back down fairly quickly (on the day I saw the count up at 378, it was back down to 30 within 10 minutes). At the moment I have 13 WAL files, even though my wall_keep_segments parameter is set to 5.  Interestingly (to me at least) I have not seen this same sort of variation on the primary - it generally stays right around 13 files. Is this sort of fluctuation normal, or should I be looking for some sort of an issue?

-What is the best way to monitor that the two servers are actually staying in sync? At the moment, I am looking at the value of the "active" field in the pg_replication_slots table. Is this sufficient to be sure replication is taking place properly, or should I do something like run a query on a frequently updated table and make sure the results are identical?

Thanks for any suggestions and information you can provide!

-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------

BEGIN:VCARD
VERSION:3.0
N:Brewster;Israel;;;
FN:Israel Brewster
ORG:Frontier Flying Service;MIS
TITLE:PC Support Tech II
EMAIL;type=INTERNET;type=WORK;type=pref:israel@xxxxxxxxxxxxxxxxxx
TEL;type=WORK;type=pref:907-450-7293
item1.ADR;type=WORK;type=pref:;;5245 Airport Industrial Wy;Fairbanks;AK;99701;
item1.X-ABADR:us
CATEGORIES:General
X-ABUID:36305438-95EA-4410-91AB-45D16CABCDDC\:ABPerson
END:VCARD