Re: Backup "Best Practices"

Israel Brewster <israel@xxxxxxxxxxxxxx> · Wed, 23 Nov 2016 10:20:29 -0900

-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------

BEGIN:VCARD
VERSION:3.0
N:Brewster;Israel;;;
FN:Israel Brewster
ORG:Frontier Flying Service;MIS
TITLE:PC Support Tech II
EMAIL;type=INTERNET;type=WORK;type=pref:israel@xxxxxxxxxxxxxxxxxx
TEL;type=WORK;type=pref:907-450-7293
item1.ADR;type=WORK;type=pref:;;5245 Airport Industrial Wy;Fairbanks;AK;99701;
item1.X-ABADR:us
CATEGORIES:General
X-ABUID:36305438-95EA-4410-91AB-45D16CABCDDC\:ABPerson
END:VCARD

On Nov 23, 2016, at 9:42 AM, John R Pierce <pierce@xxxxxxxxxxxx> wrote:

On 11/23/2016 10:16 AM, Israel Brewster wrote:

To mitigate these issues, I am thinking of supplementing the individual dumps with a full base backup and WAL archiving to a NFS share. This should enable (relatively) quick/easy recovery from backup, plus the ability to do PIT Recovery. I do have a few questions with this approach, however:

if you do the base backup periodically (daily? weekly?   monthly? kinda depends on your workload, data volume, and available space on your WAL archive server)

- How do I figure out how often I should take a full base backup? I know this will depend on the amount of traffic my database is doing, and how long I am willing to wait for WAL files to be replayed - the more WAL files needing replayed, the longer recovery will take - but is there some rule of thumb that I can use to calculate how often I need a new base backup? Perhaps based on the number of WAL files?

frequency of doing this is entirely dependent on your data volume, how long the backups take, and your tolerance for restore times. wal archive recoveries are generally quite fast, but of course if there's millions of wal files to recover, that would take a long time.      note, too, 9.6 has some major performance enhancements in how checkpoints are written, which should speed things up overall.

Backups (using pg_basebackup to a tar file, gziped) take about 45 minutes. Part of the goal here is to minimize restore time in a disaster recovery scenario, so *fairly* low tolerance for restore times (say, after getting the base backup in place, another 5 minutes or less to get it up-and-running). My difficulty is that I don't know if that translates into two WAL files or two thousand. But that was really my question - is there some way to know how many WAL files are too many, and by extension how often I need to do a new base backup? Obviously how quickly I get to that point of "too many" WAL files could vary widely, but if I could get some idea of what that line is, or at least how to figure out that line, that would really help with planning.

From what you said, it sounds like I can go for a while, and allow quite a few WAL files to pile up (especially with 9.6, which I will be moving to shortly) between base backups without creating too large a recovery delay. So I'm thinking I may try weekly or even monthly base backups and see how that goes.

- What is the "best" (or just a good) method of keeping the WAL archives under control? Obviously when I do a new basebackup I can "cleanup" any old files that said backup doesn't need, but how do I know what those are?
- Should I be looking at any other backup methods in addition to/instead of the basebackup/WAL archive scheme?

I would keep at least 2 base backups and all wal files since the start of the oldest base backup.    

Good thinking. I'll look into implementing that.

when you create a new base backup, delete the oldest, and all wal archive files up to the start of the following one. 

How do I determine which those are? Just based on the timestamp if the WAL file, such that I could do something like take the timestamp of the last basebackup and delete all WAL files older than that? Or is there a better way?

    the reason for keeping at least two is so you always have a recovery point if something catastrophic happens during the backup process.

note that having base backups plus wal archives allows PiTR too, point in time recovery.   say you discover a bad script updated a ton of bad data into your database last tuesday at 3pm.  you can restore the preceding base backup, then recover up to tuesday just before this event.

Exactly - that's one of the primary reasons I'm taking this approach.

Thanks again for the info!

-- 
john r pierce, recycling bits in santa cruz

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general