We're considering a change to our backup strategy to reduce WAN bandwidth and archival storage space. I'm posting a description here both because it may be of value to someone else, and because someone might be able to poke a hole in it before we go ahead with implementation. All servers are Linux (SuSE Enterprise 10). Besides the PITR backups, we have some additional safety nets, which I'll describe up front. These are not changing. We have 72 counties which each have a database server holding the official record for that county. Our application framework passes transaction images captured by our middle tier on each county database server to a process which replicates to four central databases, each of which holds a copy of all counties. This process detects idle time on the servers and uses it to run a sync process to confirm that the central copies contain accurate replicas of the counties, reporting any differences and correcting the central databases to match the counties. It also saves the transaction images to two transaction repositories where they are kept for a little over a year. Our current PostgreSQL backup strategy is: (1) Our archive script copies WAL files to a directory on the database server, using cp to a one directory followed by mv to another (to prevent partial files from being processed). (2) We have a crontab job on each county database server to push WAL files to a backup server in the county using rsync. (3) We have a crontab job on our central backup server to pull the WAL files from the county database servers using rsync. (4) We do a PITR base backup once per week on each county database server, using cpio piped to gzip and split, saving the backup on the database server. (5) We have a crontab job on each county database server to push base backup files to a backup server in the county using rsync. (6) We have a crontab job on our central backup server to pull the base backup files from the county database servers using rsync. (7) We have one warm standby per county on our central backup server which applies the WAL files as they arrive. These are monitored to ensure that they are running, in recovery mode, and are reasonably current -- to detect problems with transfer of the WAL files. (8) We have a crontab job which notices the arrival of a new base backup, and uses it for a new basis for the related warm standby, staring that automatically in place of the prior one. (9) We have crontab scripts which detect when we have received a new base backup from a county which has successfully come up in warm standby mode and is the first base backup for that county for the month. It captures that base backup, and the WAL files required to start it (based on the .backup file) to a mirrored SAN archive volume, for one year retention. This runs well with little human intervention, but we're having problems with both the WAN network bandwidth and the space needed for the monthly archives. Our solution is to abandon compression, instead using hard links and rsync. The steps involving WAL files would not change, but most things involving PITR base backups would do a recursive hard link copy of a the previous backup, followed by an rsync from the new backup image. We should get our WAN traffic for base backups down to a fraction of its current volume, and the hard links look to save considerably more space than the gzip compression. For (4) we will use 'cp -rl' to copy the previous backup to a new directory, then rsync (with --delete --include='/pg_xlog/archive_status/' --exclude='/pg_xlog/*') from the production database to the hard link copy. For (5) we'll do a hard link copy on the backup server before the rsync of the directory tree for the base backup. For (6) we will do a hard link recursive copy from the previous backup, rsync from the warm standby, then rsync from the county backup. We expect to be bringing back a pretty minimal delta change set based on the differences between the warm standby and the recently completed backup of the county database, especially since these will normally be running off-hours on a weekend. For (8) we will do a full copy (no links) from the new backup image to get the starting point for the new warm standby. For (9) we will do a recursive hard link copy of the previous month's base backup and rsync the new month's backup onto it. We realize that if one of the linked files is damaged, it can affect some or all archival backups for that county. That has been deemed an acceptable risk. Comments? -Kevin -- Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-admin