Hi,
We are facing a problem setting up a database backup for our production server using pg_standby.
We are using Postgres 8.3.5 in our production environment. We have an active user base of 2 million and need to support 5000 TPS. The service needs to be up and running 24x7. The production database is currently about 200GB, growing, and is expected to reach 1TB in a couple of months.
In the past we have tried various strategies for database backup, including pg_dump and warm backup using WAL shipping.
But our current needs demand a hot backup which can be set up without stopping the database server. For this we have evaluated pg_standby with WAL shipping in a test environment that has continuous db inserts. The setup works fine and the slave (pg_standby) is able to consume WAL logs generated by the master.
We tried to replicate the scenario of the master db going down in this test environment in the following two ways:
a. Killing the master postgres server (using kill command).
b. Properly stopping the master postgres server (using pg_ctl command).
In both of the above cases we saw that, after the slave has consumed all the generated WAL logs and recovery completes, the slave lags the master by a set of records. Additionally, the missing records are not a consecutive set; a seemingly random (non-consecutive) set of records is missing on the slave as compared with the master (verified by bringing the master back up).
Following are the steps that we have used for creating the pg_standby slave:
1. SLAVE: Ensure that postgres is not running on the slave. Stop it if it is running.
Create and INIT a new data directory on the slave (/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/test_data).
2. SLAVE: Ensure that the slave can pull WAL logs from the MASTER. Authorize the SLAVE on the MASTER if needed.
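For reference, since in our setup the slave pulls files from the master over SSH, authorizing the slave simply means key-based SSH access for the postgres user; something along the following lines (the archive path is a placeholder standing in for our actual shared-storage location):

# On the SLAVE, as the postgres user: create a key pair (if none exists)
# and install the public key on the MASTER.
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id postgres@<MASTER_IP>
# Verify password-less access to the master's WAL archive.
ssh postgres@<MASTER_IP> 'ls /usr/local/pgsql/wal_archive | head'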
3. MASTER: WAL archiving should be enabled on the master server:
[/usr/local/pgsql/data/postgresql.conf]
===================================
archive_mode = on # allows archiving to be done
# (change requires restart)
archive_command = '/usr/local/pgsql/bin/verseSQLs/storeWAL.sh %p %f < /dev/null'
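For context, storeWAL.sh is our own wrapper around the usual copy-to-archive step; stripped down it is roughly the following (the archive directory and the flag file that switches storage on/off are placeholders for our actual paths):

[/usr/local/pgsql/bin/verseSQLs/storeWAL.sh : rough sketch]
===================================
#!/bin/sh
WAL_PATH="$1"    # %p : path of the WAL segment, relative to the data dir
WAL_NAME="$2"    # %f : file name of the WAL segment
ARCHIVE_DIR=/usr/local/pgsql/wal_archive                  # assumed shared-storage path
FLAG_FILE=/usr/local/pgsql/bin/verseSQLs/wal_storage_on   # assumed on/off switch

# When WAL storage is switched off, simply report success.
[ -f "$FLAG_FILE" ] || exit 0

# Archive the segment, never overwriting an existing copy.
test ! -f "$ARCHIVE_DIR/$WAL_NAME" && cp "$WAL_PATH" "$ARCHIVE_DIR/$WAL_NAME"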
4. MASTER: Ensure that storeWAL.sh has WAL storage turned off. It will be turned on automatically by the backup_rsync.sh script (see step 6).
5. SLAVE: WAL consumption should be enabled in the slave server's cron. The consume_wals.sh script pulls the WALs placed in the shared storage on the master, so that these WALs can later be consumed on the slave (a rough sketch of the script follows the cron entry below).
#-------------------------------------------------------------------
# [CONSUME WALs from MASTER : VERY CRUCIAL]
#-------------------------------------------------------------------
* * * * * cd /usr/local/pgsql/bin/verseSQLs/consume_wal_scripts && ./consume_wals.sh
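Stripped down, consume_wals.sh on the slave does roughly the following (the directory names are placeholders for our actual locations; the real script also logs and guards against overlapping cron runs):

#-------------------------------------------------------------------
# [consume_wals.sh : rough sketch]
#-------------------------------------------------------------------
#!/bin/sh
MASTER=postgres@<MASTER_IP>
REMOTE_ARCHIVE=/usr/local/pgsql/wal_archive        # assumed shared storage on the master
LOCAL_ARCHIVE=/usr/local/pgsql/wal_archive_slave   # assumed directory polled by pg_standby

# Pull any new WAL segments so that pg_standby (step 7) can restore them.
rsync -az "$MASTER:$REMOTE_ARCHIVE/" "$LOCAL_ARCHIVE/"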
6. MASTER: Run backup_rsync.sh.
It does the following (a rough sketch of the script follows this list):
a. Enables WAL storage in the storeWAL.sh script.
b. Issues a pg_start_backup() to postgres on master.
c. CDs to the data dir of the master and rsyncs the contents of the data directory to the slave:
rsync -avz --delete --exclude=backup_label --exclude=.* --exclude=*.pid --exclude=*.conf --exclude=pg_xlog --exclude=pg_tblspc --exclude=pg_log/* --copy-unsafe-links . postgres@<SLAVE_IP>:/usr/local/pgsql/test_data
d. Issues a pg_stop_backup() to postgres on master.
e. Cleans pg_xlog on slave.
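In script form, backup_rsync.sh is roughly the following (the flag file read by storeWAL.sh is a placeholder for our actual switch):

#-------------------------------------------------------------------
# [backup_rsync.sh : rough sketch]
#-------------------------------------------------------------------
#!/bin/sh
DATA_DIR=/usr/local/pgsql/data
FLAG_FILE=/usr/local/pgsql/bin/verseSQLs/wal_storage_on   # assumed switch read by storeWAL.sh

# a. enable WAL storage so segments generated during the backup are kept
touch "$FLAG_FILE"

# b. mark the start of the base backup
/usr/local/pgsql/bin/psql -U postgres -c "SELECT pg_start_backup('slave_base_backup');"

# c. copy the data directory to the slave (the exact rsync command is shown in c. above)
cd "$DATA_DIR" && rsync -avz --delete --copy-unsafe-links \
    --exclude='backup_label' --exclude='.*' --exclude='*.pid' --exclude='*.conf' \
    --exclude='pg_xlog' --exclude='pg_tblspc' --exclude='pg_log/*' \
    . postgres@<SLAVE_IP>:/usr/local/pgsql/test_data

# d. mark the end of the base backup
/usr/local/pgsql/bin/psql -U postgres -c "SELECT pg_stop_backup();"

# e. clear any old segments from pg_xlog on the slave
ssh postgres@<SLAVE_IP> 'rm -rf /usr/local/pgsql/test_data/pg_xlog/*'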
7. SLAVE: Copy recovery.conf to the data dir and start the postgres server. It starts in STANDBY mode.
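Our recovery.conf is essentially the stock pg_standby invocation (the local archive directory and trigger file path match the consume_wals.sh sketch above):

[/usr/local/pgsql/test_data/recovery.conf]
===================================
restore_command = '/usr/local/pgsql/bin/pg_standby -d -s 2 -t /tmp/pgsql.trigger /usr/local/pgsql/wal_archive_slave %f %p %r 2>> /usr/local/pgsql/pg_standby.log'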
The above steps set up the slave server using pg_standby. Note that throughout this process the master db is up and continuous batch INSERTs are being made into the db by an automated process. Also note that this automated process is the only process interacting with the db; there are no other CRUD operations besides these batch INSERTs.
Further, in order to test the scenario of the master going down, we follow these steps:
8. MASTER: If you want to recreate the scenario of the master db going down, either kill the postgres process or stop the postgres server using pg_ctl.
9. MASTER: Disable the automated process that generates continuous batches of INSERTs in the db.
10. SLAVE: Once you have verified that there are no unconsumed WAL logs left, signal pg_standby that recovery should complete (touch the trigger file expected by the pg_standby command specified in recovery.conf).
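Concretely, step 10 amounts to something like this on the slave (paths match the sketches above):

# newest segment pulled into the local archive vs. the last file pg_standby
# reports restoring in the postgres log -- they should match
ls -t /usr/local/pgsql/wal_archive_slave | head -1

# then tell pg_standby that recovery can finish
touch /tmp/pgsql.trigger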
11. SLAVE: Once pg_standby sees the trigger file, recovery completes and the postgres server comes up in production mode.
12. MASTER: Ensure that storeWAL.sh has WAL storage turned off. Though by now the slave is in non-recovery mode, this step prevents unnecessary consumption of space by WAL storage.
13. MASTER: Query the count of the records produced by the automated process on the master.
14. SLAVE: Query the count of the records as present on the slave.
Compare the results of steps 13 and 14. We observe a difference here.
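For example (the table name is just a stand-in for the one our batch process inserts into):

# run on the MASTER (after bringing it back up) and on the SLAVE, then compare
/usr/local/pgsql/bin/psql -U postgres -d <DB_NAME> -c "SELECT count(*) FROM batch_records;"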
We would appreciate it if you could help us figure out the cause of the difference in the record set and how it can be avoided.
Thanks and regards,
Ujjawal