Minimizing Recovery Time (wal replication)

Bryan Murphy <bmurphy1976@xxxxxxxxx> · Thu, 9 Apr 2009 14:27:41 -0500

I have two hot-spare databases that use wal archiving and continuous
recovery mode.  I want to minimize recovery time when we have to fail
over to one of our hot spares.  Right now, I'm seeing the following
behavior which makes a quick recovery seem problematic:

(1) hot spare applies 70 to 75 wal files (~1.1g) in 2 to 3 min period

(2) hot spare pauses for 15 to 20 minutes, during this period pdflush
consumes 99% IO (iotop).  Dirty (from /proc/meminfo) spikes to ~760mb,
remains at that level for the first 10 minutes, and then slowly ticks
down to 0 for the second 10 minutes.

(3) goto 1

My concern is that if the database has been in recovery mode for some
time, even if it's caught up, if I go live sometime in (1) I can face
a recovery time of upwards of 20 minutes.  We've experienced delays
during fail over in the past (not 20 minutes, but long enough to make
me second guess what we are doing).

I want to better understand what is going on so that I can determine
what I can do (if anything) to minimize down time when we fail over to
one of our hot spares.

Here are my current settings:

postgres (v8.3.7):

shared_buffers = 2GB (15GB total)
effective_cache_size = 12GB (15GB total)
checkpoint_segments = 10
checkpoint_completion_target = 0.7
(other checkpoint/bgwriter settings left at default values)

sysctl:

kernel.shmmax = 2684354560
vm.dirty_background_ratio = 1
vm.dirty_ratio = 5

Thanks,
Bryan

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general