On Jan 19, 2008, at 1:24 PM, Simon Riggs wrote:
So we definitely have missing data. I think the multi-phase rsync is
definitely suspect and should be avoided until we get to the bottom of
this.
By this, do you mean the way we ran the rsync multiple times before
performing the actual base backup? Easy enough.
Can you examine the server logs for that 11 hour period and see if
there
are any other messages that might be relevant on both primary and
standby servers?
On the primary, all relevant log entries are just entries for
successfully logged WALs.
Can you save the WAL files covering that period also. We may want to
inspect them later to confirm whether the data was actually in them or
not. Can you save the very first WAL files after the recovery started?
We can use those to examine the first data blocks touched, which would
help confirm when the gun was fired.
Sure, I still have all of the WALs that were shipped from just before
the beginning of the base back up until about 8 or 9 hours after the
missing WAL file. I'm not sure if I'll be able to get to it today as
I've got lots of household chores to get to, but either tomorrow of
first thing Monday I'll zip them and post them here.
Has rsync produced any messages of note? Might your script be ignoring
errors? What versions of rsync are you using?
Version 2.6.9 on all three servers (1 primary, 2 standbys).
Here's the python script I was using to ship to both servers, right
now I'm back to a direct rsync call for my archive_command to sb1.
What's really weird, is that for the two WALs that disappeared, or
didn't make it, even though the primary's log said they were
successfully archived, they weren't on either server.
#####
#!/usr/bin/env python
from commands import getstatusoutput
import sys
source_path = sys.argv[1]
file_name = sys.argv[2]
# db2
res, out = getstatusoutput("ssh db2 'test -f /mnt/xfer/wal_archive/%
s'" % file_name)
if res != 0:
res, out = getstatusoutput("rsync -a %s db2:/mnt/xfer/wal_archive/"
% source_path)
if res != 0:
sys.exit(res)
# sb1
res, out = getstatusoutput("test -f /mnt/xfer/wal_archive/%s" %
file_name)
if res != 0:
res, out = getstatusoutput("rsync -a %s /mnt/xfer/wal_archive",
source_path)
sys.exit(res)
#####
Is the History table Insert only? We might use that fact to examine
the
LSNs of the equivalent blocks on the Primary. If the LSNs are prior to
the start of the recovery, as noted in the backup label file of the
original base backup, then we can confirm rsync or postgres as the
cause. Do you still have the base backup label file or base backup?
I still have the backup label but not the base backup. The history
tables are about 90-95% insert only, on average, but they do see the
occasional update if a message's delivery status is a bounce. If you
think this would still be an option, I'd be happy to follow whatever
instructions, or supply any files, needed.
Erik Jones
DBA | Emma®
erik@xxxxxxxxxx
800.595.4401 or 615.292.5888
615.292.0777 (fax)
Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend