I ran fsck and re-ran mkfs.ext4 in case something was wrong with the underlying file system, then reran pg_basebackup and got the same problem. It seems to be choking on an archived WAL that was rsync'ed during the nightly vacuum.
2018-09-20 09:48:22 CDT [21017]: [5305-1] user=,db= LOG: restored log file "0000000100007D260000000E" from archive
2018-09-20 09:48:22 CDT [21017]: [5306-1] user=,db= LOG: incorrect resource manager data checksum in record at 7D26/EC04550
2018-09-20 09:48:22 CDT [28296]: [1-1] user=,db= LOG: started streaming WAL from primary at 7D26/E000000 on timeline 1
2018-09-20 09:48:22 CDT [28296]: [2-1] user=,db= FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100007D260000000E has already been removed
I don't know if this is a good test, but I diffed the file from the primary against the file that got rsynced, and they differ:
-bash-4.2$ diff 0000000100007D260000000E 0000000100007D260000000E.sch1-db8
Binary files 0000000100007D260000000E and 0000000100007D260000000E.sch1-db8 differ
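In case a plain diff is too blunt an instrument, a couple of quick follow-up checks I may try (just a sketch, using the same file names as above) to see how much of the segment actually differs:

# compare checksums of the primary's copy and the rsynced copy
md5sum 0000000100007D260000000E 0000000100007D260000000E.sch1-db8

# count how many bytes differ between the two copies
cmp -l 0000000100007D260000000E 0000000100007D260000000E.sch1-db8 | wc -l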
It seems like my archived WAL rsync strategy is flawed? I'm just running a cron job to rsync from the primary server to the replica every 15 minutes:
rsync -az -B 16384 --rsh=ssh ${PGPRIMARYHOST}/archive ${PGBACKUP}/archive
Although it's strange to me: if it's rsyncing a file before writes complete, why are they the same size?
-rw------- 1 postgres postgres 16777216 Sep 20 03:57 0000000100007D260000000E
-bash-4.2$ ls -l 0000000100007D260000000E.sch1-db8
-rw------- 1 postgres postgres 16777216 Sep 20 10:05 0000000100007D260000000E.sch1-db8
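Since WAL segments are preallocated at 16 MB, I suppose matching sizes don't prove the copy was complete anyway. Another check I may try is running pg_xlogdump over each copy to see where the rsynced one stops making sense - a rough sketch, assuming the 9.5 pg_xlogdump binary is on the path (the rsynced copy gets its canonical segment name back in a scratch directory so pg_xlogdump can parse it):

# give the rsynced copy its canonical name in a scratch directory
mkdir -p /tmp/walcheck
cp 0000000100007D260000000E.sch1-db8 /tmp/walcheck/0000000100007D260000000E

# dump WAL records from each copy; the corrupt one should error out at or
# near the record the replica complained about (7D26/EC04550)
pg_xlogdump 0000000100007D260000000E
pg_xlogdump /tmp/walcheck/0000000100007D260000000E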
I will read the docs on how it's advised to copy the WALs and see if I am missing something obvious, but any help is always appreciated :)
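In the meantime, one direction I'm considering (only a sketch - I'm assuming a plain cp-style archive_command on the primary and making up the /var/lib/pgsql/9.5/archive path) is to have the primary archive each segment under a temporary name and rename it into place when complete, so the cron rsync can never pick up a half-written file:

# postgresql.conf on the primary: copy under a temp name, then rename into
# place (rename within the same filesystem is atomic)
archive_command = 'test ! -f /var/lib/pgsql/9.5/archive/%f && cp %p /var/lib/pgsql/9.5/archive/%f.tmp && mv /var/lib/pgsql/9.5/archive/%f.tmp /var/lib/pgsql/9.5/archive/%f'

# and have the cron rsync on the replica skip the temp files
rsync -az -B 16384 --exclude='*.tmp' --rsh=ssh ${PGPRIMARYHOST}/archive ${PGBACKUP}/archive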
Mike
On Wed, Sep 19, 2018 at 9:14 AM Mike Broers <mbroers@xxxxxxxxx> wrote:
A fresh replica using pg_basebackup on the same system generated similar errors:

cp: cannot stat ‘/mnt/backup/pgsql/9.5/archive/00000002.history’: No such file or directory
2018-09-19 08:36:23 CDT [57006]: [179-1] user=,db= LOG: restored log file "0000000100007CFC00000040" from archive
2018-09-19 08:36:23 CDT [57006]: [180-1] user=,db= LOG: incorrect resource manager data checksum in record at 7CFC/405ED198

I am going to run file system checks now. Maybe the backup volume the archived WALs get rsync'ed to has problems and is corrupting on the replay? No checksum alerts on the primary or an additional replica. If I can supply additional info that would help get some advice, please let me know.

Postgres 9.5.14, CentOS 7, ext4 filesystem, Hyper-V VM

On Mon, Sep 10, 2018 at 8:40 AM Mike Broers <mbroers@xxxxxxxxx> wrote:

Well, I've verified my primary backups are working, and I think my plan is to patch to 9.5.14, re-prime a replica in the same environment, and see how it goes, unless someone has an idea of something to check on the host to avoid future corruption...

On Thu, Sep 6, 2018 at 12:28 PM Mike Broers <mbroers@xxxxxxxxx> wrote:

So I have discovered corruption in a Postgres 9.5.12 read replica - yay, checksums!

2018-09-06 12:00:53 CDT [1563]: [4-1] user=postgres,db=production WARNING: page verification failed, calculated checksum 3482 but expected 32232
2018-09-06 12:00:53 CDT [1563]: [5-1] user=postgres,db=production ERROR: invalid page in block 15962 of relation base/16384/464832386

The rest of the log is clean and just has the usual monitoring queries, as this isn't a heavily used db. This corruption isn't occurring on the primary or a second replica, so I'm not freaking out exactly, but I'm not sure how I can further diagnose what the root cause of the corruption might be.

There were no power outages. This is a streaming hot-standby replica that looks like it was connected fine to its primary xlog at the time, and not falling back on rsync'ed WALs or anything. We run off an SSD SAN that is allocated using LVM, and I've noticed documentation that states that can be problematic, but I'm unclear on how to diagnose what might have been the root cause, and now I'm somewhat uncomfortable with this environment's reliability in general.

Does anyone have advice for what to check further to determine a possible root cause? This is a CentOS 7 VM running on Hyper-V.

Thanks for any assistance, greatly appreciated!

Mike