
Database corruption in cascaded replica: "pg_xact/003A" doesn't exist, reading as zeroes

I believe this is a result of my "broken" procedure for setting up a cascaded replica. I would love to know where the issue is.

This is a report of database corruption that was detected after promoting a replica server to master. We suspect the actual corruption occurred during replication to a two-level cascaded replica; however, it was not detected until after the replica was promoted to master, when pg_dump and pg_restore failed. After re-checking the PostgreSQL logs on the corrupted node, we found the following line, which was the only indicator of a possible issue but had been overlooked: "2018-07-25 13:14:42 UTC LOG: file "pg_xact/003A" doesn't exist, reading as zeroes".

After extensive investigation, we detected three different cases of data corruption:

1.) An issue with pg_toast - it shows up during pg_dump or when a row with corrupted data is selected (a sketch for locating such rows follows this list):
    "pg_dump: Dumping the contents of table failed: PQgetResult() failed.
    pg_dump: Error message from server: ERROR: missing chunk number 2 for toast value 86123528 in pg_toast_41291242"

2.) An issue with uncommitted transactions - it occurs when running VACUUM or pg_dump/pg_restore (a per-table VACUUM sketch for narrowing it down also follows this list):
    "ERROR: uncommitted xmin 60817551 from before xid cutoff 61487222 needs to be frozen
    SQL state: XX001"

3.) Two duplicated rows in different tables, violating their primary-key unique constraints - this surfaces during pg_restore (a duplicate-finding query is sketched below).
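
For reference, one way to locate a row with corrupted TOAST data is to read the suspect table row by row and report every row whose read fails. This is only a sketch of that approach, not our exact queries (we bisected with plain SELECTs); "suspect_table", "id" and "big_column" are placeholder names, not the actual objects in our database:

    DO $$
    DECLARE
        r record;
    BEGIN
        -- Read each row individually; a read that touches a broken TOAST
        -- chunk raises an error, which is trapped and reported here.
        FOR r IN SELECT ctid, id FROM suspect_table ORDER BY id LOOP
            BEGIN
                PERFORM length(big_column::text)
                  FROM suspect_table
                 WHERE ctid = r.ctid;
            EXCEPTION WHEN others THEN
                RAISE NOTICE 'corrupt TOAST data at ctid %, id %', r.ctid, r.id;
            END;
        END LOOP;
    END $$;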
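
Similarly, to narrow down which table raises the uncommitted-xmin error, one option is to vacuum every user table one at a time from psql and note where the errors appear. This is a sketch using psql's \gexec meta-command (available since 9.6), not the exact commands we ran:

    SELECT format('VACUUM %I.%I', schemaname, relname)
      FROM pg_stat_user_tables
     ORDER BY schemaname, relname
    \gexec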
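
And for the duplicated rows, a query along these lines lists primary-key values that appear more than once, together with the physical location (ctid) of each copy; "some_table" and "id" are again placeholders:

    SELECT id, count(*) AS copies, array_agg(ctid) AS row_locations
      FROM some_table
     GROUP BY id
    HAVING count(*) > 1;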

This is how we fixed the three issues:

1.) Using bisection and SELECT, we found the row/column with the corrupted data. We nulled the corrupted column and deleted the row (it was not critical and had no FKs).

2.) We ran VACUUM on all tables to check where this issue occurred, then updated all rows between the reported xmin and the xid cutoff with the same data - this generated a new xmin on the broken rows, which fixed the issue (sketched below).

3.) We removed the duplicated rows in a transaction, disabling all triggers before and re-enabling them right after (sketched below).

4.) Our final step is to do a full dump and restore on the master DB so that the DB is rebuilt.
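
Step 2 - forcing a new xmin onto the affected rows - amounts to a no-op UPDATE restricted to the reported xid range. The statement below is only a sketch with placeholder table/column names; the xid boundaries are the ones from the VACUUM error on our system:

    -- A no-op UPDATE writes a new row version, so each matched row gets a fresh xmin.
    UPDATE some_table
       SET some_column = some_column
     WHERE xmin::text::bigint BETWEEN 60817551 AND 61487222;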
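
Step 3 was, in spirit, something like the following sketch (placeholder table name, example ctid; ALTER TABLE ... DISABLE TRIGGER ALL is just one way of disabling the triggers):

    BEGIN;
    -- Keep FK and user triggers from firing while the extra copy is removed.
    ALTER TABLE some_table DISABLE TRIGGER ALL;
    -- Delete exactly one physical copy of the duplicated row by its ctid.
    DELETE FROM some_table WHERE ctid = '(1234,5)';
    ALTER TABLE some_table ENABLE TRIGGER ALL;
    COMMIT;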

This is the procedure we used to set up the cascaded replica (master -> replica -> replica):

Notes:
Machines are running CentOS 7 and PostgreSQL 10.2
DB-1 = master
DB-2 = replica of DB-1
DB-3 = replica of DB-2

--------------> DB-3 (new cascaded replica)
systemctl disable postgresql-10
systemctl stop postgresql-10
--------------> DB-1 (master)
psql -U postgres
select pg_start_backup('clone',true);
<------------- DB-3 (cascaded replica)
rsync -azv -e --delete DB-2:/isl/pgsql10/data /isl/pgsql10/ --exclude pg_wal --exclude postgresql.pid
assert rsync finished
vi recovery.conf
    standby_mode = 'on'
    primary_conninfo = 'host=DB-2'
    recovery_target_timeline = 'latest'
-------------> DB-1
select pg_stop_backup();
<------------- DB-3
rsync -azv -e --delete DB-2:/isl/pgsql10/data/pg_wal /isl/pgsql10/data/
systemctl start postgresql-10
systemctl enable postgresql-10
tail log file
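
Not part of the procedure above, but as an extra sanity check the following queries, run on DB-3, confirm that the node is in recovery and streaming from its upstream (pg_stat_wal_receiver and these functions exist in PostgreSQL 10):

    SELECT pg_is_in_recovery(),
           pg_last_wal_receive_lsn(),
           pg_last_wal_replay_lsn();
    SELECT status, received_lsn, latest_end_lsn
      FROM pg_stat_wal_receiver;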

Any comments regarding the cascaded replication procedure, or the detection and resolution of the database corruption, are welcome.

Best regards,

Dejan Petrovic
ISL Online





