Hello,

We have recently received four new servers to replace our existing replicas of one sharded database. They correspond to four shards, let's name them shard1 - shard4 (each taking roughly 560GB of disk space), all of them running PostgreSQL 9.6.5 on the masters. The new servers are HP ProLiant DL380 G9 machines with Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz, 188GB of ECC RAM and an HP Smart Array P440ar RAID controller. We are running RAID10: the mirroring is done by the controller, and the striping is handled by LVM. uname reports:

Linux shard1 4.4.0-100-generic #123-Ubuntu SMP Thu Nov 2 10:16:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

The replicas being cloned are running PostgreSQL 9.6.6. The cloning process on all shards took roughly 30-32 minutes.

On shards 1 and 2 we got the following errors (taken from shard1):

2017-11-15 13:15:45.767 CET,,,15154,,5a0c2ff1.3b32,1,,2017-11-15 13:15:45 CET,,0,LOG,00000,"database system was interrupted; last known up at 2017-11-15 13:12:19 CET",,,,,,,,,""
2017-11-15 13:15:45.767 CET,,,15154,,5a0c2ff1.3b32,2,,2017-11-15 13:15:45 CET,,0,LOG,00000,"creating missing WAL directory ""pg_xlog/archive_status""",,,,,,,,,""
2017-11-15 13:15:46.506 CET,,,15154,,5a0c2ff1.3b32,3,,2017-11-15 13:15:45 CET,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2017-11-15 13:15:46.666 CET,,,15154,,5a0c2ff1.3b32,4,,2017-11-15 13:15:45 CET,,0,LOG,00000,"restored log file ""000000010000640F000000F4"" from archive",,,,,,,,,""
2017-11-15 13:15:46.673 CET,,,15154,,5a0c2ff1.3b32,5,,2017-11-15 13:15:45 CET,,0,PANIC,XX000,"replication checkpoint has wrong magic 5714534 instead of 307747550",,,,,,,,,""
2017-11-15 13:15:46.674 CET,,,15152,,5a0c2ff1.3b30,2,,2017-11-15 13:15:45 CET,,0,LOG,00000,"startup process (PID 15154) was terminated by signal 6: Aborted",,,,,,,,,""

On the other shards the restore process worked as expected:

2017-11-15 13:16:52.985 CET,,,13901,,5a0c3034.364d,2,,2017-11-15 13:16:52 CET,,0,LOG,00000,"creating missing WAL directory ""pg_xlog/archive_status""",,,,,,,,,""
2017-11-15 13:16:53.790 CET,,,13901,,5a0c3034.364d,3,,2017-11-15 13:16:52 CET,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2017-11-15 13:16:53.968 CET,,,13901,,5a0c3034.364d,4,,2017-11-15 13:16:52 CET,,0,LOG,00000,"restored log file ""000000010000629100000080"" from archive",,,,,,,,,""
2017-11-15 13:16:54.057 CET,,,13901,,5a0c3034.364d,5,,2017-11-15 13:16:52 CET,1/0,0,LOG,00000,"redo starts at 6291/800B6C60",,,,,,,,,""

A second attempt to create the replicas for shards 1 and 2 was successful, producing normal replicas.

The cloning itself is done by copying a compressed image via ssh, running the following command from the replica:

ssh {master} 'cd {master_datadir} && tar -lcp --exclude "*.conf" \
    --exclude "recovery.done" \
    --exclude "pacemaker_instanz" \
    --exclude "dont_start" \
    --exclude "pg_log" \
    --exclude "pg_xlog" \
    --exclude "postmaster.pid" \
    --exclude "recovery.done" \
    * | pigz -1 -p 4' | pigz -d -p 4 | tar -xpmUv -C {slave_datadir}

WAL archiving starts before the copy does, as the script that clones the replica checks that WAL archiving is running before the cloning begins (sketched below). We have cloned hundreds of replicas with this procedure and have never seen any issues, and in particular never the "replication checkpoint has wrong magic" error, so we are wondering what the possible reason behind that failure could be.
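For context, that pre-clone check only has to confirm that WAL archiving is active on the master before the base copy starts. A simplified sketch of such a check, e.g. via pg_stat_archiver, is below; the host name and connection details are placeholders, not the exact script we run:

# Sketch: refuse to start the clone unless WAL archiving is enabled on the
# master and the archiver has already archived at least one segment.
MASTER=shard1-master   # placeholder, filled in by the cloning tooling
ARCHIVING_OK=$(psql -h "$MASTER" -U postgres -At -c \
    "SELECT current_setting('archive_mode') <> 'off'
        AND last_archived_wal IS NOT NULL
     FROM pg_stat_archiver;")

if [ "$ARCHIVING_OK" != "t" ]; then
    echo "WAL archiving does not appear to be running on $MASTER, aborting clone" >&2
    exit 1
fi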
We also saw a disk error on another shard not long after the initial copy (though not on either of the shards that hit the "replication checkpoint" error), so hardware issues are on our list as well -- but then how come both failing shards reported the same wrong value for the "wrong magic"?

--
Sincerely,
Alex