Hello,

We have recently received four new servers to replace our existing replicas of one sharded database. They correspond to four shards, let's name them shard1 - shard4 (each taking roughly 560GB of disk space), all of them running PostgreSQL 9.6.5 on the masters. The new servers are HP ProLiant DL380 G9 machines with Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz, 188GB of ECC RAM and an HP Smart Array P440ar RAID controller. We are running RAID10: the mirroring is done by the controller, and the striping is handled by LVM. uname reports:

Linux shard1 4.4.0-100-generic #123-Ubuntu SMP Thu Nov 2 10:16:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

The replicas being cloned are running PostgreSQL 9.6.6. The cloning process on all shards took roughly 30-32 minutes.

On shards 1 and 2 we got the following errors (taken from shard1):

2017-11-15 13:15:45.767 CET,,,15154,,5a0c2ff1.3b32,1,,2017-11-15 13:15:45 CET,,0,LOG,00000,"database system was interrupted; last known up at 2017-11-15 13:12:19 CET",,,,,,,,,""
2017-11-15 13:15:45.767 CET,,,15154,,5a0c2ff1.3b32,2,,2017-11-15 13:15:45 CET,,0,LOG,00000,"creating missing WAL directory ""pg_xlog/archive_status""",,,,,,,,,""
2017-11-15 13:15:46.506 CET,,,15154,,5a0c2ff1.3b32,3,,2017-11-15 13:15:45 CET,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2017-11-15 13:15:46.666 CET,,,15154,,5a0c2ff1.3b32,4,,2017-11-15 13:15:45 CET,,0,LOG,00000,"restored log file ""000000010000640F000000F4"" from archive",,,,,,,,,""
2017-11-15 13:15:46.673 CET,,,15154,,5a0c2ff1.3b32,5,,2017-11-15 13:15:45 CET,,0,PANIC,XX000,"replication checkpoint has wrong magic 5714534 instead of 307747550",,,,,,,,,""
2017-11-15 13:15:46.674 CET,,,15152,,5a0c2ff1.3b30,2,,2017-11-15 13:15:45 CET,,0,LOG,00000,"startup process (PID 15154) was terminated by signal 6: Aborted",,,,,,,,,""

On the other shards the restore process worked as expected:

2017-11-15 13:16:52.985 CET,,,13901,,5a0c3034.364d,2,,2017-11-15 13:16:52 CET,,0,LOG,00000,"creating missing WAL directory ""pg_xlog/archive_status""",,,,,,,,,""
2017-11-15 13:16:53.790 CET,,,13901,,5a0c3034.364d,3,,2017-11-15 13:16:52 CET,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2017-11-15 13:16:53.968 CET,,,13901,,5a0c3034.364d,4,,2017-11-15 13:16:52 CET,,0,LOG,00000,"restored log file ""000000010000629100000080"" from archive",,,,,,,,,""
2017-11-15 13:16:54.057 CET,,,13901,,5a0c3034.364d,5,,2017-11-15 13:16:52 CET,1/0,0,LOG,00000,"redo starts at 6291/800B6C60",,,,,,,,,""

A second attempt to create the replicas for shards 1 and 2 was successful, producing normal replicas.

The cloning itself is done by copying a compressed image via ssh, running the following command from the replica:

ssh {master} 'cd {master_datadir} && tar -lcp --exclude "*.conf" \
    --exclude "recovery.done" \
    --exclude "pacemaker_instanz" \
    --exclude "dont_start" \
    --exclude "pg_log" \
    --exclude "pg_xlog" \
    --exclude "postmaster.pid" \
    --exclude "recovery.done" \
    * | pigz -1 -p 4' | pigz -d -p 4 | tar -xpmUv -C {slave_datadir}

WAL archiving starts before the copy does, as the script that clones the replica checks that WAL archiving is running before the cloning begins (sketched below). We have cloned hundreds of replicas with this procedure and have never seen any issues, and in particular never the "replication checkpoint has wrong magic" error, so we are wondering what the possible reason behind that failure could be.
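For context, that pre-clone check only has to confirm that WAL archiving is active on the master before the base copy starts. A simplified sketch of such a check, e.g. via pg_stat_archiver, is below; the host name and connection details are placeholders, not the exact script we run:

# Sketch: refuse to start the clone unless WAL archiving is enabled on the
# master and the archiver has already archived at least one segment.
MASTER=shard1-master   # placeholder, filled in by the cloning tooling
ARCHIVING_OK=$(psql -h "$MASTER" -U postgres -At -c \
    "SELECT current_setting('archive_mode') <> 'off'
        AND last_archived_wal IS NOT NULL
     FROM pg_stat_archiver;")

if [ "$ARCHIVING_OK" != "t" ]; then
    echo "WAL archiving does not appear to be running on $MASTER, aborting clone" >&2
    exit 1
fi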
We also saw a disk error on another shard not long after the initial copy (though not on either of the shards that hit the "replication checkpoint" error), so hardware issues are on our list as well -- but then how come both failing shards reported the same wrong value for the "wrong magic"?

--
Sincerely,
Alex