On Sat, Jun 1, 2019 at 8:25 PM Tom K <tomkcpr@xxxxxxxxx> wrote:
So the best bet will be trying to get through this error then:[ PSQL02 ]
PANIC:replication check point has wrong magic 0 instead of 307747550On Sat, Jun 1, 2019 at 8:21 PM Tom K <tomkcpr@xxxxxxxxx> wrote:On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver <adrian.klaver@xxxxxxxxxxx> wrote:On 6/1/19 4:22 PM, Tom K wrote:
>
>
>
> Looks like this crash was far more catastrophic then I thought. By the
> looks of things, thinking on psql02 would be my best bet.
>
The more I look at it the more I think the replication was not doing
what you thought it was doing. That psql02 was the primary and that
psql01 and psql03 where out of sync and/or defunct standbys.Now that I look at the files myself, that's the conclusion I was coming to myself. Sample config:[root@psql02 base]# cat /etc/patroni.yml
scope: postgres
namespace: /db/
name: postgresql1
restapi:
listen: 192.168.0.124:8008
connect_address: 192.168.0.124:8008
etcd:
host: 192.168.0.124:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
postgresql:
use_pg_rewind: true
initdb:
- encoding: UTF8
- data-checksums
pg_hba:
- host replication replicator 127.0.0.1/32 md5
- host replication replicator 192.168.0.108/0 md5
- host replication replicator 192.168.0.124/0 md5
- host replication replicator 192.168.0.118/0 md5
- host all all 0.0.0.0/0 md5
users:
admin:
password: admin
options:
- createrole
- createdb
postgresql:
listen: 192.168.0.124:5432
bin_dir: /usr/pgsql-10/bin
connect_address: 192.168.0.124:5432
data_dir: /data/patroni
pgpass: /tmp/pgpass
unix_socket_directories: /data/patroni
authentication:
replication:
username: replicator
password: rep-pass
superuser:
username: postgres
password: <SECRET>
parameters:
unix_socket_directories: '.'
tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false
[root@psql02 base]#Or perhaps when the system crashed, the filesystem check simply moved the folders out due to corruption.
Trying what we did above but on the second node:
2019-06-01 20:28:45.305 EDT [21745] LOG: database system was interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-01 20:28:46.142 EDT [21745] DEBUG: primary_slot_name = 'postgresql1'
2019-06-01 20:28:46.142 EDT [21745] DEBUG: standby_mode = 'on'
2019-06-01 20:28:46.142 EDT [21745] DEBUG: recovery_target_timeline = latest
2019-06-01 20:28:46.142 EDT [21745] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-06-01 20:28:46.142 EDT [21745] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-06-01 20:28:46.142 EDT [21745] LOG: entering standby mode
2019-06-01 20:28:46.142 EDT [21745] LOG: invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-01 20:28:46.142 EDT [21745] LOG: invalid primary checkpoint record
2019-06-01 20:28:46.142 EDT [21745] LOG: using previous checkpoint record at 0/4C34EDA8
2019-06-01 20:28:46.142 EDT [21745] DEBUG: redo record is at 0/4C34ED70; shutdown FALSE
2019-06-01 20:28:46.143 EDT [21745] DEBUG: next transaction ID: 0:1409831; next OID: 237578
2019-06-01 20:28:46.143 EDT [21745] DEBUG: next MultiXactId: 48; next MultiXactOffset: 174
2019-06-01 20:28:46.143 EDT [21745] DEBUG: oldest unfrozen transaction ID: 549, in database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: oldest MultiXactId: 1, in database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: commit timestamp Xid oldest/newest: 0/0
2019-06-01 20:28:46.143 EDT [21745] DEBUG: transaction ID wrap limit is 2147484196, limited by database with OID 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: starting up replication slots
2019-06-01 20:28:46.143 EDT [21745] DEBUG: starting up replication origin progress state
2019-06-01 20:28:46.143 EDT [21745] PANIC: replication checkpoint has wrong magic 0 instead of 307747550
2019-06-01 20:28:46.143 EDT [21743] DEBUG: reaping dead processes
2019-06-01 20:28:46.143 EDT [21743] LOG: startup process (PID 21745) was terminated by signal 6: Aborted
2019-06-01 20:28:46.143 EDT [21743] LOG: aborting startup due to startup process failure
2019-06-01 20:28:46.144 EDT [21743] DEBUG: shmem_exit(1): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.144 EDT [21743] DEBUG: shmem_exit(1): 5 on_shmem_exit callbacks to make
2019-06-01 20:28:46.144 EDT [21743] DEBUG: cleaning up dynamic shared memory control segment with ID 222930115
2019-06-01 20:28:46.145 EDT [21743] DEBUG: proc_exit(1): 2 callbacks to make
2019-06-01 20:28:46.146 EDT [21743] LOG: database system is shut down
2019-06-01 20:28:46.146 EDT [21743] DEBUG: exit(1)
2019-06-01 20:28:46.146 EDT [21743] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.146 EDT [21743] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:28:46.146 EDT [21743] DEBUG: proc_exit(-1): 0 callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: logger shutting down
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: proc_exit(0): 0 callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: exit(0)
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: proc_exit(-1): 0 callbacks to make
^C
-bash-4.2$
2019-06-01 20:28:46.142 EDT [21745] DEBUG: primary_slot_name = 'postgresql1'
2019-06-01 20:28:46.142 EDT [21745] DEBUG: standby_mode = 'on'
2019-06-01 20:28:46.142 EDT [21745] DEBUG: recovery_target_timeline = latest
2019-06-01 20:28:46.142 EDT [21745] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-06-01 20:28:46.142 EDT [21745] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-06-01 20:28:46.142 EDT [21745] LOG: entering standby mode
2019-06-01 20:28:46.142 EDT [21745] LOG: invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-01 20:28:46.142 EDT [21745] LOG: invalid primary checkpoint record
2019-06-01 20:28:46.142 EDT [21745] LOG: using previous checkpoint record at 0/4C34EDA8
2019-06-01 20:28:46.142 EDT [21745] DEBUG: redo record is at 0/4C34ED70; shutdown FALSE
2019-06-01 20:28:46.143 EDT [21745] DEBUG: next transaction ID: 0:1409831; next OID: 237578
2019-06-01 20:28:46.143 EDT [21745] DEBUG: next MultiXactId: 48; next MultiXactOffset: 174
2019-06-01 20:28:46.143 EDT [21745] DEBUG: oldest unfrozen transaction ID: 549, in database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: oldest MultiXactId: 1, in database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: commit timestamp Xid oldest/newest: 0/0
2019-06-01 20:28:46.143 EDT [21745] DEBUG: transaction ID wrap limit is 2147484196, limited by database with OID 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: starting up replication slots
2019-06-01 20:28:46.143 EDT [21745] DEBUG: starting up replication origin progress state
2019-06-01 20:28:46.143 EDT [21745] PANIC: replication checkpoint has wrong magic 0 instead of 307747550
2019-06-01 20:28:46.143 EDT [21743] DEBUG: reaping dead processes
2019-06-01 20:28:46.143 EDT [21743] LOG: startup process (PID 21745) was terminated by signal 6: Aborted
2019-06-01 20:28:46.143 EDT [21743] LOG: aborting startup due to startup process failure
2019-06-01 20:28:46.144 EDT [21743] DEBUG: shmem_exit(1): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.144 EDT [21743] DEBUG: shmem_exit(1): 5 on_shmem_exit callbacks to make
2019-06-01 20:28:46.144 EDT [21743] DEBUG: cleaning up dynamic shared memory control segment with ID 222930115
2019-06-01 20:28:46.145 EDT [21743] DEBUG: proc_exit(1): 2 callbacks to make
2019-06-01 20:28:46.146 EDT [21743] LOG: database system is shut down
2019-06-01 20:28:46.146 EDT [21743] DEBUG: exit(1)
2019-06-01 20:28:46.146 EDT [21743] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.146 EDT [21743] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:28:46.146 EDT [21743] DEBUG: proc_exit(-1): 0 callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: logger shutting down
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: proc_exit(0): 0 callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: exit(0)
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: proc_exit(-1): 0 callbacks to make
^C
-bash-4.2$
Removing the recovery.conf and restarting yields the following messages:
2019-06-01 20:31:12.231 EDT [21910] LOG: database system was interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-01 20:31:12.505 EDT [21910] LOG: invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-01 20:31:12.505 EDT [21910] LOG: invalid primary checkpoint record
2019-06-01 20:31:12.505 EDT [21910] LOG: using previous checkpoint record at 0/4C34EDA8
2019-06-01 20:31:12.505 EDT [21910] DEBUG: redo record is at 0/4C34ED70; shutdown FALSE
2019-06-01 20:31:12.505 EDT [21910] DEBUG: next transaction ID: 0:1409831; next OID: 237578
2019-06-01 20:31:12.505 EDT [21910] DEBUG: next MultiXactId: 48; next MultiXactOffset: 174
2019-06-01 20:31:12.505 EDT [21910] DEBUG: oldest unfrozen transaction ID: 549, in database 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: oldest MultiXactId: 1, in database 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: commit timestamp Xid oldest/newest: 0/0
2019-06-01 20:31:12.505 EDT [21910] DEBUG: transaction ID wrap limit is 2147484196, limited by database with OID 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: starting up replication slots
2019-06-01 20:31:12.505 EDT [21910] DEBUG: starting up replication origin progress state
2019-06-01 20:31:12.505 EDT [21910] PANIC: replication checkpoint has wrong magic 0 instead of 307747550
2019-06-01 20:31:12.506 EDT [21908] DEBUG: reaping dead processes
2019-06-01 20:31:12.506 EDT [21908] LOG: startup process (PID 21910) was terminated by signal 6: Aborted
2019-06-01 20:31:12.506 EDT [21908] LOG: aborting startup due to startup process failure
2019-06-01 20:31:12.506 EDT [21908] DEBUG: shmem_exit(1): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.506 EDT [21908] DEBUG: shmem_exit(1): 5 on_shmem_exit callbacks to make
2019-06-01 20:31:12.506 EDT [21908] DEBUG: cleaning up dynamic shared memory control segment with ID 976986759
2019-06-01 20:31:12.508 EDT [21908] DEBUG: proc_exit(1): 2 callbacks to make
2019-06-01 20:31:12.508 EDT [21908] LOG: database system is shut down
2019-06-01 20:31:12.508 EDT [21908] DEBUG: exit(1)
2019-06-01 20:31:12.508 EDT [21908] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.508 EDT [21908] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:31:12.508 EDT [21908] DEBUG: proc_exit(-1): 0 callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: logger shutting down
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: proc_exit(0): 0 callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: exit(0)
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: proc_exit(-1): 0 callbacks to make
2019-06-01 20:31:12.505 EDT [21910] LOG: invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-01 20:31:12.505 EDT [21910] LOG: invalid primary checkpoint record
2019-06-01 20:31:12.505 EDT [21910] LOG: using previous checkpoint record at 0/4C34EDA8
2019-06-01 20:31:12.505 EDT [21910] DEBUG: redo record is at 0/4C34ED70; shutdown FALSE
2019-06-01 20:31:12.505 EDT [21910] DEBUG: next transaction ID: 0:1409831; next OID: 237578
2019-06-01 20:31:12.505 EDT [21910] DEBUG: next MultiXactId: 48; next MultiXactOffset: 174
2019-06-01 20:31:12.505 EDT [21910] DEBUG: oldest unfrozen transaction ID: 549, in database 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: oldest MultiXactId: 1, in database 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: commit timestamp Xid oldest/newest: 0/0
2019-06-01 20:31:12.505 EDT [21910] DEBUG: transaction ID wrap limit is 2147484196, limited by database with OID 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: starting up replication slots
2019-06-01 20:31:12.505 EDT [21910] DEBUG: starting up replication origin progress state
2019-06-01 20:31:12.505 EDT [21910] PANIC: replication checkpoint has wrong magic 0 instead of 307747550
2019-06-01 20:31:12.506 EDT [21908] DEBUG: reaping dead processes
2019-06-01 20:31:12.506 EDT [21908] LOG: startup process (PID 21910) was terminated by signal 6: Aborted
2019-06-01 20:31:12.506 EDT [21908] LOG: aborting startup due to startup process failure
2019-06-01 20:31:12.506 EDT [21908] DEBUG: shmem_exit(1): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.506 EDT [21908] DEBUG: shmem_exit(1): 5 on_shmem_exit callbacks to make
2019-06-01 20:31:12.506 EDT [21908] DEBUG: cleaning up dynamic shared memory control segment with ID 976986759
2019-06-01 20:31:12.508 EDT [21908] DEBUG: proc_exit(1): 2 callbacks to make
2019-06-01 20:31:12.508 EDT [21908] LOG: database system is shut down
2019-06-01 20:31:12.508 EDT [21908] DEBUG: exit(1)
2019-06-01 20:31:12.508 EDT [21908] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.508 EDT [21908] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:31:12.508 EDT [21908] DEBUG: proc_exit(-1): 0 callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: logger shutting down
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: proc_exit(0): 0 callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: exit(0)
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: proc_exit(-1): 0 callbacks to make
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx