Re: psql: FATAL: the database system is starting up

Tom K <tomkcpr@xxxxxxxxx> · Sat, 1 Jun 2019 20:32:46 -0400

On Sat, Jun 1, 2019 at 8:25 PM Tom K <tomkcpr@xxxxxxxxx> wrote:
So the best bet will be trying to get through this error then:
[ PSQL02 ]
PANIC:replication check point has wrong magic 0 instead of  307747550 

On Sat, Jun 1, 2019 at 8:21 PM Tom K <tomkcpr@xxxxxxxxx> wrote:

On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver <adrian.klaver@xxxxxxxxxxx> wrote:
On 6/1/19 4:22 PM, Tom K wrote:

> 

> 

> 

> Looks like this crash was far more catastrophic then I thought.  By the 

> looks of things, thinking on psql02 would be my best bet.

> 

The more I look at it the more I think the replication was not doing 

what you thought it was doing. That psql02 was the primary and that 

psql01 and psql03 where out of sync and/or defunct standbys.

Now that I look at the files myself, that's the conclusion I was coming to myself.  Sample config:

[root@psql02 base]# cat /etc/patroni.yml
scope: postgres
namespace: /db/
name: postgresql1

restapi:
    listen: 192.168.0.124:8008
    connect_address: 192.168.0.124:8008

etcd:
    host: 192.168.0.124:2379

bootstrap:
    dcs:
        ttl: 30
        loop_wait: 10
        retry_timeout: 10
        maximum_lag_on_failover: 1048576
        postgresql:
            use_pg_rewind: true

    initdb:
    - encoding: UTF8
    - data-checksums

    pg_hba:
    - host replication replicator 127.0.0.1/32 md5
    - host replication replicator 192.168.0.108/0 md5
    - host replication replicator 192.168.0.124/0 md5
    - host replication replicator 192.168.0.118/0 md5
    - host all all 0.0.0.0/0 md5

    users:
        admin:
            password: admin
            options:
                - createrole
                - createdb

postgresql:
    listen: 192.168.0.124:5432
    bin_dir: /usr/pgsql-10/bin
    connect_address: 192.168.0.124:5432
    data_dir: /data/patroni
    pgpass: /tmp/pgpass
    unix_socket_directories: /data/patroni
    authentication:
        replication:
            username: replicator
            password: rep-pass
        superuser:
            username: postgres
            password: <SECRET>
    parameters:
        unix_socket_directories: '.'

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false
[root@psql02 base]#

Or perhaps when the system crashed, the filesystem check simply moved the folders out due to corruption.  

Trying what we did above but on the second node:

2019-06-01 20:28:45.305 EDT [21745] LOG:  database system was interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-01 20:28:46.142 EDT [21745] DEBUG:  primary_slot_name = 'postgresql1'
2019-06-01 20:28:46.142 EDT [21745] DEBUG:  standby_mode = 'on'
2019-06-01 20:28:46.142 EDT [21745] DEBUG:  recovery_target_timeline = latest
2019-06-01 20:28:46.142 EDT [21745] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-06-01 20:28:46.142 EDT [21745] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-06-01 20:28:46.142 EDT [21745] LOG:  entering standby mode
2019-06-01 20:28:46.142 EDT [21745] LOG:  invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-01 20:28:46.142 EDT [21745] LOG:  invalid primary checkpoint record
2019-06-01 20:28:46.142 EDT [21745] LOG:  using previous checkpoint record at 0/4C34EDA8
2019-06-01 20:28:46.142 EDT [21745] DEBUG:  redo record is at 0/4C34ED70; shutdown FALSE
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  next transaction ID: 0:1409831; next OID: 237578
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  next MultiXactId: 48; next MultiXactOffset: 174
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  oldest unfrozen transaction ID: 549, in database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  oldest MultiXactId: 1, in database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  commit timestamp Xid oldest/newest: 0/0
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  transaction ID wrap limit is 2147484196, limited by database with OID 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  starting up replication slots
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  starting up replication origin progress state
2019-06-01 20:28:46.143 EDT [21745] PANIC:  replication checkpoint has wrong magic 0 instead of 307747550
2019-06-01 20:28:46.143 EDT [21743] DEBUG:  reaping dead processes
2019-06-01 20:28:46.143 EDT [21743] LOG:  startup process (PID 21745) was terminated by signal 6: Aborted
2019-06-01 20:28:46.143 EDT [21743] LOG:  aborting startup due to startup process failure
2019-06-01 20:28:46.144 EDT [21743] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.144 EDT [21743] DEBUG:  shmem_exit(1): 5 on_shmem_exit callbacks to make
2019-06-01 20:28:46.144 EDT [21743] DEBUG:  cleaning up dynamic shared memory control segment with ID 222930115
2019-06-01 20:28:46.145 EDT [21743] DEBUG:  proc_exit(1): 2 callbacks to make
2019-06-01 20:28:46.146 EDT [21743] LOG:  database system is shut down
2019-06-01 20:28:46.146 EDT [21743] DEBUG:  exit(1)
2019-06-01 20:28:46.146 EDT [21743] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.146 EDT [21743] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:28:46.146 EDT [21743] DEBUG:  proc_exit(-1): 0 callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  logger shutting down
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  proc_exit(0): 0 callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  exit(0)
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  proc_exit(-1): 0 callbacks to make
^C
-bash-4.2$

Removing the recovery.conf and restarting yields the following messages:  

2019-06-01 20:31:12.231 EDT [21910] LOG:  database system was interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-01 20:31:12.505 EDT [21910] LOG:  invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-01 20:31:12.505 EDT [21910] LOG:  invalid primary checkpoint record
2019-06-01 20:31:12.505 EDT [21910] LOG:  using previous checkpoint record at 0/4C34EDA8
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  redo record is at 0/4C34ED70; shutdown FALSE
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  next transaction ID: 0:1409831; next OID: 237578
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  next MultiXactId: 48; next MultiXactOffset: 174
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  oldest unfrozen transaction ID: 549, in database 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  oldest MultiXactId: 1, in database 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  commit timestamp Xid oldest/newest: 0/0
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  transaction ID wrap limit is 2147484196, limited by database with OID 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  starting up replication slots
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  starting up replication origin progress state
2019-06-01 20:31:12.505 EDT [21910] PANIC:  replication checkpoint has wrong magic 0 instead of 307747550
2019-06-01 20:31:12.506 EDT [21908] DEBUG:  reaping dead processes
2019-06-01 20:31:12.506 EDT [21908] LOG:  startup process (PID 21910) was terminated by signal 6: Aborted
2019-06-01 20:31:12.506 EDT [21908] LOG:  aborting startup due to startup process failure
2019-06-01 20:31:12.506 EDT [21908] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.506 EDT [21908] DEBUG:  shmem_exit(1): 5 on_shmem_exit callbacks to make
2019-06-01 20:31:12.506 EDT [21908] DEBUG:  cleaning up dynamic shared memory control segment with ID 976986759
2019-06-01 20:31:12.508 EDT [21908] DEBUG:  proc_exit(1): 2 callbacks to make
2019-06-01 20:31:12.508 EDT [21908] LOG:  database system is shut down
2019-06-01 20:31:12.508 EDT [21908] DEBUG:  exit(1)
2019-06-01 20:31:12.508 EDT [21908] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.508 EDT [21908] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:31:12.508 EDT [21908] DEBUG:  proc_exit(-1): 0 callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  logger shutting down
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  proc_exit(0): 0 callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  exit(0)
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  proc_exit(-1): 0 callbacks to make

-- 

Adrian Klaver

adrian.klaver@xxxxxxxxxxx