On 5/27/19 9:59 PM, Tom K wrote:
Hey Guys,
I'm running Patroni with PostgreSQL 10, etcd, HAProxy and Keepalived on 3
RHEL 7.6 VMs. Every now and then the underlying storage crashes and takes
out the cluster. On recovery, PostgreSQL usually comes back up while other
databases just blow up. That is, until now.
The above is one master and two standby servers connected via streaming
replication, correct?
Is there WAL archiving set up?
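For reference, if archiving is in play it would show up in postgresql.conf
as something roughly like this (the archive destination below is only a
placeholder, not taken from your setup):

  # postgresql.conf -- present only if WAL archiving is enabled
  archive_mode = on
  # placeholder path; yours would point at your real archive location
  archive_command = 'cp %p /path/to/wal_archive/%f'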
After the most recent crash 2-3 weeks ago, the cluster keeps running into
the messages below, and I'm not able to make heads or tails of why it's
throwing them:
So you have not been able to run the cluster for the past 2-3 weeks, or is
that more recent?
When you refer to history files below are you talking about WAL files or
something else?
Is this:
"recovery command file "recovery.conf" specified neither
primary_conninfo nor restore_command"
true?
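For context, a minimal recovery.conf for a PostgreSQL 10 streaming standby
would normally carry at least one of those two settings, roughly like this
(host, user and paths below are placeholders, not taken from your setup):

  # recovery.conf -- minimal streaming-standby sketch
  standby_mode = 'on'
  primary_conninfo = 'host=<primary-host> port=5432 user=replicator'
  # and/or, when restoring from a WAL archive:
  restore_command = 'cp /path/to/wal_archive/%f %p'

If neither is set, the standby can only sit there polling pg_wal for files
someone else puts in place, which matches the HINT in your logs.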
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf
2019-05-23 08:26:34.561 EDT [10101] LOG:  listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:26:34.565 EDT [10101] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:26:34.591 EDT [10101] LOG:  redirecting log output to logging collector process
2019-05-23 08:26:34.591 EDT [10101] HINT:  Future log output will appear in directory "log".
^C
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
2019-05-23 08:26:45.707 EDT [10105] LOG:  listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:26:45.715 EDT [10105] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:26:45.740 EDT [10105] LOG:  redirecting log output to logging collector process
2019-05-23 08:26:45.740 EDT [10105] HINT:  Future log output will appear in directory "log".
^C
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.1 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
2019-05-23 08:29:36.014 EDT [10188] LOG:  listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:29:36.017 EDT [10188] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:29:36.040 EDT [10188] LOG:  redirecting log output to logging collector process
2019-05-23 08:29:36.040 EDT [10188] HINT:  Future log output will appear in directory "log".
The database logs only print out this:
2019-05-23 08:25:44.476 EDT [10088] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:25:44.477 EDT [10088] LOG:  entering standby mode
2019-05-23 08:25:50.209 EDT [10086] LOG:  received fast shutdown request
2019-05-23 08:25:50.215 EDT [10089] LOG:  shutting down
2019-05-23 08:25:50.221 EDT [10086] LOG:  database system is shut down
2019-05-23 08:26:11.560 EDT [10094] LOG:  database system was shut down in recovery at 2019-05-23 08:25:50 EDT
2019-05-23 08:26:11.560 EDT [10094] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:11.560 EDT [10094] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:26:11.562 EDT [10094] LOG:  entering standby mode
2019-05-23 08:26:17.817 EDT [10092] LOG:  received fast shutdown request
2019-05-23 08:26:17.824 EDT [10095] LOG:  shutting down
2019-05-23 08:26:17.831 EDT [10092] LOG:  database system is shut down
2019-05-23 08:26:34.596 EDT [10103] LOG:  database system was shut down in recovery at 2019-05-23 08:26:17 EDT
2019-05-23 08:26:34.596 EDT [10103] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:34.596 EDT [10103] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:26:34.598 EDT [10103] LOG:  entering standby mode
2019-05-23 08:26:43.025 EDT [10101] LOG:  received fast shutdown request
2019-05-23 08:26:43.032 EDT [10104] LOG:  shutting down
2019-05-23 08:26:43.038 EDT [10101] LOG:  database system is shut down
2019-05-23 08:26:45.748 EDT [10107] LOG:  database system was shut down in recovery at 2019-05-23 08:26:43 EDT
2019-05-23 08:26:45.748 EDT [10107] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:45.748 EDT [10107] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
and when connecting, I get this cryptic message:
[root@psql01 log]# psql -U postgres -h 192.168.0.108
psql: FATAL:  the database system is starting up
[root@psql01 log]#
There's no space issue here. The cluster stays in recovery. This state
doesn't change and the DB is inaccessible.
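While it's stuck like this, pg_controldata will at least show what recovery
thinks its state is (data directory assumed to be the same /data/patroni as
above):

  -bash-4.2$ /usr/pgsql-10/bin/pg_controldata /data/patroni
  # look at "Database cluster state" and
  # "Minimum recovery ending location" to see where recovery is waiting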
So I ran the above in debug mode ( -d 5 ) and got this:
DEBUG:  mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
So I freed memory on the underlying physical host this VM sits on, but no
luck: same memory message and same result as above.
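That DEBUG line on its own is normally harmless: with the default
huge_pages = try, Postgres just falls back to regular pages when the mmap
fails. If you want to take it out of the picture entirely, you can pin the
setting (a sketch; assumes you're fine running without huge pages):

  # postgresql.conf
  huge_pages = off    # default is 'try'; 'on' would refuse to start instead of falling back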
When Patroni is started on just this one node (for troubleshooting
purposes), I get this:
[root@psql01 ~]# ps -ef | grep -Ei "patroni|postgres"
root      2217  2188  0 00:38 pts/1   00:00:00 tail -f postgresql-Thu.log
postgres  2512     1  4 00:42 ?       00:00:01 /usr/bin/python2 /bin/patroni /etc/patroni.yml
postgres  2533     1  0 00:42 ?       00:00:00 /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
postgres  2535  2533  0 00:42 ?       00:00:00 postgres: postgres: logger process
postgres  2536  2533  0 00:42 ?       00:00:00 postgres: postgres: startup process   waiting for 000000010000000000000008
root      2664  2039  0 00:42 pts/0   00:00:00 grep --color=auto -Ei patroni|postgres
[root@psql01 ~]#
The system prints "postgres: postgres: startup process waiting for
000000010000000000000008"
Seems it's trying to replay the history log but can't? Running strace
against it reveals that it's looking for the following:
read(5, 0x7fff9cb4eb87, 1)                = -1 EAGAIN (Resource temporarily unavailable)
read(5, 0x7fff9cb4eb87, 1)                = -1 EAGAIN (Resource temporarily unavailable)
open("pg_wal/00000098.history", O_RDONLY) = -1 ENOENT (No such file or directory)
epoll_create1(EPOLL_CLOEXEC)              = 3
But this file doesn't exist in the pg_wal directory on this particular
cluster node (psql01). If I copy the missing history files over from
another cluster node, it does move forward, but it just gets stuck on
another history file, with the same message as above on psql01.
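A quick way to compare what each node actually has, and to sanity-check the
segment the startup process is waiting on, is something like this (same data
directory assumed):

  -bash-4.2$ ls -l /data/patroni/pg_wal/
  -bash-4.2$ /usr/pgsql-10/bin/pg_controldata /data/patroni | grep -i timeline
  # pg_waldump will complain loudly if the segment being waited on is damaged:
  -bash-4.2$ /usr/pgsql-10/bin/pg_waldump /data/patroni/pg_wal/000000010000000000000008 | head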
So I give up and try to recover the secondary node (psql02). But this
yields this error:
PANIC:  replication checkpoint has wrong magic 0 instead of 307747550
and I'm not able to find very much to move me forward on this one either.
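As far as I can tell, that particular PANIC comes from the replication-origin
checkpoint rather than from an ordinary WAL checkpoint, and a magic of 0
usually just means the file got zeroed out by the crash. Worth a look (path
assumed to be the usual location under the data directory):

  -bash-4.2$ ls -l /data/patroni/pg_logical/replorigin_checkpoint
  -bash-4.2$ od -A x -t x1 /data/patroni/pg_logical/replorigin_checkpoint | head
  # an all-zero dump would line up with "wrong magic 0"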
So after a few hours of trying to recover psql02, I give up again and move
on to the third cluster node (pgsql03), where I get this error, which I'm
also not able to find much info on:
FATAL:  syntax error in history file: f2W
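Timeline history files are plain text, so garbage like f2W is easy to spot
just by looking at them (paths assumed as before):

  -bash-4.2$ cat /data/patroni/pg_wal/*.history
  # each valid line is <timelineID> <switchpoint LSN> <reason>, e.g.
  #   2   0/7000060   no recovery target specified

Anything binary or truncated in there would point at the same storage
corruption.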
Hoping I'm able to recover the data from at least one of the nodes, make
it primary, and get it to replicate to the others.
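For whichever node does come up cleanly, the promotion step itself is
straightforward; a rough sketch, with Patroni either stopped or left to do
it for you:

  # acting on PostgreSQL directly while Patroni is stopped:
  -bash-4.2$ /usr/pgsql-10/bin/pg_ctl promote -D /data/patroni
  # or, with Patroni running and healthy, let it handle the switch:
  -bash-4.2$ patronictl -c /etc/patroni.yml failover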
Thx, TK
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx