On 5/27/19 9:59 PM, Tom K wrote:
Hey Guys,
I'm running Patroni with PostgreSQL 10, etcd, HAProxy and Keepalived on 3
RHEL 7.6 VMs. Every now and then the underlying storage crashes and takes
out the cluster. On recovery, PostgreSQL usually comes back up while other
databases just blow up. That is, until now.
The above is one master and two standby servers connected via streaming
replication, correct?
Is there WAL archiving set up?
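For reference, if archiving is in play it would show up in postgresql.conf
as something roughly like this (the archive destination below is only a
placeholder, not taken from your setup):

  # postgresql.conf -- present only if WAL archiving is enabled
  archive_mode = on
  # placeholder path; yours would point at your real archive location
  archive_command = 'cp %p /path/to/wal_archive/%f'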
After the most recent crash 2-3 weeks ago, the cluster keeps running into
the messages below, and I'm not able to make heads or tails of why it's
throwing them:
So you have not been able to run the cluster for the past 2-3 weeks, or is
that more recent?
When you refer to history files below are you talking about WAL files or
something else?
Is this:
"recovery command file "recovery.conf" specified neither
primary_conninfo nor restore_command"
true?
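For context, a minimal recovery.conf for a PostgreSQL 10 streaming standby
would normally carry at least one of those two settings, roughly like this
(host, user and paths below are placeholders, not taken from your setup):

  # recovery.conf -- minimal streaming-standby sketch
  standby_mode = 'on'
  primary_conninfo = 'host=<primary-host> port=5432 user=replicator'
  # and/or, when restoring from a WAL archive:
  restore_command = 'cp /path/to/wal_archive/%f %p'

If neither is set, the standby can only sit there polling pg_wal for files
someone else puts in place, which matches the HINT in your logs.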
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf
2019-05-23 08:26:34.561 EDT [10101] LOG:  listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:26:34.565 EDT [10101] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:26:34.591 EDT [10101] LOG:  redirecting log output to logging collector process
2019-05-23 08:26:34.591 EDT [10101] HINT:  Future log output will appear in directory "log".
^C
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
2019-05-23 08:26:45.707 EDT [10105] LOG:  listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:26:45.715 EDT [10105] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:26:45.740 EDT [10105] LOG:  redirecting log output to logging collector process
2019-05-23 08:26:45.740 EDT [10105] HINT:  Future log output will appear in directory "log".
^C
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.1 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
2019-05-23 08:29:36.014 EDT [10188] LOG:  listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:29:36.017 EDT [10188] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:29:36.040 EDT [10188] LOG:  redirecting log output to logging collector process
2019-05-23 08:29:36.040 EDT [10188] HINT:  Future log output will appear in directory "log".
The database logs only print out this:
2019-05-23 08:25:44.476 EDT [10088] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:25:44.477 EDT [10088] LOG:  entering standby mode
2019-05-23 08:25:50.209 EDT [10086] LOG:  received fast shutdown request
2019-05-23 08:25:50.215 EDT [10089] LOG:  shutting down
2019-05-23 08:25:50.221 EDT [10086] LOG:  database system is shut down
2019-05-23 08:26:11.560 EDT [10094] LOG:  database system was shut down in recovery at 2019-05-23 08:25:50 EDT
2019-05-23 08:26:11.560 EDT [10094] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:11.560 EDT [10094] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:26:11.562 EDT [10094] LOG:  entering standby mode
2019-05-23 08:26:17.817 EDT [10092] LOG:  received fast shutdown request
2019-05-23 08:26:17.824 EDT [10095] LOG:  shutting down
2019-05-23 08:26:17.831 EDT [10092] LOG:  database system is shut down
2019-05-23 08:26:34.596 EDT [10103] LOG:  database system was shut down in recovery at 2019-05-23 08:26:17 EDT
2019-05-23 08:26:34.596 EDT [10103] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:34.596 EDT [10103] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:26:34.598 EDT [10103] LOG:  entering standby mode
2019-05-23 08:26:43.025 EDT [10101] LOG:  received fast shutdown request
2019-05-23 08:26:43.032 EDT [10104] LOG:  shutting down
2019-05-23 08:26:43.038 EDT [10101] LOG:  database system is shut down
2019-05-23 08:26:45.748 EDT [10107] LOG:  database system was shut down in recovery at 2019-05-23 08:26:43 EDT
2019-05-23 08:26:45.748 EDT [10107] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:45.748 EDT [10107] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
and when connecting, I get this cryptic message:
[root@psql01 log]# psql -U postgres -h 192.168.0.108
psql: FATAL:  the database system is starting up
[root@psql01 log]#
There's no space issue here. The cluster stays in recovery. This state
doesn't change and the DB is inaccessible.
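While it's stuck like this, pg_controldata will at least show what recovery
thinks its state is (data directory assumed to be the same /data/patroni as
above):

  -bash-4.2$ /usr/pgsql-10/bin/pg_controldata /data/patroni
  # look at "Database cluster state" and
  # "Minimum recovery ending location" to see where recovery is waiting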
So I ran the above in debug mode ( -d 5 ) and got this:
DEBUG:  mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
So I freed memory on the underlying physical host this VM sits on, but no
luck: same memory message and same result as above.
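That DEBUG line on its own is normally harmless: with the default
huge_pages = try, Postgres just falls back to regular pages when the mmap
fails. If you want to take it out of the picture entirely, you can pin the
setting (a sketch; assumes you're fine running without huge pages):

  # postgresql.conf
  huge_pages = off    # default is 'try'; 'on' would refuse to start instead of falling back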
When Patroni is started on just this one node (for troubleshooting
purposes), I get this:
[root@psql01 ~]# ps -ef | grep -Ei "patroni|postgres"
root      2217  2188  0 00:38 pts/1   00:00:00 tail -f postgresql-Thu.log
postgres  2512     1  4 00:42 ?       00:00:01 /usr/bin/python2 /bin/patroni /etc/patroni.yml
postgres  2533     1  0 00:42 ?       00:00:00 /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
postgres  2535  2533  0 00:42 ?       00:00:00 postgres: postgres: logger process
postgres  2536  2533  0 00:42 ?       00:00:00 postgres: postgres: startup process   waiting for 000000010000000000000008
root      2664  2039  0 00:42 pts/0   00:00:00 grep --color=auto -Ei patroni|postgres
[root@psql01 ~]#
The system prints "postgres: postgres: startup process waiting for
000000010000000000000008"
Seems it's trying to replay the history log but can't? Running strace
against it reveals that it's looking for the following:
read(5, 0x7fff9cb4eb87, 1)                = -1 EAGAIN (Resource temporarily unavailable)
read(5, 0x7fff9cb4eb87, 1)                = -1 EAGAIN (Resource temporarily unavailable)
open("pg_wal/00000098.history", O_RDONLY) = -1 ENOENT (No such file or directory)
epoll_create1(EPOLL_CLOEXEC)              = 3
But this file doesn't exist in the pg_wal directory on this particular
cluster node (psql01). If I copy the missing history files over from
another cluster node, it does move forward, but it just gets stuck on
another history file, with the same message as above on psql01.
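A quick way to compare what each node actually has, and to sanity-check the
segment the startup process is waiting on, is something like this (same data
directory assumed):

  -bash-4.2$ ls -l /data/patroni/pg_wal/
  -bash-4.2$ /usr/pgsql-10/bin/pg_controldata /data/patroni | grep -i timeline
  # pg_waldump will complain loudly if the segment being waited on is damaged:
  -bash-4.2$ /usr/pgsql-10/bin/pg_waldump /data/patroni/pg_wal/000000010000000000000008 | head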
So I give up and try to recover the secondary node (psql02). But this
yields this error:
PANIC:  replication checkpoint has wrong magic 0 instead of 307747550
and I'm not able to find very much to move me forward on this one either.
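As far as I can tell, that particular PANIC comes from the replication-origin
checkpoint rather than from an ordinary WAL checkpoint, and a magic of 0
usually just means the file got zeroed out by the crash. Worth a look (path
assumed to be the usual location under the data directory):

  -bash-4.2$ ls -l /data/patroni/pg_logical/replorigin_checkpoint
  -bash-4.2$ od -A x -t x1 /data/patroni/pg_logical/replorigin_checkpoint | head
  # an all-zero dump would line up with "wrong magic 0"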
So after a few hours of trying to recover psql02, I give up again and move
on to the third cluster node (pgsql03), where I get this error, which I'm
also not able to find much info on:
FATAL:  syntax error in history file: f2W
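Timeline history files are plain text, so garbage like f2W is easy to spot
just by looking at them (paths assumed as before):

  -bash-4.2$ cat /data/patroni/pg_wal/*.history
  # each valid line is <timelineID> <switchpoint LSN> <reason>, e.g.
  #   2   0/7000060   no recovery target specified

Anything binary or truncated in there would point at the same storage
corruption.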
Hoping I'm able to recover the data from at least one of the nodes, make
it primary, and get it to replicate to the others.
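For whichever node does come up cleanly, the promotion step itself is
straightforward; a rough sketch, with Patroni either stopped or left to do
it for you:

  # acting on PostgreSQL directly while Patroni is stopped:
  -bash-4.2$ /usr/pgsql-10/bin/pg_ctl promote -D /data/patroni
  # or, with Patroni running and healthy, let it handle the switch:
  -bash-4.2$ patronictl -c /etc/patroni.yml failover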
Thx, TK
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx