We are trying to upgrade our Kubernetes worker nodes, but we are encountering an issue with our Postgres clusters. As soon as our system administrator drains one of the nodes, the Postgres cluster becomes unhealthy. We expected another node to
take over as the master, but this did not occur. Note that the replicas are running in synchronous mode (the relevant part of the Patroni configuration is included at the end of this post).

This is the command the system administrator ran; the issue occurs shortly after:

kubectl drain $NODENAME --delete-local-data --ignore-daemonsets

Here are snippets of the logs of the master and one of the replicas (the log of the second replica looks the same). The master was running on the node that was drained.

Master Log
----------
2021-04-15 20:44:29,986 INFO: no action. i am the leader with the lock
2021-04-15 20:44:39,907 INFO: Lock owner: dbo-dhog-5b7dc865b4-lg7c9; I am dbo-dhog-5b7dc865b4-lg7c9
/tmp:5432 - no response
2021-04-15 20:47:13.940 UTC [186] LOG: pgaudit extension initialized
2021-04-15 20:47:13.942 UTC [186] LOG: starting PostgreSQL 12.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2021-04-15 20:47:13.942 UTC [186] LOG: listening on IPv4 address "0.0.0.0", port 5432
2021-04-15 20:47:14.105 UTC [186] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-04-15 20:47:14.277 UTC [186] LOG: listening on Unix socket "/crunchyadm/.s.PGSQL.5432"
2021-04-15 20:47:14.321 UTC [186] LOG: redirecting log output to logging collector process
2021-04-15 20:47:14.321 UTC [186] HINT: Future log output will appear in directory "pg_log".
/tmp:5432 - rejecting connections
/tmp:5432 - rejecting connections
/tmp:5432 - rejecting connections
/tmp:5432 - accepting connections
2021-04-15 20:47:16,996 INFO: establishing a new patroni connection to the postgres cluster
2021-04-15 20:47:17,105 INFO: following a different leader because i am not the healthiest node
Thu Apr 15 20:47:17 UTC 2021 INFO: Node dbo-dhog-5b7dc865b4-d8pv4 fully initialized for cluster dbo and is ready for use
2021-04-15 20:47:27,511 INFO: following a different leader because i am not the healthiest node
2021-04-15 20:47:37,516 INFO: following a different leader because i am not the healthiest node
2021-04-15 20:47:47,511 INFO: following a different leader because i am not the healthiest node

Replica Log
-----------
2021-04-15 20:46:16.618 UTC [28158] LOG: listening on IPv4 address "0.0.0.0", port 5432
/tmp:5432 - no response
2021-04-15 20:46:16.655 UTC [28158] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-04-15 20:46:16.697 UTC [28158] LOG: listening on Unix socket "/crunchyadm/.s.PGSQL.5432"
2021-04-15 20:46:16.764 UTC [28158] LOG: redirecting log output to logging collector process
2021-04-15 20:46:16.764 UTC [28158] HINT: Future log output will appear in directory "pg_log".
2021-04-15 20:46:17,061 INFO: Lock owner: None; I am dbo-ffvg-6c5d7c576c-8tcwt
2021-04-15 20:46:17,061 INFO: not healthy enough for leader race
2021-04-15 20:46:17,061 INFO: changing primary_conninfo and restarting in progress
/tmp:5432 - rejecting connections
/tmp:5432 - rejecting connections
/tmp:5432 - accepting connections
2021-04-15 20:46:18,729 INFO: establishing a new patroni connection to the postgres cluster
2021-04-15 20:46:18,871 INFO: following a different leader because i am not the healthiest node
2021-04-15 20:46:29,245 INFO: following a different leader because i am not the healthiest node
2021-04-15 20:46:39,248 INFO: following a different leader because i am not the healthiest node
2021-04-15 20:46:49,249 INFO: following a different leader because i am not the healthiest node
2021-04-15 20:46:59,247 INFO: following a different leader because i am not the healthiest node

What can we do to prevent this situation from occurring?
We do not want this to occur when we upgrade our production nodes.
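For completeness, this is roughly how synchronous replication is enabled in the Patroni (DCS) section of our cluster configuration. It is a minimal sketch from memory, so the surrounding keys and values may not match our manifests exactly; the timing values shown are just the Patroni defaults:

    bootstrap:
      dcs:
        ttl: 30
        loop_wait: 10
        retry_timeout: 10
        # synchronous replication is on; synchronous_mode_strict is not set
        synchronous_mode: true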
Thanks,
Os