We are trying to upgrade our Kubernetes worker nodes, but we are encountering an issue with our Postgres clusters. As soon as our system administrator drains one of the nodes, the Postgres cluster becomes unhealthy. We expected another node to
take over as the master, but this did not occur. Note that the replicas are running in synchronous mode (the relevant part of the Patroni configuration is included at the end of this post).

This is the command the system administrator ran; the issue occurs shortly after:

kubectl drain $NODENAME --delete-local-data --ignore-daemonsets

Here are snippets of the logs of the master and one of the replicas (the log of the second replica looks the same). The master was running on the node that was drained.

Master Log
----------
2021-04-15 20:44:29,986 INFO: no action. i am the leader with the lock
2021-04-15 20:44:39,907 INFO: Lock owner: dbo-dhog-5b7dc865b4-lg7c9; I am dbo-dhog-5b7dc865b4-lg7c9
/tmp:5432 - no response
2021-04-15 20:47:13.940 UTC [186] LOG: pgaudit extension initialized
2021-04-15 20:47:13.942 UTC [186] LOG: starting PostgreSQL 12.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2021-04-15 20:47:13.942 UTC [186] LOG: listening on IPv4 address "0.0.0.0", port 5432
2021-04-15 20:47:14.105 UTC [186] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-04-15 20:47:14.277 UTC [186] LOG: listening on Unix socket "/crunchyadm/.s.PGSQL.5432"
2021-04-15 20:47:14.321 UTC [186] LOG: redirecting log output to logging collector process
2021-04-15 20:47:14.321 UTC [186] HINT: Future log output will appear in directory "pg_log".
/tmp:5432 - rejecting connections
/tmp:5432 - rejecting connections
/tmp:5432 - rejecting connections
/tmp:5432 - accepting connections
2021-04-15 20:47:16,996 INFO: establishing a new patroni connection to the postgres cluster
2021-04-15 20:47:17,105 INFO: following a different leader because i am not the healthiest node
Thu Apr 15 20:47:17 UTC 2021 INFO: Node dbo-dhog-5b7dc865b4-d8pv4 fully initialized for cluster dbo and is ready for use
2021-04-15 20:47:27,511 INFO: following a different leader because i am not the healthiest node
2021-04-15 20:47:37,516 INFO: following a different leader because i am not the healthiest node
2021-04-15 20:47:47,511 INFO: following a different leader because i am not the healthiest node

Replica Log
-----------
2021-04-15 20:46:16.618 UTC [28158] LOG: listening on IPv4 address "0.0.0.0", port 5432
/tmp:5432 - no response
2021-04-15 20:46:16.655 UTC [28158] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-04-15 20:46:16.697 UTC [28158] LOG: listening on Unix socket "/crunchyadm/.s.PGSQL.5432"
2021-04-15 20:46:16.764 UTC [28158] LOG: redirecting log output to logging collector process
2021-04-15 20:46:16.764 UTC [28158] HINT: Future log output will appear in directory "pg_log".
2021-04-15 20:46:17,061 INFO: Lock owner: None; I am dbo-ffvg-6c5d7c576c-8tcwt
2021-04-15 20:46:17,061 INFO: not healthy enough for leader race
2021-04-15 20:46:17,061 INFO: changing primary_conninfo and restarting in progress
/tmp:5432 - rejecting connections
/tmp:5432 - rejecting connections
/tmp:5432 - accepting connections
2021-04-15 20:46:18,729 INFO: establishing a new patroni connection to the postgres cluster
2021-04-15 20:46:18,871 INFO: following a different leader because i am not the healthiest node
2021-04-15 20:46:29,245 INFO: following a different leader because i am not the healthiest node
2021-04-15 20:46:39,248 INFO: following a different leader because i am not the healthiest node
2021-04-15 20:46:49,249 INFO: following a different leader because i am not the healthiest node
2021-04-15 20:46:59,247 INFO: following a different leader because i am not the healthiest node

What can we do to prevent this situation from occurring?
We do not want this to occur when we upgrade our production nodes.
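For completeness, this is roughly how synchronous replication is enabled in the Patroni (DCS) section of our cluster configuration. It is a minimal sketch from memory, so the surrounding keys and values may not match our manifests exactly; the timing values shown are just the Patroni defaults:

    bootstrap:
      dcs:
        ttl: 30
        loop_wait: 10
        retry_timeout: 10
        # synchronous replication is on; synchronous_mode_strict is not set
        synchronous_mode: true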
Thanks,
Os