Hello Community,
I have a 3-node etcd cluster. Two of the nodes also run PostgreSQL in a Patroni cluster; the 3rd node is only an etcd member. The 3 nodes are located in 3 different data centers.
I cloned the VM of the 1st node, and after that etcd on it crashed. I think the disk was overloaded while the VM was being cloned. Patroni and PostgreSQL are still OK and working.
I see this error from etcd:
Feb 06 15:44:52 prod-pgsql01-uv01 bash[1832370]: panic: tocommit(43265370) is out of range [lastIndex(43231911)]. Was the raft log corrupted, truncated, or lost?
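To put a number on that panic: the message says the node was asked to commit an index beyond the end of its own raft log, which is consistent with the local log having lost or rolled back entries. A minimal sketch that extracts the two indices and prints how many entries are missing; the function name `raft_gap` is only illustrative.

```shell
#!/usr/bin/env bash
# Extract tocommit and lastIndex from an etcd panic line and print the difference.
raft_gap() {
  local line="$1" tocommit last
  tocommit="$(printf '%s\n' "$line" | sed -n 's/.*tocommit(\([0-9][0-9]*\)).*/\1/p')"
  last="$(printf '%s\n' "$line" | sed -n 's/.*lastIndex(\([0-9][0-9]*\)).*/\1/p')"
  # Fail if either index could not be parsed from the line.
  [ -n "$tocommit" ] && [ -n "$last" ] || return 1
  echo $(( tocommit - last ))
}

panic_line='panic: tocommit(43265370) is out of range [lastIndex(43231911)]. Was the raft log corrupted, truncated, or lost?'
raft_gap "$panic_line"
```

For the line above this prints 33459, i.e. the local log on node1 ends 33459 entries short of what the rest of the cluster has already committed.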
root@prod-pgsql01-uv01:/etc/etcd# systemctl status etcd
× etcd.service - Etcd Server
Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; preset: disabled)
Active: failed (Result: exit-code) since Thu 2025-02-06 14:31:50 +05; 8min ago
Duration: 6month 4w 1d 9h 17min 14.044s
Process: 1823702 ExecStart=/bin/bash -c GOMAXPROCS=$(nproc) /usr/bin/etcd (code=exited, status=2)
Main PID: 1823702 (code=exited, status=2)
CPU: 1.668s
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: etcd.service: Scheduled restart job, restart counter is at 5.
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: Stopped Etcd Server.
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: etcd.service: Consumed 1.668s CPU time.
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: etcd.service: Start request repeated too quickly.
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: etcd.service: Failed with result 'exit-code'.
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: Failed to start Etcd Server.
root@prod-pgsql01-uv01:/etc/etcd# etcdctl member list
{"level":"warn","ts":"2025-02-06T14:40:17.959419+0500","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000d8700/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
Error: context deadline exceeded
root@prod-pgsql01-uv01:/var/lib/etcd/postgresql/member# patronictl -c /etc/patroni/patroni.yml list
2025-02-07 14:25:57,387 - WARNING - Retrying (Retry(total=1, connect=None, read=None, redirect=0, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe63fdf3610>: Failed to establish a new connection: [Errno 111] Connection refused')': /v2/machines
2025-02-07 14:25:57,387 - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe63fdf3910>: Failed to establish a new connection: [Errno 111] Connection refused')': /v2/machines
2025-02-07 14:25:57,387 - ERROR - Failed to get list of machines from http://10.0.100.29:2379/v2: MaxRetryError("HTTPConnectionPool(host='10.0.100.29', port=2379): Max retries exceeded with url: /v2/machines (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe63fdf3a60>: Failed to establish a new connection: [Errno 111] Connection refused'))")
^C
===========================================================
But on the replica the output is OK and looks like this:
root@stb-pgsql02-uv02:/home/ansible# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: patroni-cluster (7389499945753108875) ---------+-----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------------------------------------------+-------------+---------+-----------+----+-----------+
| prod-pgsql01-uv01 | 10.0.100.29 | Leader | running | 4 | |
| stb-pgsql02-uv02 | 10.0.100.31 | Replica | streaming | 4 | 0 |
+----------------------------------------------+-------------+---------+-----------+----+-----------+
root@stb-pgsql02-uv02:/home/ansible# etcdctl member list
164fa7c8b348f043, started, node1, http://10.0.100.29:2380, http://10.0.100.29:2379, false
5b78829cdf24f062, started, node2, http://10.0.100.31:2380, http://10.0.100.31:2379, false
b209a2d81cc1c996, started, node3, http://10.0.225.203:2380, http://10.0.225.203:2379, false
I am not sure whether a switchover that promotes the replica to leader would complete successfully. It might cause a split brain or crash the Patroni cluster.
If I run this on node2:
patronictl -c /etc/patroni/patroni.yml switchover --candidate stb-pgsql02-uv02
then on node1:
systemctl stop etcd
rm -rf /var/lib/etcd/member/snap/*
rm -rf /var/lib/etcd/member/wal/*
systemctl start etcd
will that help?
I would appreciate your advice on how to restore the etcd cluster in this Patroni setup.
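One alternative to deleting snap/ and wal/ in place is etcd's member replacement flow: remove the broken member from a healthy node, add it back, then start it with an empty data directory and initial-cluster-state set to existing. The following is a minimal sketch that only prints such a plan and executes nothing. The member ID, name, and URLs are copied from the `etcdctl member list` output above; `ETCD_DATA_DIR` is an assumption and has to match the real data-dir (the shell prompt above suggests it may be /var/lib/etcd/postgresql rather than /var/lib/etcd).

```shell
#!/usr/bin/env bash
# Prints a member-replacement plan; it does not execute anything.
# Values are taken from the member list in this post, except ETCD_DATA_DIR,
# which is an assumption and must be set to the actual etcd data-dir.
HEALTHY_ENDPOINT="http://10.0.100.31:2379"   # node2
BROKEN_ID="164fa7c8b348f043"                 # node1
BROKEN_NAME="node1"
BROKEN_PEER_URL="http://10.0.100.29:2380"
ETCD_DATA_DIR="${ETCD_DATA_DIR:-/var/lib/etcd}"

print_plan() {
  echo "# On a healthy node (node2): remove the broken member, then add it back"
  echo "etcdctl --endpoints=$HEALTHY_ENDPOINT member remove $BROKEN_ID"
  echo "etcdctl --endpoints=$HEALTHY_ENDPOINT member add $BROKEN_NAME --peer-urls=$BROKEN_PEER_URL"
  echo "# On node1: stop etcd and move the old data aside instead of deleting it"
  echo "systemctl stop etcd"
  echo "mv $ETCD_DATA_DIR/member $ETCD_DATA_DIR/member.broken"
  echo "# On node1: set initial-cluster-state to 'existing' in the etcd config, then start"
  echo "systemctl start etcd"
  echo "# On a healthy node: confirm all three members respond"
  echo "etcdctl --endpoints=$HEALTHY_ENDPOINT endpoint health --cluster"
}

print_plan
```

Moving the old member directory aside rather than running rm -rf keeps a copy to fall back on. With node1's etcd down, the remaining two members still form a quorum, so Patroni on node2 can keep reaching etcd while this is done.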
--
Regards, Dauren