etcd failed on master node

daur exp <dauren.naipov@xxxxxxxxx> · Fri, 7 Feb 2025 14:38:06 +0500

Hello Community,

I have a 3-node etcd cluster. Two of the nodes also run PostgreSQL in a Patroni cluster; the 3rd node is only an etcd member. The 3 nodes are located in 3 different data centers.
I cloned the VM of the 1st node, and afterwards etcd on that node crashed. I think the disk was overloaded while the VM was being cloned. Patroni and PostgreSQL on that node are fine and still working.

I see this error from etcd:
Feb 06 15:44:52 prod-pgsql01-uv01 bash[1832370]: panic: tocommit(43265370) is out of range [lastIndex(43231911)]. Was the raft log corrupted, truncated, or lost?
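
The two numbers in the panic appear to be raft log indexes: the commit index the node is asked to advance to, and the last index actually present in its local log. A minimal sketch (plain sh with sed; the variable names are only illustrative) that extracts both and shows how many entries the local log is missing:

```shell
#!/bin/sh
# Panic line copied from the etcd journal above.
line='panic: tocommit(43265370) is out of range [lastIndex(43231911)]. Was the raft log corrupted, truncated, or lost?'

# Extract the two raft indexes from the message.
tocommit=$(printf '%s\n' "$line" | sed -E 's/.*tocommit\(([0-9]+)\).*/\1/')
last_index=$(printf '%s\n' "$line" | sed -E 's/.*lastIndex\(([0-9]+)\).*/\1/')

# Number of log entries the cluster expects that are not in the local log.
gap=$((tocommit - last_index))

echo "commit index requested: $tocommit"
echo "local last index:       $last_index"
echo "entries missing:        $gap"
```

A positive gap means the local log ends before the point the rest of the cluster considers committed, which would be consistent with writes being lost during the VM clone.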

root@prod-pgsql01-uv01:/etc/etcd# systemctl status etcd
× etcd.service - Etcd Server
     Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Thu 2025-02-06 14:31:50 +05; 8min ago
   Duration: 6month 4w 1d 9h 17min 14.044s
    Process: 1823702 ExecStart=/bin/bash -c GOMAXPROCS=$(nproc) /usr/bin/etcd (code=exited, status=2)
   Main PID: 1823702 (code=exited, status=2)
        CPU: 1.668s

Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: etcd.service: Scheduled restart job, restart counter is at 5.
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: Stopped Etcd Server.
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: etcd.service: Consumed 1.668s CPU time.
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: etcd.service: Start request repeated too quickly.
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: etcd.service: Failed with result 'exit-code'.
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: Failed to start Etcd Server.
root@prod-pgsql01-uv01:/etc/etcd# etcdctl member list
{"level":"warn","ts":"2025-02-06T14:40:17.959419+0500","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000d8700/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
Error: context deadline exceeded

root@prod-pgsql01-uv01:/var/lib/etcd/postgresql/member# patronictl -c /etc/patroni/patroni.yml list
2025-02-07 14:25:57,387 - WARNING - Retrying (Retry(total=1, connect=None, read=None, redirect=0, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe63fdf3610>: Failed to establish a new connection: [Errno 111] Connection refused')': /v2/machines
2025-02-07 14:25:57,387 - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe63fdf3910>: Failed to establish a new connection: [Errno 111] Connection refused')': /v2/machines
2025-02-07 14:25:57,387 - ERROR - Failed to get list of machines from http://10.0.100.29:2379/v2: MaxRetryError("HTTPConnectionPool(host='10.0.100.29', port=2379): Max retries exceeded with url: /v2/machines (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe63fdf3a60>: Failed to establish a new connection: [Errno 111] Connection refused'))")
^C
===========================================================
On the replica, however, the output is fine and looks like this:

root@stb-pgsql02-uv02:/home/ansible# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: patroni-cluster (7389499945753108875) -------+----+-----------+
| Member            | Host        | Role    | State     | TL | Lag in MB |
+-------------------+-------------+---------+-----------+----+-----------+
| prod-pgsql01-uv01 | 10.0.100.29 | Leader  | running   |  4 |           |
| stb-pgsql02-uv02  | 10.0.100.31 | Replica | streaming |  4 |         0 |
+-------------------+-------------+---------+-----------+----+-----------+

root@stb-pgsql02-uv02:/home/ansible# etcdctl member list
164fa7c8b348f043, started, node1, http://10.0.100.29:2380, http://10.0.100.29:2379, false
5b78829cdf24f062, started, node2, http://10.0.100.31:2380, http://10.0.100.31:2379, false
b209a2d81cc1c996, started, node3, http://10.0.225.203:2380, http://10.0.225.203:2379, false

I am not sure whether a switchover to the replica would complete successfully. I am worried it could cause a split brain or crash the Patroni cluster.
If I run this on node2:
  patronictl -c /etc/patroni/patroni.yml switchover --candidate node2
and then this on node1:
  systemctl stop etcd
  rm -rf /var/lib/etcd/member/snap/*
  rm -rf /var/lib/etcd/member/wal/*
  systemctl start etcd
will that help?
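
For comparison, the member replacement procedure in the etcd documentation does not stop at wiping the data directory: the broken member is first removed from the cluster and added back from a healthy node, and then started with an empty data directory and `initial-cluster-state` set to `existing`. A minimal dry-run sketch of that sequence follows. It only prints the commands. The member ID, name and URLs come from the `etcdctl member list` output above; `DATA_DIR` is an assumption (the shell prompt above shows /var/lib/etcd/postgresql/member, so it must be checked against `data-dir` in the etcd config), and the `plan` helper is hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
MEMBER_ID=164fa7c8b348f043               # node1, from `etcdctl member list`
MEMBER_NAME=node1
PEER_URL=http://10.0.100.29:2380
HEALTHY_ENDPOINT=http://10.0.100.31:2379 # node2, still healthy
DATA_DIR=/var/lib/etcd                   # ASSUMPTION: verify against data-dir

plan() {
  # On a healthy member: drop the broken member and register it again.
  echo "[node2] etcdctl --endpoints=$HEALTHY_ENDPOINT member remove $MEMBER_ID"
  echo "[node2] etcdctl --endpoints=$HEALTHY_ENDPOINT member add $MEMBER_NAME --peer-urls=$PEER_URL"
  # On the broken node: move the old data aside and rejoin as an existing cluster.
  echo "[node1] systemctl stop etcd"
  echo "[node1] mv $DATA_DIR/member $DATA_DIR/member.bak"
  echo "[node1] # set initial-cluster-state to 'existing' in the etcd config"
  echo "[node1] systemctl reset-failed etcd"
  echo "[node1] systemctl start etcd"
  # Back on a healthy member: confirm all three members respond.
  echo "[node2] etcdctl --endpoints=$HEALTHY_ENDPOINT endpoint health --cluster"
}

steps=$(plan)
echo "$steps"
```

The sketch moves the old member directory aside rather than deleting it, so the data can still be inspected later. `systemctl reset-failed` is included because the unit has hit its start limit ("Start request repeated too quickly"). Also note that `--candidate` expects a Patroni member name as shown by `patronictl list` (stb-pgsql02-uv02 above), whereas node2 is the etcd member name.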

I need your advice on how to restore the etcd cluster that this Patroni cluster depends on.

--
Regards, Dauren