CephFS - MDS removed from map - filesystem keeps getting stopped

Hi,

we are running Ceph Pacific 16.2.13.

Our CephFS filesystem ran full, and after adding new hardware we tried to bring it back up, but our MDS daemons keep getting pushed to standby and removed from the MDS map.
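
(For reference, the MDS map and overall filesystem state can be checked with standard commands such as the following; this is only a sketch of the checks, output omitted:)

# ceph fs status cephfs
# ceph fs dump
# ceph health detail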

The filesystem was broken, so we repaired it with:

# ceph fs fail cephfs

# cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary

# cephfs-journal-tool --rank=cephfs:0 journal reset
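
(For completeness: the two cephfs-journal-tool steps above come from the upstream CephFS disaster-recovery procedure, which also documents a session table reset and, as a last resort, a filesystem reset. Listed here purely for reference:)

# cephfs-table-tool all reset session
# ceph fs reset cephfs --yes-i-really-mean-it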

Then I started the ceph-mds service and marked the rank as repaired.
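
(Roughly, assuming a systemd-managed MDS daemon named mds1, as seen in the log below, and rank 0 of the cephfs filesystem, those two steps look like:)

# systemctl start ceph-mds@mds1
# ceph mds repaired cephfs:0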

After some time the MDS switched back to standby. The log is below.

I would appreciate any help to resolve this situation. Thank you.

From the log:

2023-11-22T14:11:49.212+0100 7f5dc155e700  1 mds.0.9604 handle_mds_map i am now mds.0.9604
2023-11-22T14:11:49.212+0100 7f5dc155e700  1 mds.0.9604 handle_mds_map state change up:rejoin --> up:active
2023-11-22T14:11:49.212+0100 7f5dc155e700  1 mds.0.9604 recovery_done -- successful recovery!
2023-11-22T14:11:49.212+0100 7f5dc155e700  1 mds.0.9604 active_start
2023-11-22T14:11:49.216+0100 7f5dc155e700  1 mds.0.9604 cluster recovered.
2023-11-22T14:11:49.216+0100 7f5dc3d63700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.8.127:0/2123529386 conn(0x55a60627a800 0x55a606e5b000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc4564700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.6.88:0/1899426587 conn(0x55a60627ac00 0x55a6070d0000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc4564700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.4.216:0/2058542052 conn(0x55a6070c9800 0x55a6070d1800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc3d63700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.4.220:0/1549374180 conn(0x55a60708d000 0x55a6070d0800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc4d65700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.8.180:0/270666178 conn(0x55a60703a000 0x55a6070cf800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc4d65700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.8.178:0/3673271488 conn(0x55a6070c9400 0x55a6070d1000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc4d65700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.4.167:0/2667964940 conn(0x55a6070c9c00 0x55a607112000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc3d63700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.6.70:0/3181830075 conn(0x55a607116000 0x55a607112800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc4564700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.6.72:0/3744737352 conn(0x55a60627a800 0x55a606e5b000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc3d63700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.244.18.140:0/1607447464 conn(0x55a60627ac00 0x55a6070d0000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.220+0100 7f5dc155e700  1 mds.mds1 Updating MDS map to version 9608 from mon.1
2023-11-22T14:11:49.220+0100 7f5dc155e700  1 mds.0.9604 handle_mds_map i am now mds.0.9604
2023-11-22T14:11:49.220+0100 7f5dc155e700  1 mds.0.9604 handle_mds_map state change up:active --> up:stopping
2023-11-22T14:11:52.412+0100 7f5dc3562700  1 mds.mds1 asok_command: client ls {prefix=client ls} (starting...)
2023-11-22T14:11:57.412+0100 7f5dc3562700  1 mds.mds1 asok_command: client ls {prefix=client ls} (starting...)
2023-11-22T14:12:02.416+0100 7f5dc3562700  1 mds.mds1 asok_command: client ls {prefix=client ls} (starting...)
2023-11-22T14:12:07.420+0100 7f5dc3562700  1 mds.mds1 asok_command: client ls {prefix=client ls} (starting...)
2023-11-22T14:12:12.420+0100 7f5dc3562700  1 mds.mds1 asok_command: client ls {prefix=client ls} (starting...)
2023-11-22T14:12:13.552+0100 7f5dc155e700  1 mds.mds1 Updating MDS map to version 9609 from mon.1
2023-11-22T14:12:13.552+0100 7f5dc155e700  1 mds.mds1 Map removed me [mds.mds1{0:5320528} state up:stopping seq 67 addr [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] compat {c=[1],r=[1],i=[7ff]}] from cluster; respawning! See cluster/monitor logs for details.
2023-11-22T14:12:13.552+0100 7f5dc155e700  1 mds.mds1 respawn!




