CephFS - MDS removed from map - filesystem keeps getting stopped

Hi,

we are running Ceph Pacific 16.2.13.

Our CephFS filesystem ran full, and after adding new hardware we tried to bring it back up, but our MDS daemons keep getting pushed to standby and removed from the MDS map.
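
(For reference, the MDS map and overall filesystem state can be checked with standard commands such as the following; this is only a sketch of the checks, output omitted:)

# ceph fs status cephfs
# ceph fs dump
# ceph health detail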

The filesystem was broken, so we repaired it with:

# ceph fs fail cephfs

# cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary

# cephfs-journal-tool --rank=cephfs:0 journal reset
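
(For completeness: the two cephfs-journal-tool steps above come from the upstream CephFS disaster-recovery procedure, which also documents a session table reset and, as a last resort, a filesystem reset. Listed here purely for reference:)

# cephfs-table-tool all reset session
# ceph fs reset cephfs --yes-i-really-mean-it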

Then I started the ceph-mds service and marked the rank as repaired.
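
(Roughly, assuming a systemd-managed MDS daemon named mds1, as seen in the log below, and rank 0 of the cephfs filesystem, those two steps look like:)

# systemctl start ceph-mds@mds1
# ceph mds repaired cephfs:0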

After some time the MDS switched back to standby. The log is below.

I would appreciate any help to resolve this situation. Thank you.

From the log:

2023-11-22T14:11:49.212+0100 7f5dc155e700  1 mds.0.9604 handle_mds_map i am now mds.0.9604
2023-11-22T14:11:49.212+0100 7f5dc155e700  1 mds.0.9604 handle_mds_map state change up:rejoin --> up:active
2023-11-22T14:11:49.212+0100 7f5dc155e700  1 mds.0.9604 recovery_done -- successful recovery!
2023-11-22T14:11:49.212+0100 7f5dc155e700  1 mds.0.9604 active_start
2023-11-22T14:11:49.216+0100 7f5dc155e700  1 mds.0.9604 cluster recovered.
2023-11-22T14:11:49.216+0100 7f5dc3d63700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.8.127:0/2123529386 conn(0x55a60627a800 0x55a606e5b000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc4564700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.6.88:0/1899426587 conn(0x55a60627ac00 0x55a6070d0000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc4564700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.4.216:0/2058542052 conn(0x55a6070c9800 0x55a6070d1800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc3d63700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.4.220:0/1549374180 conn(0x55a60708d000 0x55a6070d0800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc4d65700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.8.180:0/270666178 conn(0x55a60703a000 0x55a6070cf800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc4d65700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.8.178:0/3673271488 conn(0x55a6070c9400 0x55a6070d1000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc4d65700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.4.167:0/2667964940 conn(0x55a6070c9c00 0x55a607112000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc3d63700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.6.70:0/3181830075 conn(0x55a607116000 0x55a607112800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc4564700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.245.6.72:0/3744737352 conn(0x55a60627a800 0x55a606e5b000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.216+0100 7f5dc3d63700  0 --1- [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] >> v1:10.244.18.140:0/1607447464 conn(0x55a60627ac00 0x55a6070d0000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-11-22T14:11:49.220+0100 7f5dc155e700  1 mds.mds1 Updating MDS map to version 9608 from mon.1
2023-11-22T14:11:49.220+0100 7f5dc155e700  1 mds.0.9604 handle_mds_map i am now mds.0.9604
2023-11-22T14:11:49.220+0100 7f5dc155e700  1 mds.0.9604 handle_mds_map state change up:active --> up:stopping
2023-11-22T14:11:52.412+0100 7f5dc3562700  1 mds.mds1 asok_command: client ls {prefix=client ls} (starting...)
2023-11-22T14:11:57.412+0100 7f5dc3562700  1 mds.mds1 asok_command: client ls {prefix=client ls} (starting...)
2023-11-22T14:12:02.416+0100 7f5dc3562700  1 mds.mds1 asok_command: client ls {prefix=client ls} (starting...)
2023-11-22T14:12:07.420+0100 7f5dc3562700  1 mds.mds1 asok_command: client ls {prefix=client ls} (starting...)
2023-11-22T14:12:12.420+0100 7f5dc3562700  1 mds.mds1 asok_command: client ls {prefix=client ls} (starting...)
2023-11-22T14:12:13.552+0100 7f5dc155e700  1 mds.mds1 Updating MDS map to version 9609 from mon.1
2023-11-22T14:12:13.552+0100 7f5dc155e700  1 mds.mds1 Map removed me [mds.mds1{0:5320528} state up:stopping seq 67 addr [v2:10.245.4.103:6800/1548097835,v1:10.245.4.103:6801/1548097835] compat {c=[1],r=[1],i=[7ff]}] from cluster; respawning! See cluster/monitor logs for details.
2023-11-22T14:12:13.552+0100 7f5dc155e700  1 mds.mds1 respawn!




