Hi all,

we have a Ceph cluster (12.2.1) with 9 MDS ranks in multi-mds mode. Out of the blue, rank 6 was marked as damaged (and all other MDS daemons went to up:resolve), and I can't bring the FS up again.

'ceph -s' says:

[...]
    1 filesystem is degraded
    1 mds daemon damaged

  mds: cephfs-8/9/9 up {0=mds1=up:resolve,1=mds2=up:resolve,2=mds3=up:resolve,3=mds4=up:resolve,4=mds5=up:resolve,5=mds6=up:resolve,7=mds7=up:resolve,8=mds8=up:resolve}, 1 up:standby, 1 damaged
[...]

'ceph fs get cephfs' says:

[...]
max_mds 9
in      0,1,2,3,4,5,6,7,8
up      {0=28309098,1=28309128,2=28309149,3=28309188,4=28309209,5=28317918,7=28311732,8=28312272}
failed
damaged 6
stopped
[...]
28309098: 147.87.226.60:6800/2627352929 'mds1' mds.0.95936 up:resolve seq 3
28309128: 147.87.226.61:6800/416822271 'mds2' mds.1.95939 up:resolve seq 3
28309149: 147.87.226.62:6800/1969015920 'mds3' mds.2.95942 up:resolve seq 3
28309188: 147.87.226.184:6800/4074580566 'mds4' mds.3.95945 up:resolve seq 3
28309209: 147.87.226.185:6800/805082194 'mds5' mds.4.95948 up:resolve seq 3
28317918: 147.87.226.186:6800/1913199036 'mds6' mds.5.95984 up:resolve seq 3
28311732: 147.87.226.187:6800/4117561729 'mds7' mds.7.95957 up:resolve seq 3
28312272: 147.87.226.188:6800/2936268159 'mds8' mds.8.95960 up:resolve seq 3

I think I've tried almost everything already, without success :( (the exact commands are sketched further down), including:

* stopping all MDS daemons and bringing them up one after another (works fine for the first ones up to rank 5, then the next daemon just grabs rank 7 and no MDS after that wants to take rank 6)
* stopping all MDS daemons, flushing the MDS journal, manually marking rank 6 as repaired, then starting all MDS daemons again
* switching back to a single MDS (stopping all MDS daemons, setting max_mds=1, disallowing multi-mds, disallowing dirfrag, removing "mds_bal_frag=true" from ceph.conf, then starting the first MDS); this didn't work either, the single MDS stayed in up:resolve forever
* during all of the above, all CephFS clients were unmounted, so there is no access (stale or otherwise) to the FS
* searching the mailing list archive; I did find a few related threads, but nothing conclusive on how to get the FS back online ("formatting" the FS is not an option)

I didn't dare to try 'ceph mds rmfailed 6' for fear of data loss. How can I get rank 6 back online?
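For reference, these are roughly the commands behind the steps above. I'm reconstructing them from my shell history, so please treat this as a sketch rather than a verbatim transcript: the daemon IDs mds1..mds9 and the FS name "cephfs" are ours, and I may be misremembering the exact spelling of the multi-mds/dirfrag settings in 12.2.1:

  # stop/start the daemons one by one (repeated for mds1..mds9)
  systemctl stop ceph-mds@mds1
  systemctl start ceph-mds@mds1

  # flush the journal of a running MDS via its admin socket
  ceph daemon mds.mds1 flush journal

  # mark the damaged rank as repaired
  ceph mds repaired cephfs:6

  # attempt to fall back to a single active MDS
  ceph fs set cephfs max_mds 1
  ceph fs set cephfs allow_multimds false
  ceph fs set cephfs allow_dirfrags false
  # plus removing "mds_bal_frag = true" from ceph.conf before restarting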
The relevant portion of the ceph-mds log (when starting mds9, which should then take over rank 6; I'm happy to provide more logs):

---snip---
2017-10-09 08:55:56.418237 7f1ec6ef3240 0 ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable), process (unknown), pid 421
2017-10-09 08:55:56.421672 7f1ec6ef3240 0 pidfile_write: ignore empty --pid-file
2017-10-09 08:56:00.990530 7f1ebf457700 1 mds.mds9 handle_mds_map standby
2017-10-09 08:56:00.997044 7f1ebf457700 1 mds.6.95988 handle_mds_map i am now mds.6.95988
2017-10-09 08:56:00.997053 7f1ebf457700 1 mds.6.95988 handle_mds_map state change up:boot --> up:replay
2017-10-09 08:56:00.997068 7f1ebf457700 1 mds.6.95988 replay_start
2017-10-09 08:56:00.997076 7f1ebf457700 1 mds.6.95988 recovery set is 0,1,2,3,4,5,7,8
2017-10-09 08:56:01.003203 7f1eb8c4a700 0 mds.6.cache creating system inode with ino:0x106
2017-10-09 08:56:01.003592 7f1eb8c4a700 0 mds.6.cache creating system inode with ino:0x1
2017-10-09 08:56:01.016403 7f1eba44d700 -1 mds.6.journaler.pq(ro) _decode error from assimilate_prefetch
2017-10-09 08:56:01.016425 7f1eba44d700 -1 mds.6.purge_queue _recover: Error -22 recovering write_pos
2017-10-09 08:56:01.019746 7f1eba44d700 1 mds.mds9 respawn
2017-10-09 08:56:01.019762 7f1eba44d700 1 mds.mds9 e: '/usr/bin/ceph-mds'
2017-10-09 08:56:01.019765 7f1eba44d700 1 mds.mds9 0: '/usr/bin/ceph-mds'
2017-10-09 08:56:01.019767 7f1eba44d700 1 mds.mds9 1: '-f'
2017-10-09 08:56:01.019769 7f1eba44d700 1 mds.mds9 2: '--cluster'
2017-10-09 08:56:01.019771 7f1eba44d700 1 mds.mds9 3: 'ceph'
2017-10-09 08:56:01.019772 7f1eba44d700 1 mds.mds9 4: '--id'
2017-10-09 08:56:01.019773 7f1eba44d700 1 mds.mds9 5: 'mds9'
2017-10-09 08:56:01.019774 7f1eba44d700 1 mds.mds9 6: '--setuser'
2017-10-09 08:56:01.019775 7f1eba44d700 1 mds.mds9 7: 'ceph'
2017-10-09 08:56:01.019776 7f1eba44d700 1 mds.mds9 8: '--setgroup'
2017-10-09 08:56:01.019778 7f1eba44d700 1 mds.mds9 9: 'ceph'
2017-10-09 08:56:01.019811 7f1eba44d700 1 mds.mds9 respawning with exe /usr/bin/ceph-mds
2017-10-09 08:56:01.019814 7f1eba44d700 1 mds.mds9 exe_path /proc/self/exe
2017-10-09 08:56:01.046396 7f5ed6090240 0 ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable), process (unknown), pid 421
2017-10-09 08:56:01.049516 7f5ed6090240 0 pidfile_write: ignore empty --pid-file
2017-10-09 08:56:05.162732 7f5ecee32700 1 mds.mds9 handle_mds_map standby
[...]
---snap---

Regards,
Daniel
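P.S.: The "_decode error from assimilate_prefetch" and "Error -22 recovering write_pos" lines above make me suspect the purge queue journal of rank 6 rather than the mdlog. If I understand the layout correctly, the purge queue of rank N is a Journaler on inode 0x500+N, i.e. objects 506.* in the metadata pool for rank 6, but that is an assumption on my part. As a first, purely read-only step I was thinking of something along these lines (the pool name "cephfs_metadata" is a placeholder for ours):

  # list the purge-queue objects of rank 6 (assuming inode 0x500 + rank)
  rados -p cephfs_metadata ls | grep '^506\.'

  # inspect and back up the journal header object before touching anything
  rados -p cephfs_metadata stat 506.00000000
  rados -p cephfs_metadata get 506.00000000 /tmp/pq-506-header.bin

Would poking at that journal (e.g. resetting it) be a sane next step, or is there something less invasive I should try first?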