Hi! My CephFS is broken and I cannot recover the MDS daemons. Yesterday I updated my Ceph cluster from v15 to v16, and I thought everything was working fine. The next day (today), some of my services went down and threw errors. I dug into the problem and found that my CephFS is down: all MDS daemons are in standby mode, none of them is active, and they cannot be restarted successfully. My current status is:

    # ceph status
      cluster:
        id:     acd880fe-5f42-4930-8071-c4894c9b678e
        health: HEALTH_ERR
                1 filesystem is degraded
                1 filesystem is offline
                1 mds daemon damaged
                11 scrub errors
                Possible data damage: 3 pgs inconsistent
                2 daemons have recently crashed

      services:
        mon: 3 daemons, quorum pve04,pve05,pve06 (age 103m)
        mgr: pve04(active, since 107m), standbys: pve05, pve06
        mds: 0/1 daemons up, 3 standby
        osd: 30 osds: 30 up (since 103m), 30 in (since 8M)
        rgw: 3 daemons active (3 hosts, 1 zones)

      data:
        volumes: 0/1 healthy, 1 recovering; 1 damaged
        pools:   12 pools, 800 pgs
        objects: 483.49k objects, 1.8 TiB
        usage:   5.3 TiB used, 104 TiB / 109 TiB avail
        pgs:     797 active+clean
                 3   active+clean+inconsistent+failed_repair

      io:
        client:   255 B/s rd, 229 KiB/s wr, 0 op/s rd, 17 op/s wr

I know there are also 3 inconsistent PGs, but that is another story.

My next attempt was to repair the MDS:

    # ceph mds repaired 0
    repaired: restoring rank 1:0

The log output mentions "corrupt values". See: https://pastebin.com/AePicagc

So which file is corrupted? ceph.conf? The error "corrupt sessionmap values: Corrupt entity name in sessionmap" is thrown by this code: https://github.com/ceph/ceph/blob/master/src/mds/SessionMap.cc. There is also no "sessionmap" file on the hard drive:

    # find / -name '*.sessionmap'
    -> no results!
My next attempt was the harder way. So far I have tried this:

    # systemctl stop ceph-mds@pve04.service
    # cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
    # cephfs-journal-tool --rank=cephfs:0 journal reset
    # cephfs-table-tool all reset session
    # systemctl start ceph-mds@pve04.service
    # ceph mds repaired 0

This is the log output: https://pastebin.com/DBRq8iwM

The errors are not the same, but they are similar...

I'm also a little confused by the definition of `ceph::buffer::v15_2_0::list`. Am I actually running Ceph v16?!

My virtual environment runs on top of this Ceph cluster. Most of my VMs are still running, but for how long?

I'm very happy about any support!

Regards,
Volker

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx