Hello helpful mailing list folks!

After a networking outage, I had an MDS rank failure (originally 3 MDS ranks) that has left my CephFS cluster in bad shape. I worked through most of the Disaster Recovery guide (https://docs.ceph.com/docs/nautilus/cephfs/disaster-recovery-experts/#disaster-recovery-experts) and got my CephFS remounted and (mostly) available. I have additionally completed the lengthy extents and inodes scans. For the most part, things are working fine, though for now I have reduced my max MDS down to 1. However, it looks like I have an MDS dir_frag issue and damaged metadata on a specific directory when it is accessed. Here are the relevant commands/outputs:

# ceph version
ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)

# ceph -s
  cluster:
    id:     <redacted>
    health: HEALTH_ERR
            1 MDSs report damaged metadata

  services:
    mon: 3 daemons, quorum katz-c1,katz-c2,katz-c3 (age 6d)
    mgr: katz-c2(active, since 10d), standbys: katz-c1, katz-c3
    mds: cephfs-katz:1 {0=katz-mds-3=up:active} 5 up:standby
    osd: 9 osds: 9 up (since 6d), 9 in (since 6d)

  data:
    pools:   7 pools, 312 pgs
    objects: 19.82M objects, 1.5 TiB
    usage:   6.6 TiB used, 11 TiB / 17 TiB avail
    pgs:     311 active+clean
             1   active+clean+scrubbing+deep

# ceph health detail
HEALTH_ERR 1 MDSs report damaged metadata
MDS_DAMAGE 1 MDSs report damaged metadata
    mdskatz-mds-3(mds.0): Metadata damage detected

# ceph tell mds.0 damage ls
[
    {
        "damage_type": "dir_frag",
        "id": 575440387,
        "ino": 1099550285476,
        "frag": "*",
        "path": "/bad/TSAL/conf8N5LVl"
    }
]

# ceph tell mds.0 scrub start /bad recursive repair
{
    "return_code": 0,
    "scrub_tag": "887fa41d-4643-4b2d-bb7d-8f96c02c2b4d",
    "mode": "asynchronous"
}

After a few seconds:

# ceph tell mds.0 scrub status
{
    "status": "no active scrubs running",
    "scrubs": {}
}

The scrub does not appear to do anything to fix the issue. I have isolated the directory in my file system (/bad) and do not need its contents anymore (backups woo!), but a typical "rm -rf" on the directory fails. The next steps for recovery are where I am struggling. There have been other emails to this list about this topic before (https://www.spinics.net/lists/ceph-users/msg53211.html), but the commands referenced are a bit foreign to me, and I was wondering if you all could provide some additional insight on the exact commands needed.

From what I can gather from the previous thread, I need to:

1. Get the inode for the parent directory (/bad/TSAL):

# cd /mnt/cephfs/bad
# stat TSAL
  File: TSAL
  Size: 3               Blocks: 0          IO Block: 65536  directory
Device: 2eh/46d         Inode: 1099550201759  Links: 4
Access: (0700/drwx------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-06-01 15:50:41.000000000 -0400
Modify: 2020-06-11 15:21:42.079970103 -0400
Change: 2020-06-11 15:21:42.079970103 -0400
 Birth: -

(So in this case, /bad/TSAL inode: 1099550201759)

2. Check if omap key '1_head' exists in object <inode of directory in hex>.00000000. If it exists, remove it.

This is where I am clueless on how to continue. How do I check whether the omap key '1_head' exists, and if so, remove it? What commands am I working with here?

(Inode decimal to hex: 1099550201759 -> 100024C979F)

Thank you much!

Chris Wieringa
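
P.S. Reading the rados(8) man page, my best guess at step 2 is something like the sketch below. The metadata pool name ("cephfs-katz-metadata") is just a placeholder on my part; I would substitute whatever "ceph fs ls" reports as the metadata pool for cephfs-katz, and the object name is the lowercase hex inode of /bad/TSAL plus the ".00000000" fragment suffix. Is this the right idea, or am I off base?

# ceph fs ls
(note the metadata pool name for cephfs-katz)

# rados -p cephfs-katz-metadata listomapkeys 100024c979f.00000000
(check the output for a '1_head' key)

# rados -p cephfs-katz-metadata rmomapkey 100024c979f.00000000 1_head
(only if '1_head' actually appears in the listomapkeys output)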