Re: Abandon incomplete (damaged EC) pgs - How to manage the impact on cephfs?

Hi Joshua,

I have had a similar issue three different times on one of my cephfs pools (15.2.10). The first time this happened I had lost some OSDs. In all cases I ended up with degraded PGs with unfound objects that could not be recovered.

Here's how I recovered from the situation. Note that this will permanently remove the affected files from ceph. Restoring them from backup is an exercise left to the reader.

* Make a list of the affected PGs:
  ceph pg dump_stuck  | grep recovery_unfound > pg.txt

* Make a list of the affected objects (OIDs):
  cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg $pg list_unfound | jq '.objects[].oid.oid' ; done | sed -e 's/"//g' > oid.txt

* Convert the OID numbers to inodes using 'printf "%d\n" 0x${oid}' and put the results in a file called 'inum.txt'
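For example, a minimal sketch of that conversion (my assumptions: oid.txt contains both the echoed PG ids and object names of the form <hex-inode>.<hex-offset>, so the suffix needs to be stripped before converting):
  grep '\.[0-9a-f]\{8\}$' oid.txt | while read oid ; do printf "%d\n" "0x${oid%%.*}" ; done | sort -u > inum.txt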

* On a ceph client, find the files that correspond to the affected inodes:
  cat inum.txt | while read inum ; do echo -n "${inum} " ; find /ceph/frames/O3/raw -inum ${inum} ; done > files.txt

* It may be helpful to put this table of PG, OID, inum, and files into a spreadsheet to keep track of what's been done.

* On the ceph client, use 'unlink' to remove the files from the filesystem. Do not use 'rm', as it will hang while calling 'stat()' on each file. Even unlink may hang when you first try it. If it does hang, do the following to get it unstuck:
  - Reboot the client
  - Restart each mon and the mgr. I rebooted each mon/mgr, but it may be sufficient to restart the services without a reboot.
  - Try using 'unlink' again
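If there are many files, here is a sketch of a loop over files.txt (assuming its lines look like '<inum> <path>' as produced above and that no paths contain spaces; the NF > 1 test skips inodes for which find returned no path):
  awk 'NF > 1 {print $2}' files.txt | while read -r f ; do unlink "$f" ; done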

* After all of the affected files have been removed, go through the list of PGs and remove the unfound OIDs:
  ceph pg $pgid mark_unfound_lost delete

...or if you're feeling brave, delete them all at once:
  cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg $pg mark_unfound_lost delete ; done

* Watch the output of 'ceph -s' to see the health of the pools/pgs recover.

* Restore the deleted files from backup, or decide that you don't care about them and do nothing.

This procedure lets you fix the problem without deleting the affected pool. To be honest, the first time this happened, my solution was to copy all of the data off of the affected pool and onto a new pool. I later found this to be unnecessary. But if you want to pursue that route, here's what I suggest:

* Follow the steps above to get rid of the affected files. I feel this should still be done, even if you don't care about saving the data, to prevent corruption in the cephfs metadata.

* Go through the entire filesystem and look for:
  - files that are located on the pool (ceph.file.layout.pool = $pool_name)
  - directories that are set to write files to the pool (ceph.dir.layout.pool = $pool_name)
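One way to do that scan (just a sketch; $MNT below is a placeholder for your cephfs mount point, $pool_name is the affected pool, and getfattr comes from the attr package):
  MNT=/mnt/cephfs   # hypothetical mount point
  find "$MNT" -type f | while read -r f ; do
      [ "$(getfattr -n ceph.file.layout.pool --only-values "$f" 2>/dev/null)" = "$pool_name" ] && echo "$f"
  done > files_on_pool.txt
  find "$MNT" -type d | while read -r d ; do
      [ "$(getfattr -n ceph.dir.layout.pool --only-values "$d" 2>/dev/null)" = "$pool_name" ] && echo "$d"
  done > dirs_on_pool.txt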

* After you confirm that no files or directories are pointing at the pool anymore, run 'ceph df' and look at the number of objects in the pool. Ideally it would be zero, but more than likely it isn't. This could be a simple mismatch in the object count in cephfs (harmless), or there could be clients with open filehandles on files that have been removed. Such objects will still appear in the rados listing of the pool[1]:
  rados -p $pool_name ls
  for obj in $(rados -p $pool_name ls); do echo $obj; rados -p $pool_name getxattr $obj parent | strings; done

* To check for clients with access to these stray objects, dump the mds cache:
  ceph daemon mds.ceph1 dump cache /tmp/cache.txt

* Look for lines that refer to the stray objects, like this:
[inode 0x10000020fbc [2,head] ~mds0/stray6/10000020fbc auth v7440537 s=252778863 nl=0 n(v0 rc2020-12-11T21:17:59.454863-0600 b252778863 1=1+0) (iversion lock) caps={9541437=pAsLsXsFscr/pFscr@2},l=9541437 | caps=1 authpin=0 0x563a7e52a000]
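One way to find them (a sketch; assumes the stray objects' rados names look like <hex-inode>.<hex-offset>, so the part before the dot is the inode number shown in the cache dump):
  for obj in $(rados -p $pool_name ls) ; do grep "inode 0x${obj%%.*} " /tmp/cache.txt ; done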

* The 'caps' field in the output above contains the client session id (e.g. 9541437). Search the MDS sessions for a matching id to identify the client:
  ceph daemon mds.ceph1 session ls > session.txt
Search through 'session.txt' for matching entries. This will give you the IP address of the client:
        "id": 9541437,
        "entity": {
            "name": {
                "type": "client",
                "num": 9541437
            },
            "addr": {
                "type": "v1",
                "addr": "10.13.5.48:0",
                "nonce": 2011077845
            }
        },
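If jq is available, you can pull the matching entry directly instead of searching by hand (a sketch, using the session id from the example above and assuming 'session ls' returns a JSON array):
  jq '.[] | select(.id == 9541437) | .entity.addr.addr' session.txt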

* Restart the client's connection to ceph to get it to drop the cap. I did this by rebooting the client, but there may be gentler ways to do it.
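A possibly gentler alternative, which I have not tested in this scenario, is to evict just that session on the MDS, e.g.:
  ceph tell mds.ceph1 client evict id=9541437
Note that an evicted client is blocklisted by default and may need to remount the filesystem.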

* Once you've done this cleanup, it should be safe to remove the pool from cephfs:
  ceph fs rm_data_pool $fs_name $pool_name

* Once the pool has been detached from cephfs, you can remove it from ceph altogether:
  ceph osd pool rm $pool_name $pool_name --yes-i-really-really-mean-it

Hope this helps,

--Mike
[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005234.html



On 4/8/21 5:41 PM, Joshua West wrote:
Hey everyone.

Inside of cephfs, I have a directory on which I set up a directory layout
field to use an erasure coded (CLAY) pool, specific to the task. The
rest of my cephfs is using normal replication.

Fast forward some time, and the EC directory has been used pretty
extensively, and through some bad luck and poor timing, ~200pgs are in
an incomplete state, and the OSDs are completely gone and
unrecoverable. (Specifically OSD 31 and 34, not that it matters at
this point)

# ceph pg ls incomplete --> is attached for reference.

Fortunately, it's primarily (only) my on-site backups, and other
replaceable data inside of this directory.

I tried for a few days to recover the PGs:
  - Recreate blank OSDs with correct ID (was blocked by non-existent OSDs)
  - Deep Scrub
  - osd_find_best_info_ignore_history_les = true (`pg query` was
showing related error)
etc.

I've finally just accepted this pool to be a lesson learned, and want
to get the rest of my cephfs back to normal.

My questions:

  -- `ceph osd force-create-pg` doesn't appear to fix pgs, even for pgs
with 0 objects
  -- Deleting the pool seems like an appropriate step, but as I am
using an xattr within cephfs, which is otherwise on another pool, I am
not confident that this approach is safe?
  -- cephfs currently blocks when attempting to impact every third file
in the EC directory. Once I delete the pool, how will I remove the
files if even `rm` is blocking?

Thank you for your time,

Joshua West
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
