Re: Abandon incomplete (damaged EC) pgs - How to manage the impact on cephfs?

Hi Joshua,

I have had a similar issue three different times on one of my cephfs pools (15.2.10). The first time this happened I had lost some OSDs. In all cases I ended up with degraded PGs with unfound objects that could not be recovered.

Here's how I recovered from the situation. Note that this will permanently remove the affected files from ceph. Restoring them from backup is an exercise left to the reader.

* Make a list of the affected PGs:
  ceph pg dump_stuck  | grep recovery_unfound > pg.txt

* Make a list of the affected objects (OIDs):
  cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg $pg list_unfound | jq '.objects[].oid.oid' ; done | sed -e 's/"//g' > oid.txt

* Convert the OID numbers to inodes using 'printf "%d\n" 0x${oid}' and put the results in a file called 'inum.txt'
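For example, a minimal sketch of that conversion (my assumptions: oid.txt contains both the echoed PG ids and object names of the form <hex-inode>.<hex-offset>, so the suffix needs to be stripped before converting):
  grep '\.[0-9a-f]\{8\}$' oid.txt | while read oid ; do printf "%d\n" "0x${oid%%.*}" ; done | sort -u > inum.txt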

* On a ceph client, find the files that correspond to the affected inodes:
  cat inum.txt | while read inum ; do echo -n "${inum} " ; find /ceph/frames/O3/raw -inum ${inum} ; done > files.txt

* It may be helpful to put this table of PG, OID, inum, and files into a spreadsheet to keep track of what's been done.

* On the ceph client, use 'unlink' to remove the files from the filesystem. Do not use 'rm', as it will hang while calling 'stat()' on each file. Even unlink may hang when you first try it. If it does hang, do the following to get it unstuck:
  - Reboot the client
  - Restart each mon and the mgr. I rebooted each mon/mgr, but it may be sufficient to restart the services without a reboot.
  - Try using 'unlink' again
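If there are many files, here is a sketch of a loop over files.txt (assuming its lines look like '<inum> <path>' as produced above and that no paths contain spaces; the NF > 1 test skips inodes for which find returned no path):
  awk 'NF > 1 {print $2}' files.txt | while read -r f ; do unlink "$f" ; done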

* After all of the affected files have been removed, go through the list of PGs and remove the unfound OIDs:
  ceph pg $pgid mark_unfound_lost delete

...or if you're feeling brave, delete them all at once:
  cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg $pg mark_unfound_lost delete ; done

* Watch the output of 'ceph -s' to see the health of the pools/pgs recover.

* Restore the deleted files from backup, or decide that you don't care about them and do nothing.

This procedure lets you fix the problem without deleting the affected pool. To be honest, the first time this happened, my solution was to copy all of the data off of the affected pool and onto a new pool. I later found this to be unnecessary. But if you want to pursue that route, here's what I suggest:

* Follow the steps above to get rid of the affected files. I feel this should still be done, even if you don't care about saving the data, to prevent corruption in the cephfs metadata.

* Go through the entire filesystem and look for:
  - files that are located on the pool (ceph.file.layout.pool = $pool_name)
  - directories that are set to write files to the pool (ceph.dir.layout.pool = $pool_name)
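One way to do that scan (just a sketch; $MNT below is a placeholder for your cephfs mount point, $pool_name is the affected pool, and getfattr comes from the attr package):
  MNT=/mnt/cephfs   # hypothetical mount point
  find "$MNT" -type f | while read -r f ; do
      [ "$(getfattr -n ceph.file.layout.pool --only-values "$f" 2>/dev/null)" = "$pool_name" ] && echo "$f"
  done > files_on_pool.txt
  find "$MNT" -type d | while read -r d ; do
      [ "$(getfattr -n ceph.dir.layout.pool --only-values "$d" 2>/dev/null)" = "$pool_name" ] && echo "$d"
  done > dirs_on_pool.txt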

* After you confirm that no files or directories are pointing at the pool anymore, run 'ceph df' and look at the number of objects in the pool. Ideally it would be zero, but more than likely it isn't. This could be a simple mismatch in the object count in cephfs (harmless), or there could be clients with open filehandles on files that have been removed. Such objects will still appear in the rados listing of the pool[1]:
  rados -p $pool_name ls
  for obj in $(rados -p $pool_name ls); do echo $obj; rados -p $pool_name getxattr $obj parent | strings; done

* To check for clients with access to these stray objects, dump the mds cache:
  ceph daemon mds.ceph1 dump cache /tmp/cache.txt

* Look for lines that refer to the stray objects, like this:
[inode 0x10000020fbc [2,head] ~mds0/stray6/10000020fbc auth v7440537 s=252778863 nl=0 n(v0 rc2020-12-11T21:17:59.454863-0600 b252778863 1=1+0) (iversion lock) caps={9541437=pAsLsXsFscr/pFscr@2},l=9541437 | caps=1 authpin=0 0x563a7e52a000]
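One way to find them (a sketch; assumes the stray objects' rados names look like <hex-inode>.<hex-offset>, so the part before the dot is the inode number shown in the cache dump):
  for obj in $(rados -p $pool_name ls) ; do grep "inode 0x${obj%%.*} " /tmp/cache.txt ; done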

* The 'caps' field in the output above contains the client session id (e.g. 9541437). Search the MDS sessions for a matching id to identify the client:
  ceph daemon mds.ceph1 session ls > session.txt
Search through 'session.txt' for matching entries. This will give you the IP address of the client:
        "id": 9541437,
        "entity": {
            "name": {
                "type": "client",
                "num": 9541437
            },
            "addr": {
                "type": "v1",
                "addr": "10.13.5.48:0",
                "nonce": 2011077845
            }
        },
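If jq is available, you can pull the matching entry directly instead of searching by hand (a sketch, using the session id from the example above and assuming 'session ls' returns a JSON array):
  jq '.[] | select(.id == 9541437) | .entity.addr.addr' session.txt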

* Restart the client's connection to ceph to get it to drop the cap. I did this by rebooting the client, but there may be gentler ways to do it.
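A possibly gentler alternative, which I have not tested in this scenario, is to evict just that session on the MDS, e.g.:
  ceph tell mds.ceph1 client evict id=9541437
Note that an evicted client is blocklisted by default and may need to remount the filesystem.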

* Once you've done this cleanup, it should be safe to remove the pool from cephfs:
  ceph fs rm_data_pool $fs_name $pool_name

* Once the pool has been detached from cephfs, you can remove it from ceph altogether:
  ceph osd pool rm $pool_name $pool_name --yes-i-really-really-mean-it

Hope this helps,

--Mike
[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005234.html



On 4/8/21 5:41 PM, Joshua West wrote:
Hey everyone.

Inside of cephfs, I have a directory on which I set up a directory layout
field to use an erasure coded (CLAY) pool, specific to the task. The
rest of my cephfs is using normal replication.

Fast forward some time, and the EC directory has been used pretty
extensively, and through some bad luck and poor timing, ~200pgs are in
an incomplete state, and the OSDs are completely gone and
unrecoverable. (Specifically OSD 31 and 34, not that it matters at
this point)

# ceph pg ls incomplete --> is attached for reference.

Fortunately, it's primarily (only) my on-site backups, and other
replaceable data inside of this directory.

I tried for a few days to recover the PGs:
  - Recreate blank OSDs with correct ID (was blocked by non-existent OSDs)
  - Deep Scrub
  - osd_find_best_info_ignore_history_les = true (`pg query` was
showing related error)
etc.

I've finally just accepted this pool to be a lesson learned, and want
to get the rest of my cephfs back to normal.

My questions:

  -- `ceph osd force-create-pg` doesn't appear to fix pgs, even for pgs
with 0 objects
  -- Deleting the pool seems like an appropriate step, but as I am
using an xattr within cephfs, which is otherwise on another pool, I am
not confident that this approach is safe?
  -- cephfs currently blocks when attempting to impact every third file
in the EC directory. Once I delete the pool, how will I remove the
files if even `rm` is blocking?

Thank you for your time,

Joshua West
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
