Hi Joshua,
I have had a similar issue three different times on one of my cephfs
pools (15.2.10). The first time this happened I had lost some OSDs. In
all cases I ended up with degraded PGs with unfound objects that could
not be recovered.
Here's how I recovered from the situation. Note that this will
permanently remove the affected files from ceph. Restoring them from
backup is an exercise left to the reader.
* Make a list of the affected PGs:
ceph pg dump_stuck | grep recovery_unfound > pg.txt
* Make a list of the affected objects (OIDs):
cat pg.txt | awk '{print $1}' | while read pg ; do
    echo $pg
    ceph pg $pg list_unfound | jq '.objects[].oid.oid'
done | sed -e 's/"//g' > oid.txt
* Convert the OID numbers to inodes using 'printf "%d\n" 0x${oid}' and
put the results in a file called 'inum.txt'
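For example, something like this should work (a sketch, assuming the
object names in oid.txt look like '<hex-inode>.<8-hex-digit-stripe>'; the
grep pattern skips the PG IDs that the loop above also echoed into the
file, and sort -u collapses multiple stripes of the same inode):
grep -E '^[0-9a-f]+\.[0-9a-f]{8}$' oid.txt | cut -d. -f1 | sort -u |
while read oid ; do
    printf "%d\n" 0x${oid}
done > inum.txt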
* On a ceph client, find the files that correspond to the affected inodes:
cat inum.txt | while read inum ; do
    echo -n "${inum} "
    find /ceph/frames/O3/raw -inum ${inum}
done > files.txt
* It may be helpful to put this table of PG, OID, inum, and files into a
spreadsheet to keep track of what's been done.
* On the ceph client, use 'unlink' to remove the files from the
filesystem. Do not use 'rm', as it will hang while calling 'stat()' on
each file. Even unlink may hang when you first try it. If it does
hang, do the following to get it unstuck:
- Reboot the client
- Restart each mon and the mgr. I rebooted each mon/mgr, but it may
be sufficient to restart the services without a reboot.
- Try using 'unlink' again
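For example, a minimal sketch that unlinks everything listed in
files.txt (assuming each line has the '<inum> <path>' format produced
above):
cut -d' ' -f2- files.txt | while read -r f ; do
    # skip inodes for which find returned no path
    [ -n "$f" ] && unlink "$f"
done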
* After all of the affected files have been removed, go through the list
of PGs and remove the unfound OIDs:
ceph pg $pgid mark_unfound_lost delete
...or if you're feeling brave, delete them all at once:
cat pg.txt | awk '{print $1}' | while read pg ; do
    echo $pg
    ceph pg $pg mark_unfound_lost delete
done
* Watch the output of 'ceph -s' to see the health of the pools/pgs recover.
* Restore the deleted files from backup, or decide that you don't care
about them and do nothing.
This procedure lets you fix the problem without deleting the affected
pool. To be honest, the first time it happened, my solution was to
first copy all of the data off of the affected pool and onto a new pool.
I later found this to be unnecessary. But if you want to pursue this,
here's what I suggest:
* Follow the steps above to get rid of the affected files. I feel this
should still be done even though you don't care about saving the data,
to prevent corruption in the cephfs metadata.
* Go through the entire filesystem and look for:
- files that are located on the pool (ceph.file.layout.pool = $pool_name)
- directories that are set to write files to the pool
(ceph.dir.layout.pool = $pool_name)
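A sketch of how to check this with getfattr (assuming the filesystem is
mounted at /ceph and $pool_name is the pool being retired; this walks the
whole tree, so it can take a while):
# files whose layout points at the pool
find /ceph -type f -exec sh -c \
    '[ "$(getfattr -n ceph.file.layout.pool --only-values "$2" 2>/dev/null)" = "$1" ] && echo "$2"' \
    _ "$pool_name" {} \;
# directories with an explicit layout pointing at the pool
find /ceph -type d -exec sh -c \
    '[ "$(getfattr -n ceph.dir.layout.pool --only-values "$2" 2>/dev/null)" = "$1" ] && echo "$2"' \
    _ "$pool_name" {} \;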
* After you confirm that no files or directories are pointing at the
pool anymore, run 'ceph df' and look at the number of objects in the
pool. Ideally, it would be zero. But more than likely it isn't. This
could be a simple mismatch in the object count in cephfs (harmless), or
there could be clients with open filehandles on files that have been
removed. Such objects will still appear in the rados listing of the
pool[1]:
rados -p $pool_name ls
for obj in $(rados -p $pool_name ls) ; do
    echo $obj
    rados -p $pool_name getxattr $obj parent | strings
done
* To check for clients with access to these stray objects, dump the mds
cache:
ceph daemon mds.ceph1 dump cache /tmp/cache.txt
* Look for lines that refer to the stray objects, like this:
[inode 0x10000020fbc [2,head] ~mds0/stray6/10000020fbc auth v7440537
s=252778863 nl=0 n(v0 rc2020-12-11T21:17:59.454863-0600 b252778863
1=1+0) (iversion lock) caps={9541437=pAsLsXsFscr/pFscr@2},l=9541437 |
caps=1 authpin=0 0x563a7e52a000]
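One way to find those lines (a sketch; the hex inode is the part of the
leftover object's name before the dot):
for obj in $(rados -p $pool_name ls) ; do
    grep "inode 0x${obj%%.*} " /tmp/cache.txt
done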
* The 'caps' field in the output above contains the client session id
(eg 9541437). Search the MDS for sessions that match to identify the
client:
ceph daemon mds.ceph1 session ls > session.txt
Search through 'session.txt' for matching entries. This will give
you the IP address of the client:
"id": 9541437,
"entity": {
"name": {
"type": "client",
"num": 9541437
},
"addr": {
"type": "v1",
"addr": "10.13.5.48:0",
"nonce": 2011077845
}
},
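If jq is available, you can also pull the matching entry straight from
the session list (a sketch, using the cap id 9541437 from the cache line
above and assuming the output is shaped like the excerpt above):
ceph daemon mds.ceph1 session ls |
    jq '.[] | select(.id == 9541437) | .entity.addr.addr'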
* Restart the client's connection to ceph to get it to drop the cap. I
did this by rebooting the client, but there may be gentler ways to do it.
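A possibly gentler option, which I have not tried here, is to evict the
session on the MDS (note that an evicted client is blocklisted by
default, so it may still need to remount):
ceph tell mds.ceph1 client evict id=9541437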
* Once you've done this clean up, it should be safe to remove the pool
from cephfs:
ceph fs rm_data_pool $fs_name $pool_name
* Once the pool has been detached from cephfs, you can remove it from
ceph altogether:
ceph osd pool rm $pool_name $pool_name --yes-i-really-really-mean-it
Hope this helps,
--Mike
[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005234.html
On 4/8/21 5:41 PM, Joshua West wrote:
Hey everyone.
Inside of cephfs, I have a directory which I set up with a directory
layout field to use an erasure coded (CLAY) pool, specific to the task. The
rest of my cephfs is using normal replication.
Fast forward some time, and the EC directory has been used pretty
extensively, and through some bad luck and poor timing, ~200pgs are in
an incomplete state, and the OSDs are completely gone and
unrecoverable. (Specifically OSD 31 and 34, not that it matters at
this point)
# ceph pg ls incomplete --> is attached for reference.
Fortunately, it's primarily (only) my on-site backups, and other
replaceable data inside of
I tried for a few days to recover the PGs:
- Recreate blank OSDs with correct ID (was blocked by non-existent OSDs)
- Deep Scrub
- osd_find_best_info_ignore_history_les = true (`pg query` was
showing related error)
etc.
I've finally just accepted this pool to be a lesson learned, and want
to get the rest of my cephfs back to normal.
My questions:
-- `ceph osd force-create-pg` doesn't appear to fix pgs, even for pgs
with 0 objects
-- Deleting the pool seems like an appropriate step, but as I am
using an xattr within cephfs, which is otherwise on another pool, I am
not confident that this approach is safe?
-- cephfs currently blocks when attempting to impact every third file
in the EC directory. Once I delete the pool, how will I remove the
files if even `rm` is blocking?
Thank you for your time,
Joshua West