Re: Abandon incomplete (damaged EC) pgs - How to manage the impact on cephfs?

Just working this through, how does one identify the OIDs within a PG
without list_unfound?

I've been poking around, but can't seem to find a command that outputs
the necessary OIDs.  I tried a handful of cephfs commands, but they of
course get stuck, and the ceph pg commands haven't revealed the OIDs
yet.
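One avenue that might work, if someone can confirm: recent releases let
'rados ls' be scoped to a single PG with --pgid, which would sidestep
list_unfound entirely.  A hedged sketch (the --pgid support and the pg.txt
layout from the steps further down this thread are assumptions):

```shell
# Hedged sketch: scope 'rados ls' to one PG at a time via --pgid, which
# avoids list_unfound.  Assumes pg.txt has one pgid in the first column
# of each line, as produced by the 'ceph pg dump_stuck' step below.
oids_in_pgs() {
    awk '{print $1}' "$1" | while read -r pg ; do
        rados ls --pgid "$pg" 2>/dev/null
    done
}
# usage: oids_in_pgs pg.txt > oid.txt
```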

Joshua


Joshua West
President
403-456-0072
CAYK.ca


On Fri, Apr 9, 2021 at 12:15 PM Joshua West <josh@xxxxxxx> wrote:
>
> Absolutely!
>
> I've attached the files; they're not duplicates, but revised versions (I
> tidied up what I could to make things easier).
>
> > Correct me if I'm wrong, but you are willing to throw away all of the data on this pool?
>
> Correct, if push comes to shove, I accept that data-loss is probable.
> If I can manage to save the data, I would definitely be okay with that
> too though.
>
> Still learning to program, but I know Python quite well.  I am going to
> start on a cleanup script following your previously noted steps, in the
> language I know! But I will hold off on unlinking everything for the
> moment.
>
> Thank you again for your time, your help has already been invaluable to me.
>
> Joshua
>
>
> Joshua West
> President
> 403-456-0072
> CAYK.ca
>
>
> On Fri, Apr 9, 2021 at 7:03 AM Michael Thomas <wart@xxxxxxxxxxx> wrote:
> >
> > Hi Joshua,
> >
> > I'll dig into this output a bit more later, but here are my thoughts
> > right now.  I'll preface this by saying that I've never had to clean up
> > from unrecoverable incomplete PGs, so some of what I suggest may not
> > work/apply or be the ideal fix in your case.
> >
> > Correct me if I'm wrong, but you are willing to throw away all of the
> > data on this pool?  This should make it easier because we don't have to
> > worry about recovering any lost data.
> >
> > If this is the case, then I think the general strategy would be:
> >
> > 1) Identify and remove any files/directories in cephfs that are located
> > on this pool (based on ceph.file.layout.pool=claypool and
> > ceph.dir.layout.pool=claypool).  Use 'unlink' instead of 'rm' to remove
> > the files; it should be less prone to hanging.
> >
> > 2) Wait a bit for ceph to clean up any unreferenced objects.  Watch the
> > output of 'ceph df' to see how many objects are listed for the pool.
> >
> > 3) Use 'rados -p claypool ls' to identify the remaining objects.  Use
> > the OID identifier to calculate the inode number of each file, then
> > search cephfs to identify which files these belong to.  I would expect
> > it would be none, as you already deleted the files in step 1.
> >
> > 4) With nothing in the cephfs metadata referring to the objects anymore,
> > it should be safe to remove them with 'rados -p claypool rm <oid>'.
> >
> > 5) Remove the now-empty pool from cephfs
> >
> > 6) Remove the now-empty pool from ceph
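> > A hedged sketch of step 1 (the mountpoint, the getfattr-based check, and
> > the helper name are assumptions; adjust to your client -- and remember to
> > check ceph.dir.layout.pool on directories the same way):

```shell
# Hedged sketch for step 1: list files whose layout points at the damaged
# pool, so they can be removed with 'unlink'.  Assumes getfattr from the
# attr package; the mountpoint and pool name are examples from this thread.
# Directories should get the same treatment via ceph.dir.layout.pool.
files_on_pool() {
    mnt=$1 ; pool=$2
    find "$mnt" -type f 2>/dev/null | while IFS= read -r f ; do
        p=$(getfattr -n ceph.file.layout.pool --only-values "$f" 2>/dev/null)
        if [ "$p" = "$pool" ] ; then printf '%s\n' "$f" ; fi
    done
}
# usage: files_on_pool /mnt/cephfs claypool | while read -r f ; do unlink "$f" ; done
```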
> >
> > Can you also include the output of 'ceph df'?
> >
> > --Mike
> >
> > On 4/9/21 7:31 AM, Joshua West wrote:
> > > Thank you Mike!
> > >
> > > This is honestly a way more detailed reply than I was expecting.
> > > You've equipped me with new tools to work with.  Thank you!
> > >
> > > I don't actually have any unfound pgs... only "incomplete" ones, which
> > > limits the usefulness of:
> > > `grep recovery_unfound`
> > > `ceph pg $pg list_unfound`
> > > `ceph pg $pg mark_unfound_lost delete`
> > >
> > > I don't seem to see equivalent commands for incomplete pgs, save for
> > > grep of course.
> > >
> > > This does make me slightly more hopeful that recovery might be
> > > possible if the pgs are incomplete and stuck, but not unfound?  Not
> > > going to get my hopes too high.
> > >
> > > Going to attach a few items to save some back-and-forth; if anyone
> > > can take a glance, it would be appreciated.
> > >
> > > In the meantime, in the absence of the above commands, what's the best
> > > way to clean this up under the assumption that the data is lost?
> > >
> > > ~Joshua
> > >
> > >
> > > Joshua West
> > > President
> > > 403-456-0072
> > > CAYK.ca
> > >
> > >
> > > On Thu, Apr 8, 2021 at 6:15 PM Michael Thomas <wart@xxxxxxxxxxx> wrote:
> > >>
> > >> Hi Joshua,
> > >>
> > >> I have had a similar issue three different times on one of my cephfs
> > >> pools (15.2.10). The first time this happened I had lost some OSDs.  In
> > >> all cases I ended up with degraded PGs with unfound objects that could
> > >> not be recovered.
> > >>
> > >> Here's how I recovered from the situation.  Note that this will
> > >> permanently remove the affected files from ceph.  Restoring them from
> > >> backup is an exercise left to the reader.
> > >>
> > >> * Make a list of the affected PGs:
> > >>     ceph pg dump_stuck  | grep recovery_unfound > pg.txt
> > >>
> > >> * Make a list of the affected objects (OIDs):
> > >>     cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg
> > >> $pg list_unfound | jq '.objects[].oid.oid' ; done | sed -e 's/"//g' >
> > >> oid.txt
> > >>
> > >> * Convert the OID numbers to inodes using 'printf "%d\n" 0x${oid}' and
> > >> put the results in a file called 'inum.txt'
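> > >> The conversion can be scripted; note that the object names carry a
> > >> stripe suffix (e.g. 10000020fbc.00000000 -- as in the example oid.txt
> > >> entries above), which should be stripped before the hex conversion:

```shell
# Hedged sketch: cephfs data objects are named <hex inode>.<stripe index>
# (e.g. 10000020fbc.00000000), so strip any stripe suffix before
# converting the hex inode number to decimal.
oids_to_inums() {
    while read -r oid ; do
        printf '%d\n' "0x${oid%%.*}"
    done < "$1"
}
# usage: oids_to_inums oid.txt | sort -u > inum.txt
```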
> > >>
> > >> * On a ceph client, find the files that correspond to the affected inodes:
> > >>     cat inum.txt | while read inum ; do echo -n "${inum} " ; find
> > >> /ceph/frames/O3/raw -inum ${inum} ; done > files.txt
> > >>
> > >> * It may be helpful to put this table of PG, OID, inum, and files into a
> > >> spreadsheet to keep track of what's been done.
> > >>
> > >> * On the ceph client, use 'unlink' to remove the files from the
> > >> filesystem.  Do not use 'rm', as it will hang while calling 'stat()' on
> > >> each file.  Even unlink may hang when you first try it.  If it does
> > >> hang, do the following to get it unstuck:
> > >>     - Reboot the client
> > >>     - Restart each mon and the mgr.  I rebooted each mon/mgr, but it may
> > >> be sufficient to restart the services without a reboot.
> > >>     - Try using 'unlink' again
> > >>
> > >> * After all of the affected files have been removed, go through the list
> > >> of PGs and remove the unfound OIDs:
> > >>     ceph pg $pgid mark_unfound_lost delete
> > >>
> > >> ...or if you're feeling brave, delete them all at once:
> > >>     cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg
> > >> $pg mark_unfound_lost delete ; done
> > >>
> > >> * Watch the output of 'ceph -s' to see the health of the pools/pgs recover.
> > >>
> > >> * Restore the deleted files from backup, or decide that you don't care
> > >> about them and don't do anything.
> > >>
> > >> This procedure lets you fix the problem without deleting the affected
> > >> pool.  To be honest, the first time it happened, my solution was to
> > >> first copy all of the data off of the affected pool and onto a new pool.
> > >> I later found this to be unnecessary.  But if you want to pursue this,
> > >> here's what I suggest:
> > >>
> > >> * Follow the steps above to get rid of the affected files.  I feel this
> > >> should still be done even though you don't care about saving the data,
> > >> to prevent corruption in the cephfs metadata.
> > >>
> > >> * Go through the entire filesystem and look for:
> > >>     - files that are located on the pool (ceph.file.layout.pool = $pool_name)
> > >>     - directories that are set to write files to the pool
> > >> (ceph.dir.layout.pool = $pool_name)
> > >>
> > >> * After you confirm that no files or directories are pointing at the
> > >> pool anymore, run 'ceph df' and look at the number of objects in the
> > >> pool.  Ideally, it would be zero.  But more than likely it isn't.  This
> > >> could be a simple mismatch in the object count in cephfs (harmless), or
> > >> there could be clients with open filehandles on files that have been
> > >> removed.  Such objects will still appear in the rados listing of the
> > >> pool[1]:
> > >>     rados -p $pool_name ls
> > >>     for obj in $(rados -p $pool_name ls); do echo $obj; rados -p
> > >> $pool_name getxattr $obj parent | strings; done
> > >>
> > >> * To check for clients with access to these stray objects, dump the mds
> > >> cache:
> > >>     ceph daemon mds.ceph1 dump cache /tmp/cache.txt
> > >>
> > >> * Look for lines that refer to the stray objects, like this:
> > >>     [inode 0x10000020fbc [2,head] ~mds0/stray6/10000020fbc auth v7440537
> > >> s=252778863 nl=0 n(v0 rc2020-12-11T21:17:59.454863-0600 b252778863
> > >> 1=1+0) (iversion lock) caps={9541437=pAsLsXsFscr/pFscr@2},l=9541437 |
> > >> caps=1 authpin=0 0x563a7e52a000]
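> > >> Extracting the session ids from those lines can be scripted -- a hedged
> > >> sketch, relying on the caps={<id>=...} layout shown in the example above:

```shell
# Hedged sketch: pull client session ids out of cache.txt lines that
# mention a given inode; relies on the caps={<id>=...} layout shown in
# the example cache line above.
cap_clients() {
    grep -F "$2" "$1" | sed -n 's/.*caps={\([0-9][0-9]*\)=.*/\1/p' | sort -u
}
# usage: cap_clients /tmp/cache.txt 0x10000020fbc
```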
> > >>
> > >> * The 'caps' field in the output above contains the client session id
> > >> (eg 9541437).  Search the MDS for sessions that match to identify the
> > >> client:
> > >>     ceph daemon mds.ceph1 session ls > session.txt
> > >>     Search through 'session.txt' for matching entries.  This will give
> > >> you the IP address of the client:
> > >>           "id": 9541437,
> > >>           "entity": {
> > >>               "name": {
> > >>                   "type": "client",
> > >>                   "num": 9541437
> > >>               },
> > >>               "addr": {
> > >>                   "type": "v1",
> > >>                   "addr": "10.13.5.48:0",
> > >>                   "nonce": 2011077845
> > >>               }
> > >>           },
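> > >> With jq that lookup is a one-liner -- a hedged sketch, assuming
> > >> 'session ls' returns a JSON array shaped like the excerpt above:

```shell
# Hedged sketch: map a cap session id to the client address, assuming
# 'session ls' output is a JSON array with .id and .entity.addr.addr
# fields as in the excerpt above.  Requires jq.
session_addr() {
    jq -r --argjson id "$2" '.[] | select(.id == $id) | .entity.addr.addr' "$1"
}
# usage: session_addr session.txt 9541437
```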
> > >>
> > >> * Restart the client's connection to ceph to get it to drop the cap.  I
> > >> did this by rebooting the client, but there may be gentler ways to do it.
> > >>
> > >> * Once you've done this clean up, it should be safe to remove the pool
> > >> from cephfs:
> > >>     ceph fs rm_data_pool $fs_name $pool_name
> > >>
> > >> * Once the pool has been detached from cephfs, you can remove it from
> > >> ceph altogether:
> > >>     ceph osd pool rm $pool_name $pool_name --yes-i-really-really-mean-it
> > >>
> > >> Hope this helps,
> > >>
> > >> --Mike
> > >> [1]http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005234.html
> > >>
> > >>
> > >>
> > >> On 4/8/21 5:41 PM, Joshua West wrote:
> > >>> Hey everyone.
> > >>>
> > >>> Inside of cephfs, I have a directory on which I set a directory layout
> > >>> field to use an erasure-coded (CLAY) pool, specific to the task.  The
> > >>> rest of my cephfs uses normal replication.
> > >>>
> > >>> Fast forward some time: the EC directory has been used pretty
> > >>> extensively, and through some bad luck and poor timing, ~200 PGs are in
> > >>> an incomplete state, and the OSDs are completely gone and
> > >>> unrecoverable.  (Specifically OSDs 31 and 34, not that it matters at
> > >>> this point.)
> > >>>
> > >>> # ceph pg ls incomplete --> is attached for reference.
> > >>>
> > >>> Fortunately, it's primarily (only) my on-site backups, and other
> > >>> replaceable data inside of that directory.
> > >>> I tried for a few days to recover the PGs:
> > >>>    - Recreate blank OSDs with the correct IDs (was blocked by non-existent OSDs)
> > >>>    - Deep Scrub
> > >>>    - osd_find_best_info_ignore_history_les = true (`pg query` was
> > >>> showing related error)
> > >>> etc.
> > >>>
> > >>> I've finally just accepted this pool to be a lesson learned, and want
> > >>> to get the rest of my cephfs back to normal.
> > >>>
> > >>> My questions:
> > >>>
> > >>>    -- `ceph osd force-create-pg` doesn't appear to fix pgs, even for pgs
> > >>> with 0 objects
> > >>>    -- Deleting the pool seems like an appropriate step, but as I am
> > >>> using an xattr within cephfs, which is otherwise on another pool, I am
> > >>> not confident that this approach is safe?
> > >>>    -- cephfs currently blocks when attempting to access every third file
> > >>> in the EC directory.  Once I delete the pool, how will I remove the
> > >>> files if even `rm` blocks?
> > >>>
> > >>> Thank you for your time,
> > >>>
> > >>> Joshua West
> > >>> _______________________________________________
> > >>> ceph-users mailing list -- ceph-users@xxxxxxx
> > >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > >>>
> > >>
> >


