Hey everyone.

Inside of cephfs, I have a directory on which I set a directory layout (via xattr) to use an erasure-coded (CLAY) pool, specific to the task. The rest of my cephfs uses normal replication.

Fast forward some time: the EC directory has been used pretty extensively, and through some bad luck and poor timing, ~200 PGs are in an incomplete state, and the OSDs backing them are completely gone and unrecoverable. (Specifically OSD 31 and 34, not that it matters at this point.)

`ceph pg ls incomplete` output is attached for reference.

Fortunately, it's primarily (only) my on-site backups and other replaceable data inside of that directory.

I tried for a few days to recover the PGs:

- Recreate blank OSDs with the correct IDs (was blocked by the non-existent OSDs)
- Deep scrub
- osd_find_best_info_ignore_history_les = true (`pg query` was showing a related error)
- etc.

I've finally accepted this pool as a lesson learned, and want to get the rest of my cephfs back to normal.

My questions:

-- `ceph osd force-create-pg` doesn't appear to fix PGs, even PGs with 0 objects.
-- Deleting the pool seems like an appropriate step, but since the pool is referenced via an xattr within cephfs (the rest of which lives on another pool), I am not confident that this approach is safe?
-- cephfs currently blocks when attempting to touch every third file in the EC directory. Once I delete the pool, how will I remove those files if even `rm` is blocking?

Thank you for your time,

Joshua West
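
P.S. In case it's useful context, here is roughly how the EC layout was applied and how I've been inspecting it. Directory and pool names below are placeholders rather than my exact ones, so treat this as a sketch:

    # Point the backup directory's layout at the CLAY EC pool,
    # so files created under it are stored there:
    setfattr -n ceph.dir.layout.pool -v cephfs_ec_clay /mnt/cephfs/backups

    # Inspect the current layout on the directory:
    getfattr -n ceph.dir.layout /mnt/cephfs/backups

    # Before deleting the pool, I assume I'd want to point the layout
    # back at the replicated data pool, e.g.:
    setfattr -n ceph.dir.layout.pool -v cephfs_data /mnt/cephfs/backups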