PGs stuck incomplete on EC pool after multiple drive failure



Hello all.

I have a cluster with ~80TB of spinning disk. Its primary role is
CephFS. Recently I had a multiple-drive failure (the drives did not
fail simultaneously), and it has left me with 20 incomplete PGs.

I know this data is toast, but I need to be able to get what isn't
toast out of CephFS, i.e. out of that pool and into a new pool.
The issue is that the incomplete PGs block IO, and that hinders
browsing the filesystem.

I'm attempting to use the "new" ceph-objectstore-tool mark-complete
operation, but I'm struggling to work out what to mark complete,
since the pool is EC and each PG is made up of multiple shards (I
think that's the right word), each with its own status.
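For context, this is how I've been looking at the per-shard state (the PG id below is a placeholder, and the jq filters are just my way of trimming the output):

```shell
# Overall PG state plus the acting set; for an EC pool the acting
# set entries correspond to shard positions.
ceph pg 9.1a query | jq '.state, .acting'

# The per-shard peering details live under recovery_state.
ceph pg 9.1a query | jq '.recovery_state'
```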

I did manage to mark one of these PG shards complete on what appeared
to be the primary OSD, but it had no effect on that shard when I
checked it with ceph pg X query. By that I mean the shard was marked
incomplete both before and after running ceph-objectstore-tool.
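For reference, this is roughly the procedure I followed on that OSD (the OSD id, data path, and PG id are placeholders; I understand mark-complete is a last resort that abandons whatever data the shard is missing):

```shell
# Stop the OSD so ceph-objectstore-tool can open its object store.
systemctl stop ceph-osd@12

# Mark the EC shard complete; note the sN shard suffix on the pgid.
ceph-objectstore-tool \
    --data-path /var/lib/ceph/osd/ceph-12 \
    --pgid 9.1as0 \
    --op mark-complete

# Bring the OSD back and re-check the PG.
systemctl start ceph-osd@12
ceph pg 9.1a query | jq '.state'
```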

I'm running Ceph 18.2.2, and all of my OSDs are HDDs. I can gather any
logs that would help; I'm just not 100% sure where to start, and I
don't want to dump 20 PGs' worth of ceph pg X query output on the
mailing list.

Thanks so much

ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
