Bluestore migration disaster - incomplete pgs recovery process and progress (in progress)

Below is the status of my disastrous, self-inflicted journey. I will preface this by admitting that no amount of software trying to keep me from being stupid could have prevented this.

I have a production cluster with over 350 XFS-backed OSDs running Luminous. We want to transition the cluster to Bluestore in order to enable EC for CephFS. We are currently at 75+% utilization, and erasure coding could really help us reclaim some much-needed capacity. Formatting one OSD at a time and waiting for the cluster to backfill after every disk was going to take a very long time (an estimated 240+ days, based on our observations), while formatting an entire host at once caused a little too much turbulence in the cluster. Furthermore, we could start the transition to EC before the entire cluster was migrated, as long as enough hosts had enough disks running Bluestore. As such, I decided to parallelize. The general idea was that we could safely format any OSD that had nothing other than active+clean PGs associated with it. I maintain that this method should work, but something went terribly wrong with the script, and we somehow formatted disks in a manner that left PGs in an incomplete state. It's now pretty obvious that the affected PGs were backfilling to other OSDs when the script clobbered the last remaining good set of objects.
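
For illustration, a minimal sketch of the kind of per-OSD check this relies on (this is not the script that actually ran, and the JSON shape of ceph pg ls-by-osd output varies by release, so treat the jq path as an assumption):

    #!/bin/bash
    # Sketch only: refuse to touch an OSD unless every PG mapped to it is
    # active+clean.  Assumes "ceph pg ls-by-osd --format=json" returns a flat
    # array of entries with "pgid" and "state" fields -- verify on your own
    # release before trusting it.
    osdid=$1

    unclean=$(ceph pg ls-by-osd "osd.${osdid}" --format=json |
              jq -r '.[] | select(.state != "active+clean") | .pgid')

    if [ -n "$unclean" ]; then
        echo "osd.${osdid} still has non-clean PGs, skipping:"
        echo "$unclean"
        exit 1
    fi
    # NOTE: this is only a snapshot -- a PG can start backfilling between this
    # check and the reformat, which is exactly the window that bit us.
    echo "osd.${osdid}: all PGs active+clean"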

This cluster serves CephFS and a few RBD volumes.

mailing list submissions related to this outage:
cephfs-data-scan pg_files errors
finding and manually recovering objects in bluestore
Determine cephfs paths and rados objects affected by incomplete pg

Our recovery
1) We allowed the cluster to repair itself as much as possible.

2) Following self-healing, we were left with 3 incomplete PGs: 2 in the CephFS data pool and 1 in an RBD pool.
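
For anyone in the same spot, the incomplete PGs can be enumerated with something along these lines (the pool id is the part of the pgid before the dot):

    ceph health detail | grep incomplete
    ceph pg dump_stuck inactive | grep incomplete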

3) Using ceph pg ${pgid} query, we found all disks known to have recently contained some of that PG's data.
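
In practice that means dumping the query output and pulling the candidate OSDs out of it. A sketch with jq, assuming the Luminous field names (peer_info, probing_osds, down_osds_we_would_probe) -- verify against your own output, and note the pgid here is a made-up example:

    pgid=1.2ab    # hypothetical example
    ceph pg ${pgid} query > /tmp/query.${pgid}.json
    jq '{up, acting, peers: [.peer_info[].peer]}' /tmp/query.${pgid}.json
    jq '.recovery_state[] | select(.probing_osds) |
        {probing_osds, down_osds_we_would_probe}' /tmp/query.${pgid}.json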

4) For each OSD listed in the pg query, we exported the remaining PG data using ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osdid}/ --pgid ${pgid} --op export --file /media/ceph_recovery/ceph-${osdid}/recover.${pgid}
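
One detail worth spelling out: ceph-objectstore-tool only works against a stopped OSD, so each export means taking that OSD down briefly. A rough sketch of how this step can be scripted (the OSD ids, pgid, and recovery mount point are placeholders):

    pgid=1.2ab                      # hypothetical
    ceph osd set noout              # keep the cluster from rebalancing while OSDs are down
    for osdid in 12 47 201; do      # the OSDs named in the pg query
        systemctl stop ceph-osd@${osdid}
        mkdir -p /media/ceph_recovery/ceph-${osdid}
        # filestore/XFS OSDs may also need --journal-path pointing at the OSD journal
        ceph-objectstore-tool \
            --data-path /var/lib/ceph/osd/ceph-${osdid}/ \
            --pgid ${pgid} \
            --op export \
            --file /media/ceph_recovery/ceph-${osdid}/recover.${pgid}
        systemctl start ceph-osd@${osdid}
    done
    ceph osd unset noout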

5) After collecting all of the possible exports, we compared the recovery files and chose the largest. I would have appreciated the ability to do a merge of some sort on these exports, but we'll take what we can get. We're just going to assume the largest export was the most complete backfill at the time disaster struck.
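
Since size was our only proxy for completeness, the comparison can be as crude as listing the exports largest-first (assuming they have been gathered where one box can see them):

    ls -lS /media/ceph_recovery/ceph-*/recover.${pgid}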

6) We removed the nearly empty PG from the acting OSDs using ceph-objectstore-tool --op remove --data-path /var/lib/ceph/osd/ceph-${osdid} --pgid ${pgid}

7) We imported the largest export we had into the acting OSDs for the PG.
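
The matching import is along these lines, again run with the target OSD stopped (${source_osdid} here is a placeholder for whichever OSD's export we kept from step 5):

    ceph-objectstore-tool \
        --data-path /var/lib/ceph/osd/ceph-${osdid}/ \
        --op import \
        --file /media/ceph_recovery/ceph-${source_osdid}/recover.${pgid}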

8) We marked the PG as complete on the acting primary using ceph-objectstore-tool --op mark-complete --data-path /var/lib/ceph/osd/ceph-${osdid}/ --pgid ${pgid}

9) We were convinced that it would be possible for multiple exports of the same partially backfilled PG to contain different objects. As such, we started reverse engineering the export file format so we could extract the objects from each export and compare them.

10) While our resident reverse engineer was hard at work, focus shifted toward tooling to identify corrupt files and RBDs, and the appropriate action to take for each:
10a) A list of all rados objects was dumped for our most valuable data (CephFS). Our first detection mechanism is a skip in the object sequence numbers belonging to a file's inode.
10b) Because our metadata pool was unaffected by this mess, we are trusting that ls reports correct file sizes even for corrupt files. From the size we can calculate how many objects should make up the file; if the count of objects for that file's inode is less than that, there's a problem. More than the calculated amount??? The world definitely explodes.
10c) Finally, the saddest check is whether there are no objects in rados for that inode at all. (A rough sketch of these checks follows below.)
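
A rough sketch of these per-file checks, assuming the default CephFS file layout (4 MiB objects), the usual data object naming of "<inode hex>.<index hex>", and a pre-dumped object listing; the pool name and paths are placeholders:

    #!/bin/bash
    # Sketch of the checks in 10a-10c.  Files with a custom layout need the
    # real object_size from their ceph.file.layout xattr, and sparse files
    # can legitimately have fewer objects than their size implies.
    objlist=/tmp/cephfs_data.objects     # output of: rados -p cephfs_data ls
    objsize=4194304                      # 4 MiB default object size
    path=$1

    size=$(stat -c %s "$path")
    inode_hex=$(printf '%x' "$(stat -c %i "$path")")

    expected=$(( (size + objsize - 1) / objsize ))
    found=$(grep -c "^${inode_hex}\." "$objlist")

    echo "$path: inode 0x${inode_hex}, size ${size}, expected ~${expected} objects, found ${found}"
    if [ "$found" -eq 0 ] && [ "$size" -gt 0 ]; then
        echo "  -> no objects at all in rados for this inode (10c, worst case)"
    elif [ "$found" -lt "$expected" ]; then
        echo "  -> fewer objects than the size implies (10a/10b)"
    elif [ "$found" -gt "$expected" ]; then
        echo "  -> more objects than the size implies -- world explodes (10b)"
    fi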

That's where we are right now. I'll update this thread as we get closer to recovery from backups and accepting data loss if necessary.

I will note that we wish there were some documentation on using ceph-objectstore-tool. We understand that it's for emergencies, but that's when concise documentation is most important. From what we've found, the only documentation seems to be --help and the source code.
