All,
I was called in to assist with a failed Ceph environment; the cluster is
in an inoperable state. No rbd volumes are mountable/exportable due to
missing PGs.
The previous operator was using a replica count of 2. The cluster
suffered a power outage and various non-catastrophic hardware issues as
they were starting it back up. At some point during recovery, drives
were removed from the cluster leaving several PGs missing.
Efforts to restore the missing PGs from the data on the removed drives
failed using the process detailed in a Red Hat Customer Support blog
post [0]. When the OSDs holding the recovered PGs are started, they
segfault, halting progress. The original operator isn't clear on when,
but a software upgrade may have been applied after the drives were pulled.
I believe the cluster may be irrecoverable at this point.
My recovery assistance has focused on a plan to:
1) Scrape all objects for several key rbd volumes from live OSDs and the
removed former OSD drives.
2) Compare and deduplicate the two copies of each object.
3) Recombine the objects for each volume into a raw image.
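For context, my step 2 compare/dedupe pass boils down to roughly the
following (a hypothetical sketch, not my exact tooling; directory layout
and names are placeholders):

```python
# Hypothetical sketch of the step-2 compare/dedupe pass: for each object
# name present on both the live OSDs and the pulled drives, keep a single
# copy when the contents match, and flag the object for manual review
# when the two copies diverge.
import hashlib
import os


def sha256_of(path):
    # Hash incrementally so large 4 MB objects don't have to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def dedupe(live_dir, pulled_dir):
    keep, conflicts = {}, []
    names = set(os.listdir(live_dir)) | set(os.listdir(pulled_dir))
    for name in sorted(names):
        a = os.path.join(live_dir, name)
        b = os.path.join(pulled_dir, name)
        if os.path.exists(a) and os.path.exists(b):
            if sha256_of(a) == sha256_of(b):
                keep[name] = a
            else:
                conflicts.append(name)  # divergent copies need a human call
        else:
            # Only one copy survived; take whichever side has it.
            keep[name] = a if os.path.exists(a) else b
    return keep, conflicts
```

Objects that exist on both sides but hash differently are the worrying
case with replica 2, since there is no third copy to break the tie.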
I have completed steps 1 and 2 with apparent success. My initial stab at
step 3 yielded a raw image that could be mounted and showed signs of a
filesystem, but its contents could not be read. Could anyone assist me with the
following questions?
1) Are the rbd objects in order by filename? If not, what is the method
to determine their order?
2) How should objects smaller than the default 4MB chunk size be
handled? Should they be padded somehow?
3) If any objects were completely missing and therefore unavailable to
this process, how should they be handled? I assume we need to offset/pad
to compensate.
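For reference, here is roughly what my step 3 reassembly is converging
toward, written as a sketch. It assumes format-2 object names ending in a
zero-padded 16-digit hex chunk index (e.g. rbd_data.<id>.0000000000000005),
the default 4 MiB object size, and that short or missing objects should
read back as zeros. Please correct me if any of those assumptions are
wrong:

```python
# Sketch of step-3 reassembly under the assumptions stated above.
# Each object is written at (chunk index * object size); gaps left by
# missing objects stay sparse, so they read back as zeros.
import os

OBJECT_SIZE = 4 * 1024 * 1024  # assumed default 4 MiB objects (order 22)


def reassemble(obj_dir, out_path):
    # Parse the trailing hex index rather than trusting lexical filename
    # order, in case any leading zeros were lost along the way.
    objs = {}
    for name in os.listdir(obj_dir):
        idx = int(name.rsplit(".", 1)[-1], 16)
        objs[idx] = os.path.join(obj_dir, name)
    with open(out_path, "wb") as out:
        for idx in sorted(objs):
            with open(objs[idx], "rb") as f:
                data = f.read()
            # Absolute placement: seeking past the end of the file leaves
            # a hole, which zero-pads short objects and missing chunks.
            out.seek(idx * OBJECT_SIZE)
            out.write(data)
```

One caveat I'm aware of: the true image size comes from the rbd header,
which this sketch does not consult, so a trailing run of never-written
objects would leave the image short.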
--
Thanks,
Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250
M: 317-490-3018
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com