Healthy objects trapped in incomplete pgs


 



Dear Cephers,


A few days ago disaster struck the erasure-coded Ceph cluster I administer: UPS power was pulled from the cluster, causing a power outage.


After rebooting the system, 6 osds were lost (spread over 5 osd nodes) because they could no longer mount, and several others were damaged. This was more than the host failure domain was set up to handle, so auto-recovery failed and osds started going down in a cascading manner.


When the dust settled, there were 8 pgs (of 2048) inactive and a bunch of osds down. I managed to recover 5 pgs, mainly by ceph-objectstore-tool export/import/repair commands, but now I am left with 3 pgs that are inactive and incomplete.


One of the pgs seems unsalvageable, as I cannot get it to become active at all (repair/import/export/lowering min_size), but I can get the two others active if I export/import one of the pg shards and restart the osd.
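
For reference, the export/import of a single shard looks roughly like the sketch below. It is a dry run that only prints the commands. The pg shard id, osd ids, dump path, the default /var/lib/ceph/osd/ceph-<id> data path and the systemd unit names are all placeholders and assumptions (a non-containerized deployment), not the real values from my cluster.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of running them.
# All values are hypothetical placeholders, not taken from the real cluster.
PGID="${PGID:-20.1fs2}"      # EC pg shard id: <pool>.<pg> plus s<N> shard suffix
SRC_OSD="${SRC_OSD:-12}"     # osd that still holds a readable copy of the shard
DST_OSD="${DST_OSD:-34}"     # osd that should receive the shard
DUMP="${DUMP:-/mnt/rescue/${PGID}.export}"

# Assumed default data path of a non-containerized osd.
osd_path() { printf '/var/lib/ceph/osd/ceph-%s' "$1"; }

shard_copy_plan() {
    # ceph-objectstore-tool needs exclusive access to the store,
    # so both osds have to be stopped before it runs.
    echo "systemctl stop ceph-osd@${SRC_OSD} ceph-osd@${DST_OSD}"
    echo "ceph-objectstore-tool --data-path $(osd_path "$SRC_OSD") --pgid ${PGID} --op export --file ${DUMP}"
    echo "ceph-objectstore-tool --data-path $(osd_path "$DST_OSD") --op import --file ${DUMP}"
    echo "systemctl start ceph-osd@${DST_OSD}"
}

shard_copy_plan
```

Printing first makes it easy to review the osd ids and paths before anything touches the object store.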


Rebuilding then starts, but after a while one of the osds holding the pgs goes down with a "FAILED ceph_assert(clone_size.count(clone))" message in the log.

If I set noout and nodown, I can see that only a few objects, e.g. 161 in a pg of >100000, fail to be remapped.


Since most of the objects in the two pgs seem intact, it would be sad to delete the whole pg (force-create-pg) and lose all that data.


Is there a way to show and delete the failing objects?


I have thought of a recovery plan and want to share it with you, so you can comment on whether it sounds doable:


  *   Stop osds from recovering: ceph osd set norecover
  *   Bring pgs back to active: ceph-objectstore-tool export/import and restart osd
  *   Find files in pgs: cephfs-data-scan pg_files <path> <pg id>
  *   Pull out as many of those files as possible to another location
  *   Recreate pgs: ceph osd force-create-pg <pgid>
  *   Restart recovery: ceph osd unset norecover
  *   Copy the recovered files back in
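
To make the order of the steps concrete, here is the plan as a dry-run outline that only prints the commands. The pg ids, the CephFS mount point, the rescue location and the use of rsync for the copy steps are my own placeholders and assumptions. I am also assuming that pg_files prints paths relative to the file system root, so that they can be fed to rsync --files-from against the mount point.

```shell
#!/bin/sh
# Dry-run outline of the plan above: it only prints commands, nothing is executed.
# Every value here is a hypothetical placeholder.
PGIDS="${PGIDS:-20.1f 20.3a}"      # the two incomplete pgs (no shard suffix)
FS_PATH="${FS_PATH:-/}"            # CephFS path to scan
MOUNT="${MOUNT:-/mnt/cephfs}"      # where the file system is mounted (assumption)
RESCUE="${RESCUE:-/mnt/rescue}"    # storage outside the affected pgs

recovery_plan() {
    echo "ceph osd set norecover"
    echo "# export/import one shard per pg with ceph-objectstore-tool, then restart the osd"
    # pg_files walks the file system as a normal client, so the MDS has to be up.
    echo "cephfs-data-scan pg_files ${FS_PATH} ${PGIDS} > ${RESCUE}/pg_files.txt"
    # rsync strips leading slashes from --files-from entries and resolves
    # them relative to the source directory.
    echo "rsync -a --files-from=${RESCUE}/pg_files.txt ${MOUNT}/ ${RESCUE}/files/"
    for pg in $PGIDS; do
        # Recent releases may additionally ask for a confirmation flag here.
        echo "ceph osd force-create-pg ${pg}"
    done
    echo "ceph osd unset norecover"
    echo "rsync -a ${RESCUE}/files/ ${MOUNT}/"
}

recovery_plan
```

The point of the ordering is that force-create-pg only runs after the file list and the copies exist, since everything still in those pgs is gone afterwards.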


Would that work or do you have a better suggestion?


Cheers,

Jesper


--------------------------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: jelka@xxxxxxxxx
Tlf:    +45 50906203



