Re: Recovering incomplete PGs with ceph_objectstore_tool

Congrats Chris and nice "save" on that RBD!

--
Paul 

> On Apr 9, 2015, at 11:11 AM, Chris Kitzmiller <ckitzmiller@xxxxxxxxxxxxx> wrote:
> 
> Success! Hopefully my notes from the process will help:
> 
> In the event of multiple disk failures the cluster can lose PGs. Should this occur, it is best to attempt to restart the OSD process and have the drive marked as up+out. Marking the drive as out will cause its data to flow off to elsewhere in the cluster. If the ceph-osd process is unable to keep running, you can try using the ceph_objectstore_tool program to extract just the damaged PGs and import them onto working OSDs.
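> 
> A rough sketch of that first approach, with OSD 15 as a stand-in (the service syntax here follows the upstart commands used later in these notes):
> stop ceph-osd id=15           # restart the wedged daemon cleanly
> start ceph-osd id=15
> ceph osd out 15               # mark it out so its data drains off to the rest of the cluster
> ceph -w                       # watch recovery/backfill progress
> ceph pg dump_stuck inactive   # list any PGs that still can't recover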
> 
> Fixing Journals
> In this particular scenario things were complicated by the fact that ceph_objectstore_tool came out in Giant while we were running Firefly. Since we didn't want to upgrade the cluster in a degraded state, the OSD drives had to be moved to a different physical machine for repair. This added a number of steps related to the journals, but it wasn't a big deal. That process looks like:
> 
> On Storage1:
> stop ceph-osd id=15
> ceph-osd -i 15 --flush-journal
> ls -l /var/lib/ceph/osd/ceph-15/journal
> 
> Note the journal device's partition UUID, then pull the disk and move it to Ithome:
> rm /var/lib/ceph/osd/ceph-15/journal
> ceph-osd -i 15 --mkjournal
> 
> That creates a colocated journal to use during the ceph_objectstore_tool commands. Once done:
> ceph-osd -i 15 --flush-journal
> rm /var/lib/ceph/osd/ceph-15/journal
> 
> Pull the disk and bring it back to Storage1. Then:
> ln -s /dev/disk/by-partuuid/b4f8d911-5ac9-4bf0-a06a-b8492e25a00f /var/lib/ceph/osd/ceph-15/journal
> ceph-osd -i 15 --mkjournal
> start ceph-osd id=15
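> 
> It's worth confirming that the OSD rejoined cleanly before moving on; a minimal check (not part of the original steps) is:
> ceph osd tree | grep -w osd.15   # should show the OSD as up
> ceph -s                          # wait for PG states to settle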
> 
> None of this will be needed once the cluster is running Hammer, because ceph_objectstore_tool will be available on the local machine and the journals can be kept in place throughout the process.
> 
> 
> Recovery Process
> We were missing two PGs, 3.c7 and 3.102. These PGs were hosted on OSD.0 and OSD.15, the two disks which failed out of Storage1. The disk for OSD.0 seemed to be a total loss, while the disk for OSD.15 was somewhat more cooperative but in no shape to be up and running in the cluster. I took the dying OSD.15 drive and placed it into a new physical machine with a fresh install of Ceph Giant. Using Giant's ceph_objectstore_tool I was able to extract the PGs with a command like:
> for i in 3.c7 3.102 ; do ceph_objectstore_tool --data /var/lib/ceph/osd/ceph-15 --journal /var/lib/ceph/osd/ceph-15/journal --op export --pgid $i --file ~/${i}.export ; done
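> 
> Before going further it's worth sanity-checking the exports, e.g.:
> ls -lh ~/3.c7.export ~/3.102.export   # both files should exist and be non-trivially sized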
> 
> Once both PGs were successfully exported, I attempted to import them into a new temporary OSD, following instructions from here. For some reason that didn't work: the OSD was up+in but wasn't backfilling the PGs into the cluster. If you find yourself in this situation I would still try that first, just in case it provides a cleaner path.
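> 
> For reference, that temporary-OSD route looks roughly like the sketch below (the standard manual add-an-OSD steps plus an import; the CRUSH host name here is a placeholder):
> osdid=$(ceph osd create)                             # allocate a fresh OSD id
> mkdir -p /var/lib/ceph/osd/ceph-$osdid
> ceph-osd -i $osdid --mkfs --mkkey                    # create an empty OSD store
> ceph auth add osd.$osdid osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-$osdid/keyring
> ceph osd crush add osd.$osdid 1.0 host=ithome        # give it CRUSH weight so it can backfill out
> for j in 3.c7 3.102 ; do ceph_objectstore_tool --data /var/lib/ceph/osd/ceph-$osdid --journal /var/lib/ceph/osd/ceph-$osdid/journal --op import --file ~/${j}.export ; done
> start ceph-osd id=$osdid
> 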
> Considering that the above didn't work and that we were looking at the possibility of losing the RBD volume (or, perhaps worse, fruitlessly fscking 35TB), I took what I might describe as heroic measures:
> 
> Running
> ceph pg dump | grep incomplete
> 
> 3.c7   0  0  0  0  0  0  0  incomplete  2015-04-02  20:49:32.968841  0'0  15730:17  [15,0]  15  [15,0]  15  13985'54076  2015-03-31  19:14:22.721695  13985'54076  2015-03-31  19:14:22.721695
> 3.102  0  0  0  0  0  0  0  incomplete  2015-04-02  20:49:32.529594  0'0  15730:21  [0,15]  0   [0,15]  0   13985'53107  2015-03-29  21:17:15.568125  13985'49195  2015-03-24  18:38:08.244769
> 
> Then I stopped all OSDs, which blocked all I/O to the cluster, with:
> stop ceph-osd-all
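> 
> (If you try this yourself, setting noout first is cheap insurance against the monitors marking the stopped OSDs out while the disks are away:)
> ceph osd set noout     # before stopping the OSDs
> ceph osd unset noout   # once everything is back up at the end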
> 
> Then I looked for all copies of the PG on all OSDs with:
> for i in 3.c7 3.102 ; do find /var/lib/ceph/osd/ -maxdepth 3 -type d -name "$i" ; done | sort -V
> 
> /var/lib/ceph/osd/ceph-0/current/3.c7_head
> /var/lib/ceph/osd/ceph-0/current/3.102_head
> /var/lib/ceph/osd/ceph-3/current/3.c7_head
> /var/lib/ceph/osd/ceph-13/current/3.102_head
> /var/lib/ceph/osd/ceph-15/current/3.c7_head
> /var/lib/ceph/osd/ceph-15/current/3.102_head
> 
> Then I flushed the journals for all of those OSDs with:
> for i in 0 3 13 15 ; do ceph-osd -i $i --flush-journal ; done
> 
> Then I removed all of those drives and moved them (using the Journal Fixing process above) to Ithome, where I used ceph_objectstore_tool to remove all traces of 3.102 and 3.c7:
> for i in 0 3 13 15 ; do for j in 3.c7 3.102 ; do ceph_objectstore_tool --data /var/lib/ceph/osd/ceph-$i --journal /var/lib/ceph/osd/ceph-$i/journal --op remove --pgid $j ; done ; done
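> 
> Depending on the build, --op list-pgs can confirm the PGs are really gone from each disk before re-importing; something like:
> for i in 0 3 13 15 ; do ceph_objectstore_tool --data /var/lib/ceph/osd/ceph-$i --journal /var/lib/ceph/osd/ceph-$i/journal --op list-pgs | grep -E '^3\.(c7|102)$' ; done   # should print nothing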
> 
> Then I imported the PGs onto OSD.0 and OSD.15 with:
> for i in 0 15 ; do for j in 3.c7 3.102 ; do ceph_objectstore_tool --data /var/lib/ceph/osd/ceph-$i --journal /var/lib/ceph/osd/ceph-$i/journal --op import --file ~/${j}.export ; done ; done
> for i in 0 15 ; do ceph-osd -i $i --flush-journal && rm /var/lib/ceph/osd/ceph-$i/journal ; done
> 
> Then I moved the disks back to Storage1 and started them all back up again. I think this should have worked, but in this case OSD.0 didn't start up for some reason. I initially thought that wouldn't matter because OSD.15 did start, so we should have had everything, but a ceph pg query of the PGs showed something like:
> "blocked": "peering is blocked due to down osds",
> "down_osds_we_would_probe": [0],
> "peering_blocked_by": [{
>     "osd": 0,
>     "current_lost_at": 0,
>     "comment": "starting or marking this osd lost may let us proceed"
> }]
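> 
> That "starting or marking this osd lost may let us proceed" hint points at the fix. For reference, marking a dead OSD lost and then removing it goes roughly:
> ceph osd lost 0 --yes-i-really-mean-it   # only if peering is blocked waiting on it
> ceph osd out 0
> ceph osd crush remove osd.0
> ceph auth del osd.0
> ceph osd rm 0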
> 
> So I then removed OSD.0 from the cluster and everything came back to life. Thanks to Jean-Charles Lopez, Craig Lewis, and Paul Evans!
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



