On 15/07/15 10:55, Jelle de Jong wrote:
> On 13/07/15 15:40, Jelle de Jong wrote:
>> I was testing a ceph cluster with osd_pool_default_size = 2, and while
>> rebuilding the OSD on one ceph node, a disk in another node started
>> getting read errors and ceph kept taking that OSD down. Instead of
>> executing ceph osd set nodown while the other node was rebuilding, I
>> kept restarting the OSD for a while; ceph would take the OSD in for a
>> few minutes and then take it back down again.
>>
>> I then removed the bad OSD from the cluster and later added it back in
>> with the nodown flag set and a weight of zero, moving all the data away.
>> Then I removed the OSD again and added a new OSD with a new hard drive.
>>
>> However, I ended up with the following cluster status and I can't seem
>> to find out how to get the cluster healthy again. I'm doing this as a
>> test before taking this ceph configuration into further production.
>>
>> http://paste.debian.net/plain/281922
>>
>> If I lost data, my bad, but how can I figure out in which pool the data
>> was lost and in which rbd volume (i.e. which kvm guest lost data)?
>
> Can anybody help?
>
> Can I somehow reweight some OSDs to resolve the problems? Or should I
> rebuild the whole cluster and lose all the data?

# ceph pg 3.12 query
http://paste.debian.net/284812/

I used ceph pg force_create_pg x.xx on all the incomplete pgs and I don't
have any stuck pgs any more, but there are still incomplete ones.

# ceph health detail
http://paste.debian.net/284813/

How can I get the incomplete pgs active again?

Kind regards,

Jelle de Jong
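
PS: for the pool/rbd question above, this is the kind of procedure I was
thinking of trying (just a rough sketch: the pool name "rbd", the image
name "vm-disk-1" and the rbd_data prefix are only examples, and it assumes
format 2 images; format 1 images use the rb.0.* prefix shown by rbd info
instead):

    # the number before the dot in a pg id is the pool id,
    # so pg 3.12 lives in pool 3
    ceph osd lspools
    ceph health detail | grep incomplete

    # for each rbd image in that pool, look up its object name prefix
    rbd ls -p rbd
    rbd info rbd/vm-disk-1    # note the block_name_prefix, e.g. rbd_data.1234567890ab

    # check whether any of that image's objects map onto the incomplete pg
    rados -p rbd ls | grep rbd_data.1234567890ab | while read obj; do
        ceph osd map rbd "$obj"
    done | grep '(3\.12)'

Any image whose objects show up for that pg would be one of the kvm guests
that lost data, if I understand the object naming correctly.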