On 15/07/2015 10:55, Jelle de Jong wrote:
> On 13/07/15 15:40, Jelle de Jong wrote:
>> I was testing a ceph cluster with osd_pool_default_size = 2, and while
>> rebuilding the OSD on one ceph node a disk in another node started
>> getting read errors, and ceph kept taking the OSD down. Instead of
>> executing ceph osd set nodown while the other node was rebuilding, I
>> kept restarting the OSD for a while; ceph would take the OSD in for a
>> few minutes and then take it back down.
>>
>> I then removed the bad OSD from the cluster and later added it back in
>> with the nodown flag set and a weight of zero, moving all the data
>> away. Then I removed the OSD again and added a new OSD with a new hard
>> drive.
>>
>> However, I ended up with the following cluster status and I can't seem
>> to find out how to get the cluster healthy again. I'm doing this as a
>> test before taking this ceph configuration further into production.
>>
>> http://paste.debian.net/plain/281922
>>
>> If I lost data, my bad, but how could I figure out in which pool the
>> data was lost and in which rbd volume (so which kvm guest lost data)?
> Anybody that can help?
>
> Can I somehow reweight some OSDs to resolve the problems? Or should I
> rebuild the whole cluster and lose all the data?

If your min_size is 2, try setting it to 1 and restart each of your OSDs.
If ceph -s doesn't show any progress repairing your data, you'll have to
either get the developers to help salvage what can be recovered from your
disks, or rebuild the cluster with size=3 and restore your data.

Lionel
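
A rough sketch of what that could look like on the command line (pool,
image and OSD names below are placeholders, and the restart command
depends on your init system):

    # lower min_size on each affected pool, then restart the OSDs
    ceph osd lspools
    ceph osd pool get <poolname> min_size
    ceph osd pool set <poolname> min_size 1
    service ceph restart osd.<id>    # or: systemctl restart ceph-osd@<id>

    # watch whether recovery makes any progress
    ceph -s
    ceph -w

As for finding out which pool and which rbd volume lost data: the PG ids
reported by ceph health detail are prefixed with the pool id (pg 4.2f
belongs to the pool with id 4 in ceph osd lspools), and rbd info shows
each image's block_name_prefix, which you can match against the object
names reported for the damaged PGs:

    ceph health detail
    ceph pg dump_stuck unclean
    ceph pg <pgid> query
    ceph pg <pgid> list_missing
    rbd ls -p <poolname>
    rbd info <poolname>/<imagename>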