On 13/07/15 15:40, Jelle de Jong wrote:
> I was testing a Ceph cluster with osd_pool_default_size = 2, and while
> rebuilding the OSD on one Ceph node, a disk in another node started
> getting read errors and Ceph kept marking that OSD down. Instead of
> executing ceph osd set nodown while the other node was rebuilding, I
> kept restarting the OSD for a while; Ceph would take the OSD back in
> for a few minutes and then mark it down again.
>
> I then removed the bad OSD from the cluster and later added it back in
> with the nodown flag set and a weight of zero, moving all the data
> away. Then I removed the OSD again and added a new OSD with a new hard
> drive.
>
> However, I ended up with the following cluster status, and I can't
> seem to find out how to get the cluster healthy again. I'm running
> these tests before taking this Ceph configuration into production.
>
> http://paste.debian.net/plain/281922
>
> If I lost data, my bad, but how could I figure out in which pool the
> data was lost and in which RBD volume (i.e. which KVM guest lost data)?

Can anybody help? Can I somehow reweight some OSDs to resolve the
problems? Or should I rebuild the whole cluster and lose all data?

Kind regards,

Jelle de Jong
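
For reference, a minimal sketch of how one might approach both questions
(draining a suspect OSD, and narrowing down which pool and RBD image a
broken PG belongs to), assuming a standard RBD-on-Ceph setup with the
stock ceph/rados/rbd command-line tools. The OSD id "12", the pool name,
the image name, and the object name below are placeholders, not values
taken from the paste, and whether any of this is appropriate depends on
the actual PG states shown there.

  # Drain a suspect OSD by moving its data elsewhere before removal
  # (osd id 12 is a placeholder); wait for backfill to finish, then
  # mark it out.
  ceph osd crush reweight osd.12 0
  ceph osd out 12

  # See which PGs are unhealthy; the number before the dot in a PG id
  # (e.g. the "2" in "2.1f") is the pool id.
  ceph health detail
  ceph pg dump_stuck unclean

  # Translate pool ids to pool names.
  ceph osd lspools

  # For a suspect pool, list the RBD images and note each image's
  # block_name_prefix; all of an image's data objects start with it.
  rbd -p <poolname> ls
  rbd -p <poolname> info <imagename> | grep block_name_prefix

  # Check which PG (and OSDs) one of those data objects maps to; if it
  # lands in one of the broken PGs, that image (and its KVM guest) is
  # affected.
  rados -p <poolname> ls | grep <block_name_prefix> | head
  ceph osd map <poolname> <object-name-from-previous-command>

Listing every object with rados ls can be slow on a large pool; it is
only meant as a spot check to see which images have objects in the
affected PGs, not as something to run routinely.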