how to recover from: 1 pgs down; 10 pgs incomplete; 10 pgs stuck inactive; 10 pgs stuck unclean

Hello everybody,

I was testing a Ceph cluster with osd_pool_default_size = 2. While I was
rebuilding the OSD on one Ceph node, a disk in another node started
getting read errors and Ceph kept marking that OSD down. Instead of
setting the nodown flag (ceph osd set nodown) while the other node was
rebuilding, I kept restarting the OSD for a while; Ceph would take it
back in for a few minutes and then mark it down again.
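
What I should probably have done is something along these lines (a
rough sketch of the standard ceph CLI, not the exact commands I ran at
the time):

    ceph osd set nodown     # prevent OSDs from being marked down during the rebuild
    ceph -w                 # watch recovery/backfill progress
    ceph osd unset nodown   # clear the flag once the rebuild has finished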

I then removed the bad OSD from the cluster and later added it back in
with the nodown flag set and a weight of zero, so that all the data was
moved off it. Then I removed the OSD again and added a new OSD with a
new hard drive.
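
The drain-and-replace sequence was roughly the following (reconstructed
from memory, so treat the exact commands as a sketch; osd.12 is just a
placeholder id):

    ceph osd crush reweight osd.12 0   # CRUSH weight zero, data migrates off the OSD
    ceph -w                            # wait for backfill to finish
    ceph osd out osd.12
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12                     # then create a fresh OSD on the new drive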

However, I ended up with the following cluster status and I can't
figure out how to get the cluster healthy again. I'm running these
tests before putting this Ceph configuration into further production.

http://paste.debian.net/plain/281922
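
In case it helps, these are the kind of commands I have been using to
look at the stuck pgs (a sketch; <pgid> is a placeholder for one of the
incomplete pg ids from the status above):

    ceph health detail                    # lists the down/incomplete/stuck pgs
    ceph pg dump_stuck inactive unclean   # dump the stuck placement groups
    ceph pg <pgid> query                  # detailed state/history of a single pg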

If I lost data, that's my own fault, but how can I figure out in which
pool the data was lost, and in which RBD volume (i.e. which KVM guest
lost data)?
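
My guess is that the mapping works something like this, but please
correct me if I am wrong (a sketch; pg 3.1f and vm-101-disk-1 are
made-up examples):

    # the pool id is the part of the pg id before the dot, e.g. pg 3.1f -> pool 3
    ceph osd lspools                       # map the pool id to a pool name
    ceph pg 3.1f list_missing              # objects missing/unfound in that pg
    rbd -p <poolname> ls                   # rbd images in that pool
    rbd -p <poolname> info vm-101-disk-1   # block_name_prefix matches the object names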

Kind regards,

Jelle de Jong


