Hi,
I am trying to understand what happens when an OSD fails.
A few days back I wanted to check what happens when an OSD goes down, so I
went to the node and stopped one of the OSD services. The OSD was marked
down and out, the PGs started recovering, and after some time everything
looked fine: all data had been recovered while the OSD stayed DOWN and OUT.
I thought, great, I don't really have to worry about losing data when an
OSD goes down.
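For reference, this is roughly what I did for that test (assuming a
systemd-managed cluster; osd.3 is just an example id, not the real one):

  # stop one OSD daemon on its node
  systemctl stop ceph-osd@3

  # watch the OSD get marked down/out and the PGs recover
  ceph -s
  ceph -w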
But recently an OSD went down on its own, and this time the PGs were not
able to recover: they went into the down state and everything was stuck,
so I had to run this command:

ceph osd lost <osd_number>

which is not really safe, and I might lose data that way.
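For reference, this is roughly what I ran (the ids are examples, not the
real ones, and the inspection commands are just how I usually check stuck
PGs):

  # see which PGs are down/stuck and why
  ceph health detail
  ceph pg dump_stuck inactive
  ceph pg 1.2f query                      # example pg id

  # what I finally ran to unblock things (the unsafe part)
  ceph osd lost 3 --yes-i-really-mean-it  # example osd id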
I am not able to understand why this did not happen when I stopped the
service the first time, but did happen when the OSD failed on its own.
Since with RF2/EC 2+1 every OSD's data is replicated/erasure coded onto
other OSDs, the cluster should ideally have come back to a healthy state
on its own.
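In case it is relevant, these are the kinds of commands I would check the
pool settings with (the pool and profile names below are just examples):

  # replicated pool: replication factor and minimum replicas to serve I/O
  ceph osd pool get rbd size
  ceph osd pool get rbd min_size

  # EC pool: which erasure-code profile it uses, and its k/m
  ceph osd pool get ecpool erasure_code_profile
  ceph osd erasure-code-profile get default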
Can someone please explain what I am missing here?
Should I worry about putting my production data in this cluster?
Thanks