Hi,
I am trying to understand what happens when an OSD fails.
A few days back I wanted to check what happens when an OSD goes down, so I
went to the node and stopped one of the OSD services. The OSD was marked
down and out, the PGs started recovering, and after some time everything
looked fine: all data had been recovered while the OSD stayed DOWN and OUT.
I thought, great, I don't really have to worry about losing data when an
OSD goes down.
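For reference, this is roughly what I did for that test (assuming a
systemd-managed cluster; osd.3 is just an example id, not the real one):

  # stop one OSD daemon on its node
  systemctl stop ceph-osd@3

  # watch the OSD get marked down/out and the PGs recover
  ceph -s
  ceph -w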
But recently an OSD went down on its own, and this time the PGs were not
able to recover: they went into the down state and everything was stuck,
so I had to run this command:

ceph osd lost <osd_number>

which is not really safe, and I might lose data that way.
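For reference, this is roughly what I ran (the ids are examples, not the
real ones, and the inspection commands are just how I usually check stuck
PGs):

  # see which PGs are down/stuck and why
  ceph health detail
  ceph pg dump_stuck inactive
  ceph pg 1.2f query                      # example pg id

  # what I finally ran to unblock things (the unsafe part)
  ceph osd lost 3 --yes-i-really-mean-it  # example osd id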
I am not able to understand why this did not happen when I stopped the
service the first time, but did happen when the OSD failed on its own.
Since with RF2/EC 2+1 every OSD's data is replicated/erasure coded onto
other OSDs, the cluster should ideally have come back to a healthy state
on its own.
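In case it is relevant, these are the kinds of commands I would check the
pool settings with (the pool and profile names below are just examples):

  # replicated pool: replication factor and minimum replicas to serve I/O
  ceph osd pool get rbd size
  ceph osd pool get rbd min_size

  # EC pool: which erasure-code profile it uses, and its k/m
  ceph osd pool get ecpool erasure_code_profile
  ceph osd erasure-code-profile get default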
Can someone please explain what I am missing here?
Should I worry about putting my production data in this cluster?
Thanks