Hello Ceph community,
We need some immediate help: our cluster is in a very strange and bad state after an unexpected reboot of many OSD nodes within a short time frame. The cluster has 195 OSDs across 9 OSD nodes, originally running version 0.80.5.
After a datacenter issue, at least 5 OSD nodes rebooted. Not all OSDs came back up after the reboot, which triggered a lot of recovery, and many PGs went into dead / incomplete states. When we tried to restart the OSDs, we found they kept crashing with the error "FAILED assert(log.head >= olog.tail && olog.head >= log.tail)". We upgraded to 0.80.7, which includes the fix for #9482, but we still see the error, with different behavior on each version:

0.80.5: once an OSD crashes with this error, every attempt to restart it ends in the same crash.
0.80.7: the OSD can be restarted, but after some time another OSD crashes with the same error.

We also tried setting the nobackfill and norecover flags, but that didn't help. The cluster is now stuck and we cannot bring more OSDs back. Any suggestions on how we might recover the cluster?

Many thanks,
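For reference, here is a sketch of the flag commands we used. They are echoed rather than executed here, since running them requires a live cluster with an admin keyring:

```shell
#!/bin/sh
# Dry-run sketch: print the flag-setting commands we ran to pause
# backfill and recovery (echoed, not executed, since they need a
# live Ceph cluster and admin credentials).
for flag in nobackfill norecover; do
    echo "ceph osd set $flag"
done

# Once the cluster stabilizes, the flags would be cleared the same way:
for flag in nobackfill norecover; do
    echo "ceph osd unset $flag"
done
```

The flags only pause backfill and recovery scheduling; they do not prevent an OSD from hitting the assert while replaying or merging its PG log on startup, which may be why they did not help here.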
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com