Cascading failure

Hi Cephers,
I have a (test) Ceph cluster on which I had set some wrong CRUSH weights (my mistake). When I tried to correct them (e.g. changing a weight from 20 down to 5), the cluster went into a cascading failure: lots of OSDs started going down, and as I brought some of them back up, others went down. I tried the following (the corresponding commands are sketched below):
  1> Stop all client traffic. The weight change was made while client traffic was running, which is what triggered the failures.
  2> Increase osd_op_thread_timeout so that the OSDs do not crash themselves under heavy load (e.g. the load from peering).
  3> Set the nodown flag at times, to avoid too many map changes back and forth.
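
For reference, the commands behind the steps above were roughly the following (OSD id is a placeholder and 300 is just an illustrative timeout value, not necessarily what you would want):

  $ ceph osd crush reweight osd.<id> 5.0                      # correct the wrong CRUSH weight (was 20)
  $ ceph tell osd.* injectargs '--osd_op_thread_timeout 300'  # raise the op thread timeout under heavy peering load
  $ ceph osd set nodown                                       # temporarily stop OSDs from being marked down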

With all of those changes I no longer see OSDs crash. However, I am still unable to bring the down OSDs back up, even though the daemons themselves are alive, and they appear to be stuck forever. From what I can observe, those (down) OSDs are still processing peering events: they have been in that state for at least 12 hours, constantly logging the following messages over and over again:

2015-07-29 15:49:16.953804 7f46b7281700 10 osd.297 pg_epoch: 56499 pg[6.cd7s0( v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387 les/c 56237/2872 56496/56498/56498) [297,512,1,118,127,161,99,132,23,451,2147483647] r=0 lpr=56498 pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919] search_for_missing 2f402cd7/default.12598.168_osd042c014.cos.bf2.yahoo.com_7a46ad192c5fdb7ff235ef9a2b68760f/head//6 2930'4649 also missing on osd.297(0)
2015-07-29 15:49:16.963061 7f46b7281700 10 osd.297 pg_epoch: 56499 pg[6.cd7s0( v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387 les/c 56237/2872 56496/56498/56498) [297,512,1,118,127,161,99,132,23,451,2147483647] r=0 lpr=56498 pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919] search_for_missing 80502cd7/default.12598.122_osd033c014.cos.bf2.yahoo.com_b9d89ace80b6ff88485269fd20676697/head//6 2829'1511 also missing on osd.297(0)
2015-07-29 15:49:16.972281 7f46b7281700 10 osd.297 pg_epoch: 56499 pg[6.cd7s0( v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387 les/c 56237/2872 56496/56498/56498) [297,512,1,118,127,161,99,132,23,451,2147483647] r=0 lpr=56498 pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919] search_for_missing e1502cd7/default.12598.135_osd004c014.cos.bf2.yahoo.com_ffc8e0df7eee2ff7fb0ddf6fe90d18b0/head//6 2783'1 also missing on osd.297(0)
2015-07-29 15:49:16.981487 7f46b7281700 10 osd.297 pg_epoch: 56499 pg[6.cd7s0( v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387 les/c 56237/2872 56496/56498/56498) [297,512,1,118,127,161,99,132,23,451,2147483647] r=0 lpr=56498 pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919] search_for_missing b4502cd7/default.14462.240_osd030c014.cos.bf2.yahoo.com_cd3acb61271cffa73b9bb6d1622ff294/head//6 2845'3668 also missing on osd.297(0)
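
In case it helps, I can also dump the full peering state for one of these PGs and for the stuck OSD itself, roughly:

  $ ceph pg 6.cd7 query            # full peering/query output for the PG in the log above
  $ ceph daemon osd.297 status     # state of the stuck OSD as reported over its admin socket

and attach the output if that is useful.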

I have several questions:
  1> What could cause an OSD to be marked down while its daemon is still alive? The only thing I can think of is that the OSD failed to respond to heartbeat pings, but I could not find any log messages confirming that.
  2> I thought setting the nodown flag would help in this situation, but even after a long time, as soon as I unset the flag those OSDs are immediately marked down again (concrete sequence below). Is that expected?
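
To make 2> concrete, the sequence is roughly:

  $ ceph osd set nodown        # the down OSDs can then come up and stay up
  $ ceph osd unset nodown      # immediately after this, the same OSDs are marked down again

even though the ceph-osd processes are still running the whole time.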

Thanks,
Guang