On Sat, Nov 21, 2015 at 1:34 AM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> We had two interesting issues today. In both cases, multiple OSDs
> suicided at the exact same moment. The first incident had four OSDs,
> the second had 12.
>
> First set:
> 145, 159, 79, 176
>
> Second set:
> osd.177 down at 20:59:48,
> osd.131, osd.136, osd.133, osd.139, osd.175, osd.170, osd.73 down at 21:00:03,
> osd.178, osd.179 down at 21:00:07,
> osd.159 down at 21:01:22,
> osd.110 down at 21:01:28
>
> Only one OSD failed both times, and only a couple of boxes had more
> than one OSD fail. The failures were spread throughout the cluster.
> Nothing in dmesg/messages/sar indicates any kind of hardware problem
> on the OSD hosts. All the OSDs report slow I/O and missed heartbeats
> starting at 20:57:32.
>
> The other odd thing is that most of the VMs across our 16 KVM hosts
> were fine, but several VMs on one host had kernel panics. In that
> host's messages log we see a kernel backtrace:
>
> Nov 20 20:58:32 compute8 kernel: WARNING: CPU: 4 PID: 0 at
> net/core/dev.c:2223 skb_warn_bad_offload+0xb6/0xbd()
>
> That host's clock was exactly one minute fast. Everything points to
> this host as having the issue, but I'm having a hard time
> understanding how a client (or several clients) could cause several
> OSDs to suicide. Can a non-responsive client in some way cause the OSD
> to fault?

No, it shouldn't be able to, just by having clock issues or the like.
There *are* still some ways a malformed request can cause OSDs to
crash, though, and this looks like it could be a network card issue:
flipped bits on the wire could produce exactly that kind of malformed
traffic. What's the backtrace on the OSDs?
-Greg
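
One quick way to gather the backtraces Greg is asking about is to scan each
OSD's log for the assert that precedes the crash. Below is a minimal Python
sketch; the default log path and the exact marker strings ("hit suicide
timeout", "FAILED assert") are assumptions to check against what your logs
actually contain.

#!/usr/bin/env python3
# Hypothetical helper, not part of Ceph: pull the assert/backtrace block
# out of each OSD log so the crashes can be compared side by side.
# Assumes the default log location (/var/log/ceph/ceph-osd.*.log) and that
# the crash line contains "hit suicide timeout" or "FAILED assert"; adjust
# both to match your install.
import glob

CONTEXT_LINES = 40  # lines to keep after the assert; backtrace length varies

def extract_backtraces(pattern="/var/log/ceph/ceph-osd.*.log"):
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as f:
            lines = f.readlines()
        for i, line in enumerate(lines):
            if "hit suicide timeout" in line or "FAILED assert" in line:
                print("==== %s (line %d) ====" % (path, i + 1))
                print("".join(lines[i:i + CONTEXT_LINES]))
                break  # first crash per log is usually the interesting one

if __name__ == "__main__":
    extract_backtraces()

Run it on each OSD host (or point it at logs copied to one place) and compare
whether the affected OSDs share the same call path and timestamps.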