Multiple OSDs suicide because of client issues?

Robert LeBlanc <robert@xxxxxxxxxxxxx> · Sat, 21 Nov 2015 00:34:04 -0700

We had two interesting issues today. In both cases multiple OSDs
suicided at the exact same moment. The first incident had four OSDs,
the second had 12.

First set:
osd.145, osd.159, osd.79, osd.176

Second set:
osd.177 down at 20:59:48,
osd.131, osd.136, osd.133, osd.139, osd.175, osd.170, osd.73 down at 20:00:03,
osd.178, osd.179 down at 21:00:07,
osd.159 down at 21:01:22,
osd.110 down at 21:01:28

Only one OSD (osd.159) failed both times, and only a couple of boxes
had more than one OSD fail. The failures were spread throughout the
cluster. Nothing in dmesg/messages/sar indicates any type of hardware
problem on the OSD hosts. All the OSD logs show slow I/O and missed
heartbeats starting at 20:57:32.
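
For anyone who wants to double-check the overlap between the two
incidents, here is a minimal sketch in Python. The OSD IDs are copied
from the two lists in this message; the variable names are only
illustrative.

```python
# OSD IDs copied from the two lists in this message.
first_set = {145, 159, 79, 176}
second_set = {177, 131, 136, 133, 139, 175, 170, 73, 178, 179, 159, 110}

# OSDs that went down in both incidents.
repeat_failures = sorted(first_set & second_set)
print(repeat_failures)
```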

The other odd thing is that most of the VMs across our 16 KVM hosts
were fine, but several VMs on one host had kernel panics. In the
messages logs of that host we see a kernel backtrace:

Nov 20 20:58:32 compute8 kernel: WARNING: CPU: 4 PID: 0 at
net/core/dev.c:2223 skb_warn_bad_offload+0xb6/0xbd()

That host's clock was exactly one minute fast, so the 20:58:32
backtrace lines up with the 20:57:32 start of the slow I/O on the
OSDs. Everything points to this host as the source of the problem,
but I'm having a hard time understanding how a client (or several
clients) could cause several OSDs to suicide. Can a non-responsive
client in some way cause an OSD to fault?
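
The timestamp comparison can be sketched as follows. This is only a
rough check: it assumes the OSD hosts' clocks are correct and that all
hosts log in the same time zone, and it takes the year from the date
of this message. The variable names are illustrative.

```python
from datetime import datetime, timedelta

# Assumption: compute8 runs one minute fast and the OSD hosts are correct.
CLOCK_SKEW = timedelta(minutes=1)

# Timestamp of the kernel warning as logged on compute8.
backtrace_local = datetime(2015, 11, 20, 20, 58, 32)
# Time at which the OSD logs start reporting slow I/O.
osd_slow_io_start = datetime(2015, 11, 20, 20, 57, 32)

# Subtract the skew to put the compute8 timestamp on the OSD hosts' clock.
backtrace_corrected = backtrace_local - CLOCK_SKEW
gap = backtrace_corrected - osd_slow_io_start
print(backtrace_corrected.time(), gap)
```

A gap of zero would mean the kernel warning on the compute host and
the first slow I/O on the OSDs happened in the same second.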

We have migrated all the VMs off this host and will continue to
monitor the cluster. If there is interest in the OSD logs, I can make
them available (librbd did not dump any logs).

Thanks,
- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1