Re: Multiple OSDs suicide because of client issues?

Gregory Farnum <gfarnum@xxxxxxxxxx> · Mon, 23 Nov 2015 11:17:56 -0600

On Mon, Nov 23, 2015 at 11:03 AM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> The backtrace is:
>
> 2015-11-20 20:59:48.856679 7f7012ff7700 -1 common/HeartbeatMap.cc: In
> function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*,
> const char*, time_t)' thread 7f7012ff7700 time 2015-11-20
> 20:59:48.833166
> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>
>  ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0xbc9d85]
>  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> const*, long)+0x2d9) [0xaff1f9]
>  3: (ceph::HeartbeatMap::is_healthy()+0xde) [0xaffaee]
>  4: (OSD::handle_osd_ping(MOSDPing*)+0x733) [0x696c43]
>  5: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x697ebb]
>  6: (DispatchQueue::entry()+0x62a) [0xc84c9a]
>  7: (DispatchQueue::DispatchThread::entry()+0xd) [0xba81cd]
>  8: (()+0x7df5) [0x7f702d85ddf5]
>  9: (clone()+0x6d) [0x7f702c3401ad]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- begin dump of recent events ---
>
> We have had problems with Large Receive Offloads and KVM VMs before. I
> think this host just got missed, or maybe it is something different.
> I'm ok with a host having a hard time accessing the Ceph cluster. I'm
> a bit concerned if a misbehaving client can cause multiple OSDs to
> fault. It would be good if the OSD were resistant to things like this by
> confining the damage to only those clients/connections.

Just this backtrace doesn't help much (something was slow, and it
timed out!), but there should be a log line including "had suicide
timed out after" just ahead of it (in that thread).
I guess it's vaguely possible the LRO got busted since the network
card on your client was dead? Not really anything we can do about that
though...
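
A minimal sketch of how one might pull that line out of an OSD log, assuming
each log line carries the thread id as a whitespace-separated token (as in the
backtrace header above). The helper names and the log path are hypothetical;
only the "had suicide timed out after" substring comes from this thread.

```python
from typing import Iterable, List

# Substring quoted in the reply above; nothing else about the line is assumed.
MARKER = "had suicide timed out after"


def find_suicide_lines(lines: Iterable[str], thread_id: str) -> List[str]:
    """Return lines that contain MARKER and carry thread_id as a whole token."""
    hits = []
    for line in lines:
        # split() makes this an exact token match, so a "0x..." thread
        # reference inside the message text does not count as the logging thread.
        if MARKER in line and thread_id in line.split():
            hits.append(line.rstrip("\n"))
    return hits


def scan_log(path: str, thread_id: str) -> List[str]:
    """Scan one OSD log file; the path is whatever your deployment uses."""
    with open(path, errors="replace") as fh:
        return find_suicide_lines(fh, thread_id)
```

For the crash above, something like scan_log("/path/to/osd.log", "7f7012ff7700")
would list the candidate lines logged by the thread that hit the assert.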

> I'm attaching the entire OSD log in case it is useful.

Uh, that doesn't have the backtrace in it.
-Greg

>
> Thanks for taking a look at this.
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Mon, Nov 23, 2015 at 9:03 AM, Gregory Farnum wrote:
>> No, it shouldn't be able to just by having clock issues or whatever.
>> There *are* still some ways a malformed request can cause the OSDs to
>> crash, though — it looks like maybe this is a network card issue? That
>> could have maybe flipped some bits that broke stuff. What's the
>> backtrace on the OSDs?
>> -Greg
>