-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I checked the SAR data and the disks for all the OSDs showed usual
performance until 20:57:32, when over the next few minutes the I/OPs,
bandwidth and latency all decreased. The only thing that I can think
of is that some replies to the client got hung up and backed up the
OSD process or something. There are a couple of other backtraces in
the log file, but I could not trace any of them to something useful.

2015-11-20 20:59:48.867197 7f6f95637700  0 -- 10.217.89.30:6804/1028318 >> 10.217.89.12:6800/29050 pipe(0x2fdd0000 sd=35 :57978 s=2 pgs=273 cs=1 l=0 c=0x419a9700).fault with nothing to send, going to standby
2015-11-20 20:59:48.917626 7f7012ff7700 -1 *** Caught signal (Aborted) **
 in thread 7f7012ff7700

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: /usr/bin/ceph-osd() [0xac8a32]
 2: (()+0xf130) [0x7f702d865130]
 3: (gsignal()+0x37) [0x7f702c27f5d7]
 4: (abort()+0x148) [0x7f702c280cc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f702cb839b5]
 6: (()+0x5e926) [0x7f702cb81926]
 7: (()+0x5e953) [0x7f702cb81953]
 8: (()+0x5eb73) [0x7f702cb81b73]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbc9f7a]
 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xaff1f9]
 11: (ceph::HeartbeatMap::is_healthy()+0xde) [0xaffaee]
 12: (OSD::handle_osd_ping(MOSDPing*)+0x733) [0x696c43]
 13: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x697ebb]
 14: (DispatchQueue::entry()+0x62a) [0xc84c9a]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0xba81cd]
 16: (()+0x7df5) [0x7f702d85ddf5]
 17: (clone()+0x6d) [0x7f702c3401ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

- --- begin dump of recent events ---
    -1> 2015-11-20 20:59:48.867197 7f6f95637700  0 -- 10.217.89.30:6804/1028318 >> 10.217.89.12:6800/29050 pipe(0x2fdd0000 sd=35 :57978 s=2 pgs=273 cs=1 l=0 c=0x419a9700).fault with nothing to send, going to standby
     0> 2015-11-20 20:59:48.917626 7f7012ff7700 -1 *** Caught signal (Aborted) **
 in thread 7f7012ff7700

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: /usr/bin/ceph-osd() [0xac8a32]
 2: (()+0xf130) [0x7f702d865130]
 3: (gsignal()+0x37) [0x7f702c27f5d7]
 4: (abort()+0x148) [0x7f702c280cc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f702cb839b5]
 6: (()+0x5e926) [0x7f702cb81926]
 7: (()+0x5e953) [0x7f702cb81953]
 8: (()+0x5eb73) [0x7f702cb81b73]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbc9f7a]
 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xaff1f9]
 11: (ceph::HeartbeatMap::is_healthy()+0xde) [0xaffaee]
 12: (OSD::handle_osd_ping(MOSDPing*)+0x733) [0x696c43]
 13: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x697ebb]
 14: (DispatchQueue::entry()+0x62a) [0xc84c9a]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0xba81cd]
 16: (()+0x7df5) [0x7f702d85ddf5]
 17: (clone()+0x6d) [0x7f702c3401ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Since we took the VMs off that client, we haven't had the problem show
up again.
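
For anyone chasing a similar crash, here is a minimal sketch of how one
might pull every log line for the crashing thread that mentions the
"had suicide timed out after" text. It is not part of Ceph. The helper
name find_in_thread and the regular expression are my own assumptions,
based only on the line layout visible in the log excerpts in this thread
(timestamp, thread id, level, message, with an optional "-N>" prefix
inside the recent-events dump).

```python
import re
from typing import Iterable, List

# Assumed layout, inferred from the excerpts quoted in this thread:
#   [   -N> ]YYYY-MM-DD HH:MM:SS.ffffff <thread-id> <level> <message>
LINE_RE = re.compile(
    r"^\s*(?:-?\d+>\s+)?"                           # optional "-1> " dump prefix
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "  # timestamp
    r"([0-9a-f]+) +"                                # thread id (hex)
    r"(-?\d+) "                                     # log level
    r"(.*)$"                                        # message
)


def find_in_thread(lines: Iterable[str], thread_id: str,
                   needle: str = "had suicide timed out after") -> List[str]:
    """Return the lines logged by thread_id whose message contains needle."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and m.group(2) == thread_id and needle in m.group(4):
            hits.append(line.strip())
    return hits
```

To use it, open the OSD log file and call
find_in_thread(log_file, "7f7012ff7700"), where 7f7012ff7700 is the
thread id from the abort above. Lines that do not match the assumed
layout (for example the backtrace frames) are simply skipped.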
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWU0yICRDmVDuy+mK58QAAyxcQAL7oA6TaXAEFLMzJRdO8
nt1LgGe0Q+l+PXqCatmk1kAKh8YM/yss0xriGCPpiar0m8KhiQtzlWOXTExk
DZIoYtFR7ZVzJCU2/1gQn8I/+tcYH7naxj2mCfyBuWz71wy1FFKfvdc/tUBx
h8pQ7e1w3eQfLayDw7ir/iU+iFlh4918DY61cqdblyAu5ALVvbNM1hdqVBau
nAwJsfIgtJyuzUXpxEk+TbH5VaZGwly1iJ2cVHvpPuSWhM0EzFGKsKYkHJbh
/XPecqMepzH6W9YK6cgmcqqKcWQoNoPoTCVvpBBkgzBCz5QiNIUobRKEx9yL
pQIy0eHlE7btLREEQRJ6jXXuvaBmLzVCHYiIBP68Efe5c9JU0+ZxmVjJ/H5b
gKWfi6SC80VMVyLPNEV35p+SK2UAjhmsplxpxErEkSj8U/8YdC0TzwauKwYN
k48ZiIWHfDN40cgcP/RuSZMuhfvqTSIyFifIGs5ADuDe47o3SIpI6rBt5MPs
ebmbvAMTT/1ez/JQ9ugJ83QKiSgPD/Sw5YffMF1S+J4mMKOGEl8mfv8HFyjo
J9chHcVYrQt8T3AaGKqJqwc4C4BKTGDm314Hf+iDxsROjMMzgtbGxGyQC7vv
SQnpMsQjikIZKsI/9hoAentFe9f3/ks7GZH2aEbUNTzz+BIn5pXHSycdXwb6
1TxG
=FmEY
-----END PGP SIGNATURE-----
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1


On Mon, Nov 23, 2015 at 10:17 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Mon, Nov 23, 2015 at 11:03 AM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> The backtrace is:
>>
>> 2015-11-20 20:59:48.856679 7f7012ff7700 -1 common/HeartbeatMap.cc: In
>> function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*,
>> const char*, time_t)' thread 7f7012ff7700 time 2015-11-20
>> 20:59:48.833166
>> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>>
>> ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x85) [0xbc9d85]
>> 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>> const*, long)+0x2d9) [0xaff1f9]
>> 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0xaffaee]
>> 4: (OSD::handle_osd_ping(MOSDPing*)+0x733) [0x696c43]
>> 5: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x697ebb]
>> 6: (DispatchQueue::entry()+0x62a) [0xc84c9a]
>> 7: (DispatchQueue::DispatchThread::entry()+0xd) [0xba81cd]
>> 8: (()+0x7df5) [0x7f702d85ddf5]
>> 9: (clone()+0x6d) [0x7f702c3401ad]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> - --- begin dump of recent events ---
>>
>> We have had problems with Large Receive Offloads and KVM VMs before. I
>> think this host just got missed, or maybe it is something different.
>> I'm ok with a host having a hard time accessing the Ceph cluster. I'm
>> a bit concerned if a misbehaving client can cause multiple OSDs to
>> fault. It would be good if the OSD is resistant to things like this by
>> compartmentalizing them to only those clients/connections.
>
> Just this backtrace doesn't help much (something was slow, and it
> timed out!), but there should be a log line including "had suicide
> timed out after" just ahead of it (in that thread).
> I guess it's vaguely possible the LRO got busted since the network
> card on your client was dead? Not really anything we can do about that
> though...
>
>>I'm attaching the entire OSD log in case it is useful.
>
> Uh, that doesn't have the backtrace in it.
> -Greg
>
>>
>> Thanks for taking a look at this.
>>
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Mon, Nov 23, 2015 at 9:03 AM, Gregory Farnum wrote:
>>> No, it shouldn't be able to just by having clock issues or whatever.
>>> There *are* still some ways a malformed request can cause the OSDs to
>>> crash, though -- it looks like maybe this is a network card issue? That
>>> could have maybe flipped some bits that broke stuff. What's the
>>> backtrace on the OSDs?
>>> -Greg
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.2.3
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWU0bgCRDmVDuy+mK58QAAcysP/1xI6paI89WDozrmE2sY
>> ehaF4sZsyy6y6mizsp+g7dXErNXtCIRQIg+LDjtS+SOnni+Z/XAhmLlCb5xM
>> tid3xqQhQPLD66QhFQsxEGQxvWI5urqHnGWRhpbjpz8Xa0ReAHYCLj8K6hh0
>> f7FHyqEjsEDtcqrk3+EI6bklBW7xgJy4zHQG+0MiZarzh5gSXvEpxrXo2KIr
>> qBUcEE585jddVhvEv+VQVuBagQlBEMLo4RTz+5mdwneijIGAIQlOUCXVTogp
>> d6aLaVQyCNMiAblJoFzr/UeV7E5ajQzd4QZ5i9H7ZD1sCwWMdV/pQNyYoDWk
>> 3dBQXeYrkU2KlH14iKOJa1jxAPWg9mnnsguesir1aWunR+LamL2tbBlgXcXG
>> 0NjIfl7q0yMm89jb7/JVAr8nyp3gOHdNaPRfd8FTilYoLGJFEB1j25q2qlBP
>> 8IBSZbldXlXi9HB78cU3/I2o44CsrPPzZgN0iJ0fT7mbRPujkZbsdk3SbFtu
>> eG1dXsZLNdSOgll5gSj11U8Kt4HvkF9dhatmqYeyZGFeBHOJqKhi0dw6yZ2T
>> sSFPsHRNt6vbc8ckF4NqyFyPTK5PTSqB8TdLiZXW8vHvWooxNOtdCFgjQtNY
>> kdb1kLsNW/z5dgE218kvwUnAObXaB9RkEJ47xi9o2FbVya+eHMYdM0JaEYxt
>> I48o
>> =Uufa
>> -----END PGP SIGNATURE-----