On Wed, Dec 7, 2016 at 2:58 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> Actually, Greg and Sage are working up other branches, nvm.
> -Sam
>
> On Wed, Dec 7, 2016 at 2:52 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> I just pushed a branch wip-14120-10.2.4 with a possible fix.
>>
>> https://github.com/ceph/ceph/pull/12349/ is a fix for a known bug
>> which didn't quite make it into 10.2.4, it's possible that
>> 165e5abdbf6311974d4001e43982b83d06f9e0cc which did made the bug much
>> more likely to happen. wip-14120-10.2.4 has that fix cherry-picked on
>> top of 10.2.4. Can you try it and let us know the result?
>> -Sam

Sam's explanation is correct given what we have so far. You should use
wip-msgr-jewel-fix to try the backport fix though (freshly-pushed by
Sage so it will be about an hour before it's available to install).

In slightly more detail: you are clearly seeing a problem with the
messenger, as indicated by the sock_recvmsg at the top of the CPU
usage list. We've seen this elsewhere very rarely, which is why
there's already a backport queued up which we didn't block on. The
15-minute period you're seeing is the default timeout we set on
sockets before we start marking them closed if there's no activity.
We're not quite sure why it's causing trouble now, although we have
one or two patches we are speculating about and looking into.

This didn't turn up in testing because as best we can tell it's only a
situation you can expect to encounter when you have idle TCP
connections between systems (or in fairly artificial failed
networking).

On Wed, Dec 7, 2016 at 3:02 PM, Ruben Kerkhof <ruben@xxxxxxxxxxxxxxxx> wrote:
> On Wed, Dec 7, 2016 at 11:58 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> Actually, Greg and Sage are working up other branches, nvm.
>> -Sam
>
> Ok, I'll hold. If the issue is in the SimpleMessenger, would it be
> safe to switch to ms type = async as a workaround?
> I heard that it will become the default in Kraken, but how stable is
> it in Jewel?

Nobody awake right now has any certainty about the state of backports,
so no, don't do that. :(
-Greg
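
For readers unfamiliar with the idle timeout mentioned above (sockets get marked closed after 15 minutes with no activity), here is a rough sketch of the general idea in Python. It is not Ceph's messenger code: the `IdleTracker` class, its method names, and the `IDLE_TIMEOUT_SECONDS` constant are all hypothetical, chosen only to illustrate how a per-connection "last activity" timestamp can be compared against a timeout.

```python
import time

# Hypothetical constant mirroring the 15-minute period described above.
IDLE_TIMEOUT_SECONDS = 15 * 60


class IdleTracker:
    """Illustrative only: remembers the last activity time of each
    connection and reports the ones that have been idle too long."""

    def __init__(self, timeout=IDLE_TIMEOUT_SECONDS, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock          # injectable so the logic can be tested
        self.last_activity = {}     # connection id -> timestamp of last traffic

    def touch(self, conn_id):
        """Record that traffic was just seen on this connection."""
        self.last_activity[conn_id] = self.clock()

    def reap_idle(self):
        """Return (and forget) connections with no activity for `timeout` seconds."""
        now = self.clock()
        idle = [c for c, t in self.last_activity.items()
                if now - t >= self.timeout]
        for c in idle:
            del self.last_activity[c]
        return idle
```

A caller would invoke `touch()` whenever data arrives on a connection and call `reap_idle()` periodically. Connections that only carry traffic occasionally are the ones that hit the timeout, which fits the observation that the problem shows up with idle TCP connections.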