Hi Gregory,

On Thu, Dec 8, 2016 at 12:10 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> In slightly more detail: you are clearly seeing a problem with the
> messenger, as indicated by the sock_recvmsg at the top of the CPU
> usage list. We've seen this elsewhere very rarely, which is why
> there's already a backport queued up which we didn't block on.
> The 15-minute period you're seeing is the default timeout we set on
> sockets before we start marking them closed if there's no activity.
>
> We're not quite sure why it's causing trouble now, although we have
> one or two patches we are speculating about and looking into.
>
> This didn't turn up in testing because as best we can tell it's only a
> situation you can expect to encounter when you have idle TCP
> connections between systems (or in fairly artificial failed
> networking).

For the OSDs running at 100% CPU, strace does indeed show a lot of
EAGAIN on some of the sockets.

I'll try to get some packet captures if I can.

Kind regards,
Ruben
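
For anyone following the thread, the EAGAIN pattern seen in strace can be reproduced in isolation. The snippet below is only a minimal Python sketch and not Ceph's messenger code: the function names (count_eagain, wait_readable) and the local socket pair are made up for illustration. It assumes nothing more than a non-blocking socket with an idle peer, which is roughly the "idle TCP connection" situation described above, and it shows why a loop that retries recv immediately can pin a CPU while a loop that waits for readiness does not.

```python
import errno
import select
import socket


def count_eagain(spins=1000):
    """Call recv() repeatedly on an idle non-blocking socket.

    Hypothetical helper for illustration only. Each call returns at once
    with EAGAIN/EWOULDBLOCK because there is no data to read, so a loop
    like this burns CPU without making progress.
    """
    a, b = socket.socketpair()  # stand-in for an idle connection
    a.setblocking(False)
    eagain = 0
    try:
        for _ in range(spins):
            try:
                a.recv(4096)
            except BlockingIOError as exc:
                if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                    eagain += 1
    finally:
        a.close()
        b.close()
    return eagain


def wait_readable(sock, timeout):
    """Sleep in select() until the socket is readable or the timeout expires.

    Returns True if data is ready. Unlike the retry loop above, this uses
    no CPU while the connection is idle.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable)


if __name__ == "__main__":
    print("EAGAIN results from 1000 immediate retries:", count_eagain(1000))
```

Running strace against a process stuck in the first loop would show a stream of recv calls failing with EAGAIN, similar in shape to what the busy OSDs show, although the actual cause inside the messenger may well be different.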