Re: 10.2.4 Jewel released

Ruben Kerkhof <ruben@xxxxxxxxxxxxxxxx> · Thu, 8 Dec 2016 00:17:20 +0100

Hi Gregory,

On Thu, Dec 8, 2016 at 12:10 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> In slightly more detail: you are clearly seeing a problem with the
> messenger, as indicated by the sock_recvmsg at the top of the CPU
> usage list. We've seen this elsewhere very rarely, which is why
> there's already a backport queued up which we didn't block on.
> The 15-minute period you're seeing is the default timeout we set on
> sockets before we start marking them closed if there's no activity.
>
> We're not quite sure why it's causing trouble now, although we have
> one or two patches we are speculating about and looking into.
>
> This didn't turn up in testing because as best we can tell it's only a
> situation you can expect to encounter when you have idle TCP
> connections between systems (or in fairly artificial failed
> networking).

For the OSDs running at 100% CPU, strace does indeed show a lot of
EAGAIN returns on some of the sockets.
I'll try to get some packet captures if I can.
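
To illustrate what that strace pattern would mean, here is a rough Python
sketch of the general non-blocking socket behaviour. It is not Ceph's
messenger code; the function names (spin_recv, wait_recv, demo) are made up
for illustration, and it assumes the busy OSDs are calling recv on an idle
socket without first waiting for it to become readable.

```python
import errno
import select
import socket


def spin_recv(sock, attempts):
    """Call recv() repeatedly on a non-blocking socket without waiting.

    Returns how many calls failed with EAGAIN/EWOULDBLOCK. On an idle
    connection every call fails immediately, so a loop like this burns CPU.
    """
    eagain = 0
    for _ in range(attempts):
        try:
            sock.recv(4096)
        except BlockingIOError as exc:
            if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                eagain += 1
    return eagain


def wait_recv(sock, timeout):
    """Sleep in select() until the socket is readable or the timeout expires.

    Returns the received bytes, or None if nothing arrived in time.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return None
    return sock.recv(4096)


def demo():
    # A connected local socket pair stands in for an idle TCP connection.
    a, b = socket.socketpair()
    a.setblocking(False)
    try:
        spins = spin_recv(a, 1000)    # idle peer: every recv() hits EAGAIN
        idle = wait_recv(a, 0.05)     # idle peer: select() times out instead
        b.sendall(b"ping")
        data = wait_recv(a, 1.0)      # data pending: recv() succeeds
        return spins, idle, data
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print(demo())
```

The first loop is the kind of thing that would show up in strace as a stream
of EAGAIN results, while the select-based version simply sleeps until there
is something to read.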

Kind regards,

Ruben