Re: Failing OSDs (suicide timeout) due to flaky clients

> On 5 July 2016 at 20:35, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> 
> 
> Uh, searching for OpTracker in my github emails leads me to
> https://github.com/ceph/ceph/pull/7148
> 

Ah, yes! That's probably the one.

Looking at it, this was only backported to Jewel, not to Hammer or Firefly.

- http://tracker.ceph.com/issues/14248
- https://github.com/ceph/ceph/commit/67be35cba7c384353b0b6d49284a4ead94c4152e

It applies cleanly on Hammer. I'm building packages and will see if it resolves the issue. Now I need to find a way to test this and reproduce the problem.

Wido

> I didn't try and trace the backports but there should be links from
> the referenced Redmine ticket, or you can search the git logs.
> -Greg
> 
> On Tue, Jul 5, 2016 at 11:32 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> >
> >> On 5 July 2016 at 19:48, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >>
> >>
> >> On Tue, Jul 5, 2016 at 10:45 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> >> >
> >> >> On 5 July 2016 at 19:27, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >> >>
> >> >>
> >> >> On Tue, Jul 5, 2016 at 2:10 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> >> >> >
> >> >> >> On 5 July 2016 at 10:56, huang jun <hjwsm1989@xxxxxxxxx> wrote:
> >> >> >>
> >> >> >>
> >> >> >> I see the OSD timed out many times.
> >> >> >> In SimpleMessenger mode, when sending a message, the PipeConnection
> >> >> >> holds a lock which may be held by other threads.
> >> >> >> It was reported before: http://tracker.ceph.com/issues/9921
> >> >> >>
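As I read huang jun's description, the failure shape is a classic lock convoy. A minimal sketch of it, using a plain std::mutex and hypothetical function names (not the actual SimpleMessenger/Pipe code):

    // Sketch only: illustrates the lock convoy, not Ceph's real classes.
    #include <chrono>
    #include <iostream>
    #include <mutex>
    #include <thread>

    std::mutex pipe_lock;  // stands in for the per-connection Pipe lock

    void send_to_flaky_client() {
        std::lock_guard<std::mutex> g(pipe_lock);
        // Simulates a send stuck in TCP retransmits to a peer that drops
        // most of its packets; the lock stays held the whole time.
        std::this_thread::sleep_for(std::chrono::seconds(10));
    }

    void queue_reply_on_same_connection() {
        // Only wants to queue a message, but must wait for the stuck send.
        std::lock_guard<std::mutex> g(pipe_lock);
        std::cout << "got the pipe lock after the stall\n";
    }

    int main() {
        std::thread sender(send_to_flaky_client);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::thread other(queue_reply_on_same_connection);  // stalls ~10s
        sender.join();
        other.join();
    }

If an OSD worker thread is the one stalling like `other` here, it stops reporting progress, and the suicide timeout fires even though nothing is locally wrong with the OSD.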
> >> >> >
> >> >> > Thank you! It certainly looks like the same symptoms we are seeing in this cluster.
> >> >> >
> >> >> > The bug has been marked as resolved, but are you sure it is?
> >> >>
> >> >> Pretty sure about that bug being done.
> >> >>
> >> >> The conntrack filling thing sounds vaguely familiar, though. Is this
> >> >> the latest Hammer? I think there were some leaks of messages while
> >> >> sending replies that could block up incoming queues, and those leaks
> >> >> were resolved later.
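If I understand the leak Greg is referring to, the shape would be something like the sketch below (simplified hypothetical types, not Ceph's actual Throttle/Message API): each incoming message takes budget from a byte throttle that is only returned when the message is destroyed, so a leaked message reference permanently shrinks the budget until dispatch blocks.

    // Sketch only: simplified stand-ins, not the real Ceph classes.
    #include <condition_variable>
    #include <cstdint>
    #include <memory>
    #include <mutex>

    class ByteThrottle {
        std::mutex m;
        std::condition_variable cv;
        int64_t used = 0;
        const int64_t max;
    public:
        explicit ByteThrottle(int64_t max_bytes) : max(max_bytes) {}
        void get(int64_t n) {  // blocks while the budget is exhausted
            std::unique_lock<std::mutex> l(m);
            cv.wait(l, [&] { return used + n <= max; });
            used += n;
        }
        void put(int64_t n) {
            std::lock_guard<std::mutex> l(m);
            used -= n;
            cv.notify_all();
        }
    };

    struct Message {
        int64_t length;
        ByteThrottle* throttle;
        ~Message() { throttle->put(length); }  // budget returned on free
    };

    int main() {
        ByteThrottle throttle(100);
        throttle.get(100);  // messenger reserves budget for an incoming msg
        auto leaked = std::shared_ptr<Message>(new Message{100, &throttle});
        // If a reply path keeps `leaked` alive forever, put() never runs
        // and the next get() from this client would block for good:
        // throttle.get(1);  // never returns
    }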
> >> >
> >> > Keep in mind, it's the conntrack table filling up on the client which results in >50% packet loss on that client.
> >> >
> >> > The cluster is not firewalled and doesn't do any connection tracking.
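For what it's worth, the client-side condition is easy to spot: Linux exposes the conntrack table usage in procfs. A quick diagnostic sketch (the paths below exist only on hosts with nf_conntrack loaded):

    // Compare conntrack usage to its limit; at the limit the kernel drops
    // new connections/packets, matching the client-side packet loss.
    #include <fstream>
    #include <iostream>

    static long read_proc(const char* path) {
        std::ifstream f(path);
        long v = -1;
        f >> v;
        return v;
    }

    int main() {
        long count = read_proc("/proc/sys/net/netfilter/nf_conntrack_count");
        long max   = read_proc("/proc/sys/net/netfilter/nf_conntrack_max");
        std::cout << "conntrack: " << count << " / " << max << "\n";
        if (max > 0 && count >= max * 9 / 10)
            std::cout << "WARNING: conntrack table nearly full; expect "
                         "dropped packets on this host\n";
    }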
> >> >
> >> > This is Hammer 0.94.5. If this is fixed in 0.94.6 or 0.94.7, do you have an idea which commit I should look for? Something (Simple)Messenger related?
> >>
> >> If it is one of the op leaks, it'll be in the OSD OpTracker changes that
> >> avoid keeping message references around for tracking purposes, which
> >> unblocks the client Throttles.
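As I read that, the fix would have roughly this shape (hypothetical simplified types, not the real OpTracker code): copy the few fields tracking needs out of the request and drop the message reference early, so the message's bytes go back to the client throttle instead of being pinned by op history.

    // Sketch only: the idea of dropping message refs, not Ceph's real API.
    #include <memory>
    #include <string>
    #include <vector>

    struct Message {  // stand-in; real messages pin client throttle budget
        std::string description() const { return "osd_op(...)"; }
    };

    struct TrackedOp {
        // Before: std::shared_ptr<Message> request;  // held until retired
        std::string desc;  // after: keep only what tracking needs

        explicit TrackedOp(const std::shared_ptr<Message>& m)
            : desc(m->description()) {}  // message ref not retained
    };

    struct OpTracker {
        std::vector<std::shared_ptr<TrackedOp>> history;
        void register_op(const std::shared_ptr<Message>& m) {
            history.push_back(std::make_shared<TrackedOp>(m));
            // `m` can now be freed when the op completes, returning its
            // bytes to the throttle instead of sitting in op history.
        }
    };

    int main() {
        OpTracker tracker;
        {
            auto req = std::make_shared<Message>();
            tracker.register_op(req);
        }  // req freed here even though the tracked op stays in history
    }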
> >
> > Thanks! I've been looking in the hammer and master branches, but I think I was unable to find the right commit. I've been looking for 45 minutes now, but nothing caught my attention.
> >
> > If you have the time, would you be so kind as to take a look?
> >
> > Wido
> >
> >> -Greg