Re: MonClient hunt interval

On Mon, Jan 25, 2016 at 4:20 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Mon, Jan 25, 2016 at 7:03 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>> On Mon, Jan 25, 2016 at 3:45 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>> On Mon, Jan 25, 2016 at 5:14 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>> Hi Greg,
>>>>
>>>> With 794c86fd289b ("monc: backoff the timeout period when
>>>> reconnecting") you made it so that the backoff is applied to the hunt
>>>> interval.  When the session is established, the multiplier is reduced
>>>> by 50% and that's it - I don't see any per-tick reduction or anything
>>>> like that.
>>>>
>>>> If a client had some bad luck and couldn't establish the session for
>>>> a while (so that the multiplier went all the way up to 10), its initial
>>>> timeout upon the next connection break is going to be 15 seconds no
>>>> matter how much time has passed in the interim.  Was that your intent?
>>>
>>> I don't remember this, but looking at the sha I logged that behavior
>>> in the commit message, so I'd have to say "yes". As it says, we're
>>> trying to respond to monitor load; if they're doing so badly that we
>>> had to increase our timeout when re-establishing a session, there's
>>> every chance it will continue to be slow. If we reset the timeout back
>>> to default, we'd have to go through a lot more monitor-punishing
>>> timeout rounds on the next failure than just cutting it in half would
>>> take.
>>
>> The timeout could have been increased due to intermittent networking
>> issues between the client and the monitor cluster.  The problem I see
>> here is that once it's increased to 30s, it's effectively never
>> decreased - since it's cut in half only once, that MonClient instance
>> is stuck with 15s as its initial timeout forever.
>>
>> I'm not advocating resetting it back to default right away, it's just
>> I expected to see some kind of slow backoff back to default.
>
> Mmm, that might make sense. There's just also a limit to how much this
> is worth worrying about — longer timeouts are bad only in the presence
> of actually-dead monitors, and only when your connection to one of the
> monitors dies. Any sort of gradual decay here would require more
> complicated state and some mechanism for determining the monitors have
> gotten happy now. Maybe you could feed it in based on response times
> of other requests...

Well, a *really* slow decay might not need to check whether the
monitors are happy and so wouldn't require any additional state.
Anyway, I'm not super worried about this either - I'm bringing it into
the kernel client and just wanted to make sure it behaves as intended
before I merge it in.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
