Re: MonClient hunt interval

Gregory Farnum <gfarnum@xxxxxxxxxx> · Mon, 25 Jan 2016 07:20:16 -0800



On Mon, Jan 25, 2016 at 7:03 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> On Mon, Jan 25, 2016 at 3:45 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> On Mon, Jan 25, 2016 at 5:14 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>> Hi Greg,
>>>
>>> With 794c86fd289b ("monc: backoff the timeout period when
>>> reconnecting") you made it so that the backoff is applied to the hunt
>>> interval.  When the session is established, the multiplier is reduced
>>> by 50% and that's it - I don't see any per-tick reduction or anything
>>> like that.
>>>
>>> If a client had some bad luck and couldn't establish the session for
>>> a while (so that the multiplier went all the way up to 10), its initial
>>> timeout upon the next connection break is going to be 15 seconds no
>>> matter how much time has passed in the interim.  Was that your intent?
>>
>> I don't remember this, but looking at the sha I logged that behavior
>> in the commit message, so I'd have to say "yes". As it says, we're
>> trying to respond to monitor load; if they're doing so badly that we
>> had to increase our timeout when re-establishing a session, there's
>> every chance it will continue to be slow. If we reset the timeout back
>> to default, we'd have to go through a lot more monitor-punishing
>> timeout rounds on the next failure than just cutting it in half would
>> take.
>
> The timeout could have been increased due to intermittent networking
> issues between the client and the monitor cluster.  The problem I see
> here is that once it's increased to 30s, it's effectively never
> decreased - since it's cut in half only once, that MonClient instance
> is stuck with 15s as its initial timeout forever.
>
> I'm not advocating resetting it back to default right away, it's just
> I expected to see some kind of slow backoff back to default.

Mmm, that might make sense. There's just also a limit to how much this
is worth worrying about - longer timeouts are bad only in the presence
of actually-dead monitors, and only when your connection to one of the
monitors dies. Any sort of gradual decay here would require more
complicated state and some mechanism for determining the monitors have
gotten happy now. Maybe you could feed it in based on response times
of other requests...
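
To make the arithmetic above concrete, here is a rough model of the
behavior being discussed, plus one possible shape for a gradual decay.
This is only a sketch, not the MonClient code: the class and method
names are made up, the 3s base interval is inferred from the 30s/15s
figures quoted above, the growth factor of 2 is an assumption, and the
decay step is purely hypothetical (it takes "the monitors look healthy"
as an input rather than solving how to detect that).

```python
# Illustrative model of the hunt-interval backoff discussed in this thread.
# All names and constants here are assumptions, not the real MonClient code.

BASE_INTERVAL = 3.0     # seconds; inferred from "30s" at a multiplier of 10
BACKOFF = 2.0           # assumed growth factor applied per failed hunt
MAX_MULTIPLIER = 10.0   # cap on the multiplier mentioned in the thread


class HuntTimer:
    def __init__(self):
        self.multiplier = 1.0

    def timeout(self):
        """Current hunt timeout in seconds."""
        return BASE_INTERVAL * self.multiplier

    def hunt_failed(self):
        """Back off after a hunt attempt times out, up to the cap."""
        self.multiplier = min(self.multiplier * BACKOFF, MAX_MULTIPLIER)

    def session_established(self):
        """Behavior described in the thread: halve once, then leave it."""
        self.multiplier = max(self.multiplier / 2.0, 1.0)

    def decay(self, healthy_ticks, factor=0.9):
        """Hypothetical gradual decay: shrink the multiplier once per tick
        on which the monitors looked healthy, never going below 1.0."""
        self.multiplier = max(self.multiplier * factor ** healthy_ticks, 1.0)


timer = HuntTimer()
for _ in range(4):          # multiplier goes 2, 4, 8, then is capped at 10
    timer.hunt_failed()
stuck_at_max = timer.timeout()      # 30.0 seconds

timer.session_established()
after_reconnect = timer.timeout()   # 15.0 seconds, and it stays there

timer.decay(healthy_ticks=100)
after_decay = timer.timeout()       # back to the 3.0 second base
```

Without the decay call the timer stays at 15s indefinitely, which is
the case Ilya describes. The hard part is still deciding what counts as
a "healthy tick".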
-Greg