Re: ping timeout

Christopher Hawkins <chawkins@xxxxxxxxxxx> · Thu, 18 Mar 2010 10:59:41 -0400 (EDT)

Thanks Stephan. But in my testing, I see the exact opposite. The hang is painful (everything stops), but the reconnect causes no problems at all. It seems to work great (good job on 3.0!). What kind of problems is it causing for you? Maybe there is something I am missing in my test setup.

You mention that stopping and restarting glusterfsd on one box works out well... That is a reconnect, as far as I can tell. There is no hang because when you shut it down, the gluster client immediately gets a connection refused and doesn't wait for the timeout period:
[2010-03-18 10:04:46] E [socket.c:760:socket_connect_finish] master2: connection to 10.0.0.102:3302 failed (Connection refused)

As opposed to the server just going away, which hangs for a while:
[2010-03-18 10:05:44] E [client-protocol.c:415:client_ping_timer_expired] master2: Server 10.0.0.102:3302 has not responded in the last 42 seconds, disconnecting.
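
To make the difference concrete, here is a minimal Python sketch (not GlusterFS code) of why the two cases feel so different from the client side: a refused TCP connect fails at once, while a peer that has silently gone away is only noticed when a timer runs out. The `probe` helper and its `timeout` argument are names made up for this illustration, and the 42 second default only mirrors the log line above.

```python
import socket


def probe(host, port, timeout=42.0):
    """Classify one TCP connect attempt as 'connected', 'refused' or 'timeout'.

    Hypothetical helper for illustration only; not part of GlusterFS.
    Other OS-level errors (for example "no route to host") are not
    handled here and will propagate to the caller.
    """
    try:
        # The connect fails immediately if the host is up but nothing is
        # listening on the port (daemon stopped): the kernel refuses it.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        # No answer at all (host gone): we only find out after the
        # full timeout has elapsed.
        return "timeout"
```

Calling something like `probe("10.0.0.102", 3302, timeout=5.0)` against a box whose glusterfsd was stopped should come back as "refused" right away, whereas a box that has dropped off the network would be expected to take the full 5 seconds before "timeout" is returned.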

But when you start it up again, you should get reconnected quickly and with no problems:
[2010-03-18 09:00:00] N [afr.c:2625:notify] mirror1: Subvolume 'master1' came back up; going online.
[2010-03-18 09:00:00] N [client-protocol.c:6228:client_setvolume_cbk] master1: Connected to 10.0.0.101:3301, attached to remote volume 'threads2'.
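
If anyone wants to compare their own client logs against these three cases, a rough Python sketch for picking the events out of a log follows. The line layout is only inferred from the messages quoted in this mail, and `parse_line` and `classify` are hypothetical helper names, so treat it as a starting point rather than a description of the real log format.

```python
import re
from datetime import datetime

# Layout inferred from the log lines quoted in this thread:
# [timestamp] LEVEL [file:line:function] subvolume: message
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] "
    r"(?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<subvol>[^:]+): (?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict for one client log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%Y-%m-%d %H:%M:%S")
    return rec


def classify(rec):
    """Map a parsed record to 'refused', 'ping_timeout', 'connected' or None."""
    func = rec["func"]
    if func == "socket_connect_finish" and "Connection refused" in rec["msg"]:
        return "refused"
    if func == "client_ping_timer_expired":
        return "ping_timeout"
    if func == "client_setvolume_cbk" and rec["msg"].startswith("Connected to"):
        return "connected"
    return None
```

Subtracting the `ts` of a "ping_timeout" or "refused" record from the next "connected" record for the same `subvol` gives a rough figure for how long that subvolume was unavailable.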

It seems to me that disconnect/reconnect is only painful because the ping timeout is so long. On a high-latency network, maybe you need that to avoid frequent little split brains, but on a low-latency network, long ping timeouts seem to cause more problems than they fix. Or are you experiencing something that I am not?

Christopher Hawkins

----- "Stephan von Krawczynski" <skraw@xxxxxxxxxx> wrote:

> Hi Christopher,
> 
> I advise you to really try the most important part of your description,
> the one you take for granted: the reconnect case.
> Our experience is quite far from what you think is the worst case. You
> can easily check what happens if you just pull the network cable 5 times
> in 10 minutes. We came to the conclusion that disconnect/reconnect should
> be avoided under all circumstances. Interestingly, stopping one server's
> glusterfsd and restarting it works out quite well in our setup. So
> offline-updating a server (which was our main purpose) is quite OK.
> 
> -- 
> Regards,
> Stephan
> 
> 
> 
> On Thu, 18 Mar 2010 08:33:51 -0400 (EDT)
> Christopher Hawkins <chawkins@xxxxxxxxxxx> wrote:
> 
> > I have a question re: ping timeout for any of the devs. The minimum
> > value is 5 and the max is 1013... But in my case, I use replicate to
> > mirror server pairs that are each gigabit connected by crossover
> > cables. The latency is very low. 5 seconds is a long time, and
> > personally I would like them to give up on the failed link after 500ms
> > or so, so the mountpoint becomes available quickly to the remaining
> > node.
> >
> > Or I would at least like to test it and see if it's stable that way;
> > I don't mind getting disconnected early in the case of a slow server,
> > because it will just reconnect when the server comes back. Is there
> > any hope for being able to tweak this parameter? Or is there a reason
> > why it simply cannot be lower than 5?
> > 
> > Thanks for any insight and for glusterfs!
> > 
> > Christopher Hawkins
> > 
> > 
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel@xxxxxxxxxx
> > http://lists.nongnu.org/mailman/listinfo/gluster-devel
> >