On Sun, 31 Jan 2010 00:29:55 +0000
Gordan Bobic <gordan@xxxxxxxxxx> wrote:

> Stephan von Krawczynski wrote:
> > On Fri, 29 Jan 2010 18:41:10 +0000
> > Gordan Bobic <gordan@xxxxxxxxxx> wrote:
> >
> >> I'm seeing things like this in the logs, coupled with things locking
> >> up for a while until the timeout is complete:
> >>
> >> [2010-01-29 18:29:01] E
> >> [client-protocol.c:415:client_ping_timer_expired] home2: Server
> >> 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
> >> [2010-01-29 18:29:01] E
> >> [client-protocol.c:415:client_ping_timer_expired] home2: Server
> >> 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
> >>
> >> The thing is, I know for a fact that there is no network outage of
> >> any sort. All the machines are on a local gigabit ethernet, and there
> >> is no connectivity loss observed anywhere else. ssh sessions to the
> >> machines that are supposedly "not responding" remain alive and well,
> >> with no lag.
> >
> > What you're seeing here is exactly what made us increase the
> > ping-timeout to 120. To us it is obvious that the keep-alive strategy
> > does not cope with minimal packet loss. On _every_ network you can
> > see packet loss (read the docs of your switch carefully). We had the
> > impression that the strategy implemented is not aware of the fact
> > that a lost ping packet is no proof of a disconnected server, but
> > only a hint to take a closer look.
>
> It sounds like there need to be more heartbeats per minute. One packet
> every 10 seconds might be a good figure to start with, but I cannot see
> how even one packet per second would be harmful unless the number of
> nodes gets very large, and disconnection should be triggered only after
> some threshold number (certainly > 1) of them are lost in a row. Are
> there options to tune such parameters in the volume spec file?

Really, if you go that way there should definitely be tunable
parameters, because one ping per second is probably not a good idea over
a (slow) WAN. I have found none so far ...

Slightly off-topic, I would like to ask whether you, too, have seen
glusterfs use considerably more bandwidth than a comparable NFS
connection on the server network side. It really looks a bit like a
waste of resources to me...

-- 
Regards,
Stephan
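
For reference, the knob being discussed is the protocol/client
translator's ping-timeout option, which can be set per client volume in
the spec file. A minimal client-side sketch, assuming a 2.x/3.0-era
volfile and a server-side subvolume named "brick" (host, port, and
volume name are taken from the log above; the subvolume name is an
assumption):

  volume home2
    type protocol/client
    option transport-type tcp
    option remote-host 10.2.0.10
    option remote-port 6997
    option remote-subvolume brick   # assumed name of the exported volume
    option ping-timeout 120         # default is 42 seconds, as in the log
  end-volume

Note that raising ping-timeout only widens the window before
client_ping_timer_expired fires; per the log message, the timer tracks
seconds since the last response, so this does not add the consecutive
lost-ping threshold Gordan proposes.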