I see what you mean. Hopefully that behavior is fixed in 3.0. Though in
my case, I would still like fast disconnect because the data mirror is
active / passive. There should be no problems for glusterfs to figure
out which side has the new data because only one server will be
receiving writes at any given time.

Christopher Hawkins

----- "Stephan von Krawczynski" <skraw@xxxxxxxxxx> wrote:

> On Thu, 18 Mar 2010 10:59:41 -0400 (EDT)
> Christopher Hawkins <chawkins@xxxxxxxxxxx> wrote:
>
> > Thanks Stephan. But in my testing, I see the exact opposite. The
> > hang is painful (everything stops) but the reconnect causes no
> > problems at all. It seems to work great (good job on 3.0!) What kind
> > of problems is it causing for you? Maybe there is something I am
> > missing in my test setup.
>
> We experienced _server_ hangs that could only be cured by
> hard-resetting the box. We tried with glusterfs 2.X, not 3.X.
>
> > You mention that stopping and restarting glusterfsd on one box works
> > out well... That is a reconnect, as far as I can tell. There is no
> > hang because when you shut it down, the gluster client immediately
> > gets a connection refused and doesn't wait for the timeout period:
> > [2010-03-18 10:04:46] E [socket.c:760:socket_connect_finish] master2: connection to 10.0.0.102:3302 failed (Connection refused)
>
> Yes, of course. But the servers are healthy in this case.
>
> > As opposed to the server just going away, which hangs for a while:
> > [2010-03-18 10:05:44] E [client-protocol.c:415:client_ping_timer_expired] master2: Server 10.0.0.102:3302 has not responded in the last 42 seconds, disconnecting.
> >
> > But when you start it up again, you should get reconnected quickly
> > and with no problems:
> > [2010-03-18 09:00:00] N [afr.c:2625:notify] mirror1: Subvolume 'master1' came back up; going online.
> > [2010-03-18 09:00:00] N [client-protocol.c:6228:client_setvolume_cbk] master1: Connected to 10.0.0.101:3301, attached to remote volume 'threads2'.
> >
> > Seems to me that disconnect / reconnect is only painful because ping
> > timeout is so long... And on a high latency network, maybe you need
> > that to avoid frequent little split brains, but on a low latency
> > network, long ping timeouts seem to cause more problems than they
> > fix. Or are you experiencing something that I am not?
>
> There is really one thing that we did not think of either in the first
> place: network packet loss. We came across the whole problem because
> every now and then pings just seem to vanish. Then, after the
> corresponding server got kicked out by the client, the server entered
> a freeze state where its local fs seemed to hang indefinitely.
> Even the best switches have a minimum amount of packet loss on the
> network. If you reduce the ping time to very low values you make sure
> that your servers get disconnected once a day (if you have enough data
> throughput). Together with another phenomenon - glusterfs failing to
> identify the latest file version - your data may be trash within a
> month of runtime.
> We made these experiences during the last few months.
>
> --
> Regards
> Stephan
>
> > Christopher Hawkins
> >
> > ----- "Stephan von Krawczynski" <skraw@xxxxxxxxxx> wrote:
> >
> > > Hi Christopher,
> > >
> > > I advise you to really try the most important part of your
> > > description you take for granted - the reconnect case.
> > > Our experiences are quite far from what you think is the worst
> > > case. You can easily check out what happens if you just pull the
> > > network cable 5 times in 10 minutes. We came to the conclusion
> > > that disconnect/reconnect should be avoided under all
> > > circumstances. Interestingly, stopping one server's glusterfsd and
> > > restarting it works out quite well in our setup.
> > > So offline-updating a server (which was our main purpose) is quite
> > > ok.
> > >
> > > --
> > > Regards,
> > > Stephan
> > >
> > > On Thu, 18 Mar 2010 08:33:51 -0400 (EDT)
> > > Christopher Hawkins <chawkins@xxxxxxxxxxx> wrote:
> > >
> > > > I have a question re: ping timeout for any of the devs. The
> > > > minimum value is 5 and the max is 1013... But in my case, I use
> > > > replicate to mirror server pairs that are each gigabit connected
> > > > by crossover cables. The latency is very low. 5 seconds is a
> > > > long time and personally I would like them to give up on the
> > > > failed link after 500ms or so, so the mountpoint becomes
> > > > available quickly to the remaining node.
> > > >
> > > > Or I would at least like to test it and see if it's stable that
> > > > way; I don't mind getting disconnected early in the case of a
> > > > slow server, because it will just reconnect when the server
> > > > comes back. Is there any hope for being able to tweak this
> > > > parameter? Or is there a reason why it simply cannot be lower
> > > > than 5?
> > > >
> > > > Thanks for any insight and for glusterfs!
> > > >
> > > > Christopher Hawkins
> > > >
> > > > _______________________________________________
> > > > Gluster-devel mailing list
> > > > Gluster-devel@xxxxxxxxxx
> > > > http://lists.nongnu.org/mailman/listinfo/gluster-devel
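
For anyone who wants to compare disconnect and reconnect behaviour under different ping timeouts, here is a rough sketch of how the client log lines quoted in this thread could be tallied. It is not part of glusterfs; the event labels, the function names (parse_line, summarize) and the message substrings it keys on are assumptions taken only from the four log lines above, so other glusterfs versions may word these messages differently.

```python
import re
from datetime import datetime

# Layout of the log lines quoted in this thread (an assumption, not a spec):
# [timestamp] LEVEL [file.c:line:function] name: message
LOG_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+"
    r"(?P<level>[A-Z])\s+"
    r"\[(?P<src>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<name>[^:]+):\s+(?P<msg>.*)"
)

# Hypothetical event labels, keyed on substrings of the quoted messages.
EVENT_MARKERS = [
    ("Connection refused", "connect_refused"),
    ("has not responded in the last", "ping_timeout"),
    ("came back up", "subvolume_up"),
    ("Connected to", "connected"),
]


def parse_line(line):
    """Return a dict describing one log line, or None if it does not match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    msg = m.group("msg")
    event = "other"
    for marker, label in EVENT_MARKERS:
        if marker in msg:
            event = label
            break
    return {
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S"),
        "level": m.group("level"),
        "name": m.group("name"),
        "event": event,
        "message": msg,
    }


def summarize(lines):
    """Count events per (name, event) pair, skipping unparseable lines."""
    counts = {}
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        key = (rec["name"], rec["event"])
        counts[key] = counts.get(key, 0) + 1
    return counts


# The four log lines quoted in the thread.
SAMPLE = [
    "[2010-03-18 10:04:46] E [socket.c:760:socket_connect_finish] master2: "
    "connection to 10.0.0.102:3302 failed (Connection refused)",
    "[2010-03-18 10:05:44] E [client-protocol.c:415:client_ping_timer_expired] "
    "master2: Server 10.0.0.102:3302 has not responded in the last 42 seconds, "
    "disconnecting.",
    "[2010-03-18 09:00:00] N [afr.c:2625:notify] mirror1: Subvolume 'master1' "
    "came back up; going online.",
    "[2010-03-18 09:00:00] N [client-protocol.c:6228:client_setvolume_cbk] "
    "master1: Connected to 10.0.0.101:3301, attached to remote volume 'threads2'.",
]

if __name__ == "__main__":
    for key, count in sorted(summarize(SAMPLE).items()):
        print(key, count)
```

Running something like this over a day of client logs, once with a long and once with a short ping timeout, would show how often ping_timeout events actually occur on a given network, which is the question the packet loss argument turns on.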