On Tue, 8 Sep 2009 10:13:17 +1000 (EST)
"Jeff Evans" <jeffe at tricab.com> wrote:

> > - server was ping'able
> > - glusterfsd was disconnected by the client because of missing
> >   ping-pong
> > - no login possible
> > - no fs action (no lights on the hd-stack)
> > - no screen (was blank, stayed blank)
>
> This is very similar to what I have seen many times (even back on
> 1.3), and have also commented on the list.
>
> It seems that we have quite a few ACK's on this, or similar problems.
>
> The only thing different in my scenario, is that the console doesn't
> stay blank. When attempting to login I get the last login message, and
> nothing more, no prompt ever. Also, I can see that other processes are
> still listening on sockets etc.. so it seems like the kernel just
> can't grab new FD's.
>
> I too found the hang happens more easily if a downed node from a
> replicate pair re-joins after some time.
>
> Following suggestions that this is all kernel related, I have just
> moved up to RHEL 5.4 in the hope that the new kernel will
> help.
>
> This fix stood out as potentially related for me:
> https://bugzilla.redhat.com/show_bug.cgi?id=44543

This is an ext3 fix. It is unlikely that we ran into a similar effect
on reiserfs3; the two are very different in internals and code.

> We also have a broadcom network card, which had reports of hangs under
> load, the kernel has a patch for that too.

We used tg3 in this setup, but the load was not very high (below
10 Mbit/s on a 1000 Mbit/s link).

> If I still run into the hangs, I'll try xfs.

I doubt that this can be a real solution. My guess is that glusterfsd
runs into some race condition where it locks itself up completely. It
is no fun to debug something like that on a production setup. The best
approach would be to have debugging output sent from the servers'
glusterfsd directly to a client to save the logs. I would not count on
syslog in this case; if it survives, though, one could use a serial
console for syslog output.

> Thanks, Jeff.
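
A minimal sketch of the "send the debugging output to another box"
idea, in case it helps. This is not a glusterfs feature, just a small
stand-alone Python script that follows a log file and ships every new
line as a UDP datagram. The names forward_lines and follow, and the log
path, address and port in the usage comment, are placeholders I made up
for illustration. Whether such a process still gets scheduled once the
server hangs is untested.

```python
import socket
import time


def forward_lines(lines, host, port):
    """Send each non-empty line as one UDP datagram; return how many were sent."""
    sent = 0
    # The socket is created once, up front, so no new file descriptor
    # is needed while lines are being forwarded.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            payload = line.rstrip("\n").encode("utf-8", errors="replace")
            if payload:
                sock.sendto(payload, (host, port))
                sent += 1
    return sent


def follow(path, poll_interval=0.5):
    """Yield lines appended to `path` from now on, similar to `tail -f`."""
    with open(path, "r", errors="replace") as fh:
        fh.seek(0, 2)  # start at the current end of the file
        while True:
            line = fh.readline()
            if line:
                yield line
            else:
                time.sleep(poll_interval)


# Hypothetical usage on the server (path, address and port are assumptions):
#   forward_lines(follow("/path/to/glusterfsd.log"), "192.0.2.10", 5140)
```

On the receiving side anything that listens on the UDP port and writes
to disk will do. Both the log file and the socket are opened before the
hang, so the script should not need a new FD at the critical moment.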
-- 
Regards,
Stephan