Well, I'm wondering now if this might all be fixed with the rc4 release that was just posted. What kind of lockup issues did it fix? Basically, I was able to replicate an issue by bringing down the first storage brick: Apache sessions would stall and drive the system load above 100. The same issue was occurring on the cluster for no apparent reason, and I wasn't able to determine a root cause.

Justice London
E-mail: jlondon at lawinfo.com

-----Original Message-----
From: Vikas Gorur [mailto:vikas at gluster.com]
Sent: Friday, August 07, 2009 2:26 AM
To: Justice London
Cc: gluster-users at gluster.org
Subject: Re: 'Primary' brick outage or reboot issues

----- "Justice London" <jlondon at lawinfo.com> wrote:

> It appears that if the first brick in a replicated/distributed
> configuration is rebooted or suffers some sort of temporary issue, two
> things happen: the brick doesn't appear to be dropped from the cluster
> after 10 seconds, and after it comes back up, pending transactions have
> issues for the next 10 minutes or so. Is this a locks issue or is this
> a bug?

If the first subvolume silently goes down (without resetting the
connection), then an 'ls' will hang for 10 seconds (this is the "ping"
timeout) because replicate will not notice until then that the server
has failed. Other operations should work fine, though.

Can you elaborate on what you mean by 'pending transactions' and what
kind of issues they face?

Vikas

--
Engineer - http://gluster.com/
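(For context, the 10-second window Vikas describes corresponds to the ping-timeout option on the client translator. A minimal sketch of where it lives, assuming a GlusterFS 2.x-style client volfile; the hostname, volume, and subvolume names here are placeholders, not taken from the original thread:)

```
# Hypothetical client-side volfile fragment (names are placeholders).
volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host server1        # first storage brick
  option remote-subvolume brick1
  option ping-timeout 10            # seconds before a silently-dead server is declared failed
end-volume
```

(Raising ping-timeout trades slower failure detection for fewer false positives on a congested network; lowering it makes a silent brick failure, like the one described above, stall operations for a shorter window.)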