Well, I'm wondering now if this might all be fixed with the rc4 release that was just posted. What kind of lockup issues did it fix? Basically, I was able to replicate an issue by bringing down the first storage brick: Apache sessions would stall and drive the system load above 100. The same issue was occurring on the cluster for no apparent reason, and I wasn't able to determine a root cause.

Justice London
E-mail: jlondon at lawinfo.com

-----Original Message-----
From: Vikas Gorur [mailto:vikas at gluster.com]
Sent: Friday, August 07, 2009 2:26 AM
To: Justice London
Cc: gluster-users at gluster.org
Subject: Re: 'Primary' brick outage or reboot issues

----- "Justice London" <jlondon at lawinfo.com> wrote:

> It appears that if the first brick in a replicated/distributed
> configuration is rebooted or suffers some sort of temporary issue, two
> things happen: the brick doesn't appear to be dropped from the cluster
> after 10 seconds, and after it comes back up, pending transactions have
> issues for the next 10 minutes or so. Is this a locks issue or is this
> a bug?

If the first subvolume silently goes down (without resetting the
connection), then an 'ls' will hang for 10 seconds (this is the "ping"
timeout) because replicate will not notice until then that the server
has failed. Other operations should work fine, though.

Can you elaborate on what you mean by 'pending transactions' and what
kind of issues they face?

Vikas

--
Engineer - http://gluster.com/
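(For context, the 10-second window Vikas describes corresponds to the ping-timeout option on the client translator. A minimal sketch of where it lives, assuming a GlusterFS 2.x-style client volfile; the hostname, volume, and subvolume names here are placeholders, not taken from the original thread:)

```
# Hypothetical client-side volfile fragment (names are placeholders).
volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host server1        # first storage brick
  option remote-subvolume brick1
  option ping-timeout 10            # seconds before a silently-dead server is declared failed
end-volume
```

(Raising ping-timeout trades slower failure detection for fewer false positives on a congested network; lowering it makes a silent brick failure, like the one described above, stall operations for a shorter window.)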