Darren,

You will definitely want to upgrade the lock network switch to at least 100Mbit, and if you have the hardware you should seriously consider adding dedicated lock servers. Your load problems are being caused by lock traffic bottlenecks in your setup.

Britt

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Darren Jacobs
Sent: Monday, July 17, 2006 2:57 AM
To: linux clustering
Subject: BerkeleyDB locking problems with GFS 6.0?

We have a web cluster running on three dual 3GHz processor servers (RHEL 3) attached to a SATA SAN, with a single LUN shared among them using GFS 6.0. We're running lock_gulmd on each of these servers: they're lock servers as well as Apache servers. The locking network that attaches the servers is simply a 10Mbit hub. Network traffic is distributed by a hardware load balancer, not RH cluster.

We suffered a meltdown while doing a trial test of Movable Type (blogging software) on the cluster. We were using a BerkeleyDB backend database housed on the shared LUN. The software was installed on all three servers. Once we fired up Movable Type we noticed that the load average on each of the three servers climbed a bit above normal. On one box in particular we got up to a load average of 8, while the other two boxes were around 2. Everything still moved along OK, but we could see the load on the (8) box inching up. We noted what appeared to be some hung CGI processes associated with Movable Type. They resisted kill commands and couldn't be "kill -9"ed. So we decided to remove the highly loaded box from the cluster. The second we ran the command, the other two boxes' load averages shot to 100. Shortly thereafter they locked up. The boxes locked up so fast we couldn't pull any diagnostic data before they crashed.

I've seen behavior like this before when servers submit multiple I/O requests to a SAN and for some reason they don't return in a timely manner. The outstanding I/Os make the load average climb into the stratosphere. I'm thinking something like this happened here. However, because the servers tanked so quickly, I couldn't find out for certain.

We've mulled over the possibilities as to what the heck happened. Did concurrent access attempts from 3 servers to a BerkeleyDB database on a GFS partition blow us up? Should we have had a 100Mbit switch on the locking network instead of the 10Mbit hub? Separate lock servers? Any suggestions?

Regards,
Darren Jacobs
University of Toronto

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
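
(For reference: in GFS 6.0, the gulm lock servers Britt mentions are named in the CCS cluster.ccs file. The sketch below is only an illustration with placeholder cluster and host names, not the poster's actual configuration; check the exact syntax and parameters against the GFS 6.0 Administrator's Guide before using it.)

    cluster {
        name = "webfarm"                      # placeholder cluster name
        lock_gulm {
            # gulm wants an odd number of lock servers (1, 3, or 5).
            # Listing dedicated hosts here, rather than the Apache nodes,
            # keeps lock traffic off the web servers.
            servers = ["lock1", "lock2", "lock3"]
        }
    }

The three hostnames would be the machines running only lock_gulmd, attached to the (switched) locking network.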