Berkeley DB locking problems with GFS 6.0?

Darren Jacobs <darren.jacobs@xxxxxxxxxxx> · Mon, 17 Jul 2006 03:56:43 -0400

We have a web cluster running on three dual 3GHz processor servers (RHEL 
3) attached to a SATA SAN with a single LUN shared among them using GFS 
6.0.  We're running lock_gulmd on each of these servers:  they're 
lock servers as well as Apache servers.  The locking network that 
connects the servers is simply a 10Mbit hub.  Network traffic is 
distributed by a hardware load balancer, not RH cluster.

We suffered a meltdown while running a trial of Movable Type 
(blogging software) on the cluster.  We were using a Berkeley DB backend 
database housed on the shared LUN.  The software was installed on all 
three servers.

Once we fired up Movable Type we noticed that the load average on each 
of the three servers climbed a bit above normal.  On one box in 
particular we got up to a load average of 8 while the other two boxes 
were around 2.  Everything still moved along OK but we could see the 
load on the (8) box inching up.  We noted what appeared to be some hung 
CGI processes associated with Movable Type.  They resisted kill commands 
and couldn't be stopped even with 'kill -9'.

So we decided to remove the highly loaded box from the cluster.  The 
second we ran the command the other two boxes' load averages shot to 
100.  Shortly thereafter they locked up.  The boxes locked up so fast 
we couldn't pull any diagnostic data before they crashed.

I've seen behavior like the above when servers submit multiple I/O 
requests to a SAN and for some reason they don't return in a timely 
manner.  The outstanding I/Os make the load average climb into the 
stratosphere.  I'm thinking something like this happened here.  However, 
because the servers tanked so quickly I couldn't find out for certain.
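
One way to test that theory next time would be to look for processes 
stuck in uninterruptible sleep (state D), since on Linux those count 
toward the load average and ignore signals, including 'kill -9', until 
their I/O returns.  The following is only a rough sketch, assuming a 
Python 3 interpreter is available on the box (not a given on RHEL 3); 
the function names are made up for illustration.  It reads the state 
field from /proc/<pid>/stat and lists any D-state processes:

```python
import os

def parse_stat_line(line):
    """Return (pid, comm, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    lpar = line.index("(")
    rpar = line.rindex(")")
    pid = int(line[:lpar].strip())
    comm = line[lpar + 1:rpar]
    state = line[rpar + 1:].split()[0]
    return pid, comm, state

def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable mid-scan
        if state == "D":
            found.append((pid, comm))
    return found

if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

Running something like this in a loop and logging the output to local 
disk (not the shared LUN) might leave some evidence behind even if the 
boxes lock up quickly.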

We've mulled over the possibilities as to what the heck happened.  Did 
concurrent access attempts from 3 servers on a Berkeley DB database on a 
GFS partition blow us up?  Should we have had a 100Mbit switch on the 
locking network instead of the 10Mbit hub?  Separate lock servers?

Any suggestions?

Regards,

Darren Jacobs
University of Toronto

--

Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster