Darren,

You will definitely want to upgrade the lock network switch to at least 100Mbit, and if you have the hardware you should seriously consider adding dedicated lock servers. Your load problems are being caused by lock traffic bottlenecks in your setup.

Britt

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Darren Jacobs
Sent: Monday, July 17, 2006 2:57 AM
To: linux clustering
Subject: BerkeleyDB locking problems with GFS 6.0?

We have a web cluster running on three dual 3GHz processor servers (RHEL 3) attached to a SATA SAN, with a single LUN shared among them using GFS 6.0. We're running lock_gulmd on each of these servers: they're lock servers as well as Apache servers. The locking network that attaches the servers is simply a 10Mbit hub. Network traffic is distributed by a hardware load balancer, not RH cluster.

We suffered a meltdown while doing a trial test of Movable Type (blogging software) on the cluster. We were using a BerkeleyDB backend database housed on the shared LUN. The software was installed on all three servers. Once we fired up Movable Type we noticed that the load average on each of the three servers climbed a bit above normal. On one box in particular we got up to a load average of 8, while the other two boxes were around 2. Everything still moved along OK, but we could see the load on the (8) box inching up. We noted what appeared to be some hung CGI processes associated with Movable Type. They resisted kill commands and couldn't be "kill -9"ed. So we decided to remove the highly loaded box from the cluster. The second we ran the command, the other two boxes' load averages shot to 100. Shortly thereafter they locked up. The boxes locked up so fast we couldn't pull any diagnostic data before they crashed.

I've seen behavior like this before when servers submit multiple I/O requests to a SAN and for some reason they don't return in a timely manner. The outstanding I/Os make the load average climb into the stratosphere. I'm thinking something like this happened here. However, because the servers tanked so quickly, I couldn't find out for certain.

We've mulled over the possibilities as to what the heck happened. Did concurrent access attempts from 3 servers to a BerkeleyDB database on a GFS partition blow us up? Should we have had a 100Mbit switch on the locking network instead of the 10Mbit hub? Separate lock servers? Any suggestions?

Regards,
Darren Jacobs
University of Toronto

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
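
(For reference: in GFS 6.0, the gulm lock servers Britt mentions are named in the CCS cluster.ccs file. The sketch below is only an illustration with placeholder cluster and host names, not the poster's actual configuration; check the exact syntax and parameters against the GFS 6.0 Administrator's Guide before using it.)

    cluster {
        name = "webfarm"                      # placeholder cluster name
        lock_gulm {
            # gulm wants an odd number of lock servers (1, 3, or 5).
            # Listing dedicated hosts here, rather than the Apache nodes,
            # keeps lock traffic off the web servers.
            servers = ["lock1", "lock2", "lock3"]
        }
    }

The three hostnames would be the machines running only lock_gulmd, attached to the (switched) locking network.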