We have a web cluster running on three dual 3GHz processor servers (RHEL
3) attached to a SATA SAN with a single LUN shared among them using GFS
6.0. We're running lock_gulmd on each of these servers: they're
lock servers as well as Apache servers. The locking network that
connects the servers is simply a 10Mbit hub. Network traffic is
distributed by a hardware load balancer, not RH cluster.
We suffered a meltdown while doing a trial run of Movable Type
(blogging software) on the cluster. We were using a Berkeley DB backend
database housed on the shared LUN. The software was installed on all
three servers.
Once we fired up Movable Type we noticed that the load average on each
of the three servers had climbed a bit above normal. On one box in
particular we got up to a load average of 8 while the other two boxes
were around 2. Everything still moved along OK, but we could see the
load on the (8) box inching up. We noted what appeared to be some hung
CGI processes associated with Movable Type. They resisted kill commands
and couldn't even be removed with 'kill -9'.
So we decided to remove the highly loaded box from the cluster. The
second we ran the command, the other two boxes' load averages shot to
100. Shortly thereafter they locked up. The boxes locked up so fast
we couldn't pull any diagnostic data before they crashed.
I've seen behavior like the above when servers submit multiple I/O
requests to a SAN and for some reason they don't return in a timely
manner. The outstanding I/Os make the load average climb into the
stratosphere. I'm thinking something like this happened here. However,
because the servers tanked so quickly I couldn't find out for certain.
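
For what it's worth, here is a rough sketch of how one might check for this
the next time around. On Linux the load average counts processes in
uninterruptible sleep (state "D", usually waiting on I/O) as well as
runnable ones, and such processes ignore signals, kill -9 included, until
the I/O returns. The script below lists processes in state "D" by reading
/proc/<pid>/stat. It is only an illustration: the function names are
made up, and it assumes a Python 3 interpreter is available on the box,
which we have not verified on our servers.

```python
import os

def parse_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split on the LAST ')'.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]

def uninterruptible_pids(proc_root="/proc"):
    """List (pid, command) pairs for processes currently in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        path = os.path.join(proc_root, entry, "stat")
        try:
            with open(path, errors="replace") as f:
                line = f.read()
        except OSError:
            continue  # process exited while we were scanning
        if parse_state(line) == "D":
            comm = line[line.index("(") + 1:line.rindex(")")]
            blocked.append((int(entry), comm))
    return blocked

if __name__ == "__main__":
    for pid, comm in uninterruptible_pids():
        print(pid, comm)
```

If the hung CGI processes had shown up in that list, it would at least
support the theory that they were stuck waiting on I/O (or on a lock)
rather than burning CPU.
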
We've mulled over the possibilities as to what the heck happened. Did
concurrent access attempts from 3 servers on a Berkeley DB database on a
GFS partition blow us up? Should we have had a 100Mbit switch on the
locking network instead of the 10Mbit hub? Separate lock servers?
Any suggestions?
Regards,
Darren Jacobs
University of Toronto
--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster