Keith, this sounds like what might happen if the SAN fabric is not properly configured. The QLogic cards will send a SCSI reset down the bus when activated. If your fabric is open to the point where nodes can see each other, then the other nodes will receive the SCSI reset and their HBAs will go through a LIP reset. This is not good, especially when you have more than a few machines in a cluster. The result is that GFS cannot see the disks for a while, and in most cases that is too long, so nodes end up getting fenced.

If this is indeed the case, I'd suggest making separate zones for each HBA<-->storage device combination. For instance, a cluster with 3 nodes and two HBAs each would end up having 6 zones, each zone consisting of an HBA and the storage it needs to access. They'll all be the same logically, but they'll each be isolated from one another.

Also, there is a problem with the RHEL3-based GFS in that it doesn't seem to play nice with the system with respect to lock space and memory. GFS will in fact hog all the memory it can (for performance reasons) to the point where the system itself cannot fork any processes. The way around this (in U7) is to manage the inoded_purge parameter for each mounted GFS filesystem. inoded_purge is the percentage of locks held by GFS that it will try to purge, thereby releasing that memory back to the system.

It appears that even if the system cannot fork, lock_gulmd can still respond to the other nodes, indicating all is well when in fact it is not. The developers can surely correct me if I am wrong, but that's the way it acts. It seems to me that if the response were handled in a separate thread, this could be avoided, since then lock_gulmd would not be able to respond to the cluster heartbeat subsystem and the node would get fenced. I'm sure there is more to it than that, though.

If I set my inoded_purge numbers to 20 and fire up an rsync, I stay right around 30,000 locks. My systems (with 3GB of RAM) would get into a non-forking state at around 380k locks.
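Since inoded_purge came up: here is roughly how I set it on a mounted filesystem. The mount point below is just a placeholder and I'm going from memory on the exact invocation, so check gfs_tool gettune on your own boxes first.

    # Show the current tunables for a mounted GFS filesystem (placeholder path):
    gfs_tool gettune /gfs/data

    # Ask inoded to try to purge roughly 20% of the locks GFS is holding:
    gfs_tool settune /gfs/data inoded_purge 20

As far as I know these settune values don't survive a remount, so they have to be reapplied (from an init script, for example) after the GFS filesystems come up.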
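To make the zoning suggestion above concrete, here is the sort of layout I mean for that hypothetical 3-node, 2-HBA cluster. The syntax is Brocade-style and the alias and zone names are made up, so treat it only as a sketch and check your own switch's documentation.

    # One zone per HBA<-->storage port pair; six in total for 3 nodes x 2 HBAs.
    zonecreate "node1_hba0_stor", "node1_hba0; storage_port_a"
    zonecreate "node1_hba1_stor", "node1_hba1; storage_port_b"
    zonecreate "node2_hba0_stor", "node2_hba0; storage_port_a"
    # ...and so on for the remaining HBAs.

    # Put the zones into a configuration and enable it:
    cfgcreate "prod_cfg", "node1_hba0_stor; node1_hba1_stor; node2_hba0_stor"
    cfgenable "prod_cfg"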
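As for the cron check you mention at the end of your mail, something along these lines is roughly what I'd try. The mount point, file names and the 60-second limit are all placeholders and I haven't tested it, so treat it as a sketch:

    #!/bin/sh
    # Cron-driven GFS I/O probe (untested sketch; paths and timeout are placeholders).
    MNT=/mnt/gfs
    FLAG=/var/run/gfs_probe_ok
    LIMIT=60

    rm -f $FLAG

    # Child: do a small write to the GFS filesystem, flush it, then drop a
    # flag file on local disk to say the I/O completed.
    ( date > "$MNT/.gfs_probe.`hostname`" && sync && touch $FLAG ) &

    sleep $LIMIT

    # Parent: if the flag never appeared, the probe is stuck (or failed), so
    # take the node down and let the cluster fence it and recover its locks.
    if [ ! -f $FLAG ]; then
        logger "GFS probe did not complete within ${LIMIT}s, shutting down"
        /sbin/shutdown -h now
    fi

Note that any failure of the probe (not just a hang) will take the node down, which may or may not be what you want.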
Hope this helps.

Corey

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Keith Lewis
Sent: Thursday, May 18, 2006 9:27 PM
To: Linux-cluster@xxxxxxxxxx
Subject: GFS lock held for 12 hours

We have a GFS cluster with 12 data nodes and 3 lock servers, running Red Hat AS3 U7 with GFS-6.0.2.30-0.

The data nodes all access a SAN disk. The SAN fabric is divided into two independent halves, called Red and Blue, with half the data nodes on each. The data nodes access only one disk, reachable via either SAN. There are other clients, other clusters and other disks sharing the SAN.

Recently a faulty HBA was plugged into a machine, not part of our cluster, and connected to the Red SAN. At this point the Red SAN failed. There were two main, moderately immediate results:

One of the Red SAN nodes became very busy. Presumably it was holding a fairly big GFS lock at the time, but it continued to hold the lock and to send heartbeats. The node gave the appearance of being hung.

The rest of the Red SAN nodes, over a period of a few minutes, all presumably did some I/O to the disk and presumably got into a busy-wait state, which was so tight that they stopped sending heartbeats and got fenced (APC PDUs). On reboot these nodes could see the SAN as normal, except they could not see their SAN disk. Nor could they see another disk added to the SAN as part of the debugging attempted later.

Many attempts were made to make the disk reappear, mostly by rebooting or by shutting down GFS and rmmod-ing and modprobe-ing qla2300. Everything was quite normal, except that the Red SAN would not let any of our nodes see our disk.

On the Blue SAN all the machines became very busy, presumably because of the one Red SAN machine holding the lock. These nodes were also thought to be hung, but none of them were rebooted, as it was discovered that they were still exporting an important Web tree that was not on a GFS disk. (They sprang back to life when the one lock-holding Red SAN machine was rebooted, which was well after the Red SAN problem was fixed.)

This state of affairs lasted 12 hours. Fixing it was made difficult because, to anyone looking at the problem, it appeared that the entire SAN and the entire cluster were down. Very little that we saw at the time indicated that only the Red SAN had failed. (Hindsight is wonderful.)

This was particularly unfortunate. The justification for installing GFS was resilience in the face of hardware failure (especially no single point of failure).

So finally, here are my questions:

Is it really reasonable for a machine to hang onto a lock for 12 hours?

Would it be possible for a GFS machine to detect that it cannot do I/O to its GFS disk any more and release any locks it holds, perhaps by fencing itself? (I'm thinking of adding a cronjob that forks a subprocess that does an I/O to the GFS disk. The parent could shut down the node, leading to a fence, if the child takes more than a minute.)

Have I made any mistakes in my guesses and presumptions?

Keith Lewis

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster