Re: [Linux-cluster] Some GDLM questions

David Teigland <teigland@xxxxxxxxxx> · Mon, 5 Jul 2004 10:39:47 +0800

> I understand the above but it's still not clear to me how a
> locking application would get fenced. On startup the application
> could check that the cluster member has joined the fence domain.
> This will ensure that it gets fenced if something goes wrong.
> 
> What's not clear is how the fence process will shut down (or
> suspend) the locking application while fencing the node. Fencing
> seems to be related to blocking access to I/O devices.

I'm not entirely sure what you're asking, but I hope a long, broad answer
covers it.

Say there's a two-node cluster of nodes A and B.
Both nodes are running cman, fence, dlm and some application using the dlm.

1. node A: hangs and is unresponsive
2. node B: cman detects that A has failed
3. node B: all cluster services are stopped/suspended
           (these services are fence and dlm in this example)
4. node B: while dlm service is stopped, it blocks all lock requests
5. node B: cluster still has quorum because of special "two_node" config
6. node B: fence service is started/enabled
7. node B: fence service fences node A
8. node B: dlm service is started/enabled
9. node B: dlm service recovers the application's lockspace and
           lock requests proceed as usual
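
The sequence above can be sketched as a toy model in Python.  This is only
an illustration of the ordering, not the real cman/fence/dlm code: the class
NodeB and its methods are hypothetical names, "blocked" lock requests are
modelled as a queue, and the quorum rule for the non-two_node case is assumed
to be a simple majority of the two original nodes.

```python
class NodeB:
    """Toy model of the surviving node B (hypothetical, for illustration)."""

    ALL_NODES = {"A", "B"}

    def __init__(self, two_node=True):
        self.two_node = two_node          # models the special "two_node" config
        self.members = set(self.ALL_NODES)
        self.dlm_running = True
        self.fenced = set()
        self.pending = []                 # lock requests blocked during recovery
        self.granted = []

    def has_quorum(self):
        # Step 5: with two_node set, one surviving node is enough.
        # Otherwise assume a plain majority of the original nodes is needed.
        needed = 1 if self.two_node else len(self.ALL_NODES) // 2 + 1
        return len(self.members) >= needed

    def request_lock(self, name):
        # Step 4: while the dlm service is stopped, requests block.
        if self.dlm_running:
            self.granted.append(name)
            return True
        self.pending.append(name)
        return False

    def stop_services(self, failed):
        # Steps 2-3: the failure is detected and services are suspended.
        self.members.discard(failed)
        self.dlm_running = False

    def recover(self, failed):
        # Steps 5-9: quorum check, fence first, then restart the dlm.
        if not self.has_quorum():
            return False                  # services stay stopped
        self.fenced.add(failed)           # steps 6-7
        self.dlm_running = True           # step 8
        self.granted.extend(self.pending) # step 9: blocked requests proceed
        self.pending.clear()
        return True


b = NodeB()
b.request_lock("r1")      # granted immediately
b.stop_services("A")      # A has hung
b.request_lock("r2")      # blocks until recovery finishes
b.recover("A")            # fence A, then restart dlm; r2 is now granted
```

The point of the ordering is that the dlm does not hand out locks again
until the fence service has dealt with node A.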

If the fencing method in step 7 only blocks access to i/o devices from node A,
node A could potentially "revive" and continue running.  The dlm on node B no
longer accepts A as a member of the lockspace so any dlm messages from A will
be ignored by B.
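
As a rough sketch of that last point (again hypothetical Python, not the
dlm's actual message handling): B keeps a set of current lockspace members
and drops anything arriving from a node that is not in it.  The function
name deliver and the message strings are made up for the example.

```python
def deliver(lockspace_members, sender, message, inbox):
    """Accept a message only if the sender is still a lockspace member."""
    if sender not in lockspace_members:
        return False                      # message from a removed node is ignored
    inbox.append((sender, message))
    return True


members = {"A", "B"}
inbox = []
deliver(members, "A", "lock request", inbox)   # accepted, A is a member

members.discard("A")                           # A removed after recovery
deliver(members, "A", "lock request", inbox)   # ignored, A has "revived"
```

Note this only protects the dlm's own state on B.  It does nothing about
whatever else a revived node A might do outside the dlm.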

Depending on the application this may not be sufficient to prevent a revived
node A from causing problems.  If so, the simplest thing is to use a fencing
method that resets the power on node A rather than simply blocking its device
i/o.

-- 
Dave Teigland  <teigland@xxxxxxxxxx>