Hi,
this is a follow-up to my previous mail, "Node fenced when mounting
gfs", in which mounting a particular GFS volume with lots of files can
cause a node (node2) to appear to hang, and thus be fenced by the other
node (node1).
Searching the archive, I found a relevant thread, "node kicked out of
cluster", in which Patrick Caulfield comments:
"DLM can hog the CPU when recovering huge numbers of locks, so we are
looking into placing some strategic "schedule()" calls in the recovery
process."
This seems to be what is happening in my case, since top shows nearly
100% system time. BTW, my system is a dual Xeon box.
In another thread, "Configuring CMAN timer/timeout values", I found
a possible workaround: modifying the values under
/proc/cluster/config/cman/.
Increasing deadnode_timeout (I tried 2100) prevents node2 from being
fenced, but now node1's performance drops significantly even when its
CPU load is very low (e.g. other servers mounting NFS from node2 keep
getting NFS timeout errors). Am I right to assume that GFS requires
locking on both nodes during writes? If it does, this makes sense, since
node2 is too busy "scanning log elements" to respond to anything. After
over 30 minutes node2 still hadn't finished "scanning log elements", so
I changed /proc/cluster/config/cman/deadnode_timeout on node1 back to
its default value (21), and node2 was fenced automatically.
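
For reference, the timeout change described above could be scripted
roughly as follows. This is only a sketch: the path and the values 2100
and 21 come from this message, the function names are made up for
illustration, and it assumes the /proc entry accepts a plain integer
written as text.

```python
# Sketch only: path taken from this message; it exists only on nodes
# running a CMAN version that exposes its settings under /proc.
DEADNODE_TIMEOUT = "/proc/cluster/config/cman/deadnode_timeout"


def read_deadnode_timeout(path=DEADNODE_TIMEOUT):
    """Return the current timeout (in seconds) as an integer."""
    with open(path) as f:
        return int(f.read().strip())


def set_deadnode_timeout(seconds, path=DEADNODE_TIMEOUT):
    """Write a new timeout and return the previous value.

    Returning the old value makes it easy to restore the setting later.
    Assumes the file accepts a decimal integer written as text.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    previous = read_deadnode_timeout(path)
    with open(path, "w") as f:
        f.write(f"{seconds}\n")
    return previous
```

On node1 this would be called (as root) as set_deadnode_timeout(2100)
before node2 mounts the volume, and again with the returned value to go
back to the previous setting.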
So the questions are:
- Is it normal for "scanning log elements" to take over 30
minutes?
- Is there a way to make "scanning log elements" run at a lower
priority (e.g. lowering the priority of DLM when it's recovering locks)?
Regards,
Fajar
--
Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster