Re: Again DLM messages and high load

Bas van der Vlies <basv@xxxxxxx> · Wed, 25 Oct 2006 13:14:02 +0200

Riaan van Niekerk wrote:

Bas van der Vlies wrote:
We are using:
 GFS         : CVS 1.0.3 stable
 kernel      : 2.6.17.11-sara1
 NFS-daemons : 128
 GFS-servers   : 5

This node was the master. When these messages appeared, the load rose 
to the number of NFS daemons and NFS stopped working. We had to reboot 
the node:
 Oct 25 03:12:31 ifs4 kernel: dlm: lisa_vg5_lv1: cancel reply ret 0
 Oct 25 03:12:31 ifs4 kernel: lock_dlm: unlock sb_status 0 2,a45325d flags 0
 Oct 25 03:12:31 ifs4 kernel: dlm: lisa_vg5_lv1: process_lockqueue_reply id a50a027c state 0
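To spot these messages on the other nodes, a small scan of the syslog file can help. The following is only a sketch: the patterns are copied from the three messages quoted above, the function name `find_dlm_messages` is made up for illustration, and it assumes each kernel message sits on a single line in the log file (any wrapping in this mail comes from the mail client). Other GFS/DLM versions may word these messages differently.

```python
import re

# Patterns derived from the three kernel messages quoted in this thread.
# Assumption: other DLM/GFS versions may use different wording.
DLM_PATTERNS = [
    re.compile(r"dlm: (?P<lockspace>\S+): cancel reply ret (?P<ret>-?\d+)"),
    re.compile(r"lock_dlm: unlock sb_status (?P<status>-?\d+)"),
    re.compile(
        r"dlm: (?P<lockspace>\S+): process_lockqueue_reply "
        r"id (?P<id>[0-9a-f]+) state (?P<state>\d+)"
    ),
]


def find_dlm_messages(lines):
    """Return the log lines that match any of the patterns above."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in DLM_PATTERNS)
    ]
```

It can be fed an open file, for example `find_dlm_messages(open("/var/log/messages"))`; the log path is an assumption and depends on the syslog configuration.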

I had to reboot the master node (ifs4). When it went down, the other 
nodes elected another master. Three nodes use the same master and 
one node has a different one. Is this OK?

nodes 1, 2, 4:
Fence Domain:    "default" 1   2 run       - [5 2 1 3 4]

node 3:
Fence Domain:    "default" 1   2 run       - [2 5 1 3 4]

cman_tool nodes is the same for all nodes.
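As a quick check of what actually differs between the two "Fence Domain" lines quoted above, the bracketed lists can be compared as sets and as ordered lists. This is a minimal sketch; `fence_members` is a hypothetical helper, and the comparison says nothing about which node is master, only whether the lists hold the same node IDs.

```python
import re


def fence_members(line):
    """Extract the node IDs from the bracketed list of a 'Fence Domain' line."""
    match = re.search(r"\[([\d\s]+)\]", line)
    if match is None:
        raise ValueError("no bracketed member list found")
    return [int(n) for n in match.group(1).split()]


# The two lines exactly as quoted in this thread.
nodes_1_2_4 = 'Fence Domain:    "default" 1   2 run       - [5 2 1 3 4]'
node_3 = 'Fence Domain:    "default" 1   2 run       - [2 5 1 3 4]'

a = fence_members(nodes_1_2_4)
b = fence_members(node_3)

same_members = sorted(a) == sorted(b)  # same node IDs, ignoring order
same_order = a == b                    # same node IDs in the same order

print(same_members, same_order)  # prints: True False
```

So the lists contain the same five node IDs and only the order of the first two entries differs.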

Regards

good day Bas

We had the EXACT same symptom (load average rising to number of NFSDs, 
NFS then becomes unresponsive - these processes actually become 
defunct), happening about 2x a week.

We have had a service request open with Red Hat for the past 3 months. 
Our biggest problem was capturing the SysRq-T output, which we could 
not provide: the problem always surfaced so quickly that, this being a 
production server, our priority was getting the service back up rather 
than capturing debugging info. We therefore could not take the issue 
further.
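One way to get the SysRq-T output without anyone having to be at the console is to let a small watchdog request the task dump as soon as the load crosses a threshold. The sketch below assumes a Linux kernel that provides /proc/sysrq-trigger and a process running as root; the function names, the default threshold of 128 (the NFS daemon count mentioned earlier in this thread) and the polling interval are illustrative assumptions, not something either site is known to run.

```python
import time


def read_load1(path="/proc/loadavg"):
    """Return the 1-minute load average from a /proc/loadavg-style file."""
    with open(path) as f:
        return float(f.read().split()[0])


def trigger_task_dump(path="/proc/sysrq-trigger"):
    """Request a SysRq-T task dump; the kernel writes it to the kernel log."""
    with open(path, "w") as f:
        f.write("t")


def watch(threshold=128.0, interval=10, loadavg="/proc/loadavg",
          trigger="/proc/sysrq-trigger", max_checks=None):
    """Poll the load and request one task dump when it reaches the threshold.

    Returns True if a dump was requested, False if max_checks polls
    passed without the load reaching the threshold.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if read_load1(loadavg) >= threshold:
            trigger_task_dump(trigger)
            return True
        checks += 1
        time.sleep(interval)
    return False
```

Started as `watch(threshold=128)` in the background, it requests the dump once and exits. The dump itself still has to be collected from the kernel log (syslog or a serial console) before the node is rebooted.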

We still had this problem with the DLM/GFS kernel modules accompanying 
kernel 2.6.9-42.0.EL.

A week and a half ago we loaded the DLM/GFS kernel modules accompanying 
kernel 2.6.9-42.0.2.EL:
GFS-kernel-smp-2.6.9-60.1
dlm-kernel-smp-2.6.9-44.2
Since then we have not seen this symptom or two other problem symptoms.

The bugzilla entry we were tracking (some assertion failures):
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=199673
It was the only significant change between the kernel modules for 42.EL 
and 42.0.2.EL.
The DLM errata is in http://rhn.redhat.com/errata/RHBA-2006-0702.html

I am not sure how these versions map to the CVS versions, or if our NFSD 
problem is indeed solved. However, it has never stayed away this long.

If our NFS problem does occur again, I will let you know.

greetings
Riaan

Riaan,

Thanks for the info. We also had this problem several times a week 
with the previous versions. Now we use the latest version from CVS 
STABLE and hit this bug again; the uptime was 50 days ;-)

Regards

--
********************************************************************
*                                                                  *
*  Bas van der Vlies                     e-mail: basv@xxxxxxx      *
*  SARA - Academic Computing Services    phone:  +31 20 592 8012   *
*  Kruislaan 415                         fax:    +31 20 6683167    *
*  1098 SJ Amsterdam                                               *
*                                                                  *
********************************************************************

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster