Some progress...
We had another dlm_sendd lockup yesterday, which prompted us to rework
our file sharing. Previously we had both SMB and NFS services
competing for GFS resources on this particular node. We suspected
that this combination was provoking the lockups... so we moved
things around with the help of another server in our GFS cluster.
Previously we had:
Machine A (NFS and SMB services sitting on top of GFS)
    NFS   SMB
      GFS
And switched things around to this:
Machine A
    SMB
    NFS -> Machine B
Machine B
    NFS
    GFS
Basically, we moved all NFS serving to Machine B, where NFS is now the
only file sharing service using GFS, and changed Machine A to reach
that data through an NFS mount from Machine B. This way we don't have
any nodes with both SMB and NFS services running on top of GFS.
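For anyone wanting to double-check a layout like this on their own nodes, here is a rough Python sketch that reads the contents of /proc/mounts and reports which mount points are GFS and which are NFS. The function names are made up for illustration, and the filesystem type strings ("gfs", "gfs2", "nfs", "nfs4") are assumptions about what your kernel reports; adjust them if yours differ.

```python
# Sketch only: summarise GFS and NFS mounts from /proc/mounts content.
# Function names are hypothetical; the fstype strings are assumptions.

def mounts_by_type(mounts_text, wanted=("gfs", "gfs2", "nfs", "nfs4")):
    """Return {fstype: [mount points]} for the filesystem types in `wanted`.

    Each line of /proc/mounts has whitespace-separated fields:
    device, mount point, fstype, options, dump, pass.
    """
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fstype = fields[1], fields[2]
        if fstype in wanted:
            found.setdefault(fstype, []).append(mount_point)
    return found


def mounts_gfs_directly(mounts_text):
    """True if the node has a GFS filesystem mounted locally."""
    found = mounts_by_type(mounts_text)
    return bool(found.get("gfs") or found.get("gfs2"))
```

On a live node you would pass in open("/proc/mounts").read(). In the layout above, Machine B should come back True from mounts_gfs_directly() and Machine A should only show an nfs entry for the shared data.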
Previously we had 1-2 lockups a day, but today nothing... so far so
good. Not sure if this configuration will work for you... let me
know if you need any further clarification.
d
On 1-Apr-08, at 5:51 PM, Andrew A. Neuschwander wrote:
My symptoms are similar. dlm_send sits on all of the cpu. Top shows
the cpu spending nearly all of its time in sys or interrupt handling.
Disk and network I/O isn't very high (as seen via iostat and iptraf).
But SMB/NFS throughput and latency are horrible. Context switches per
second as seen by vmstat are in the 20,000+ range (I don't know if this
is high though, I haven't really paid attention to this in the past).
Nothing crashes, and it is still able to serve data (very slowly), and
eventually the load and latency recover.
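If you want to log these two numbers over time rather than watching top and vmstat by hand, a minimal Python sketch along these lines may help. It takes two snapshots of /proc/stat and works out the context switches per second and the share of CPU time spent in system, irq and softirq. The function names are invented for this example, and the field order of the "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal) is assumed from the proc man page.

```python
import time

# Sketch only: helper names are hypothetical, field layout assumed from proc(5).

def read_ctxt(stat_text):
    """Return the cumulative context-switch counter from /proc/stat content."""
    for line in stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")


def cpu_fields(stat_text):
    """Return the aggregate 'cpu' line counters as a list of ints."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):  # aggregate line, not cpu0, cpu1, ...
            return [int(x) for x in line.split()[1:]]
    raise ValueError("no aggregate cpu line found")


def ctxt_per_second(before_text, after_text, interval_s):
    """Context switches per second between two snapshots."""
    return (read_ctxt(after_text) - read_ctxt(before_text)) / interval_s


def kernel_share(before_text, after_text):
    """Fraction of CPU time spent in system + irq + softirq between snapshots."""
    before, after = cpu_fields(before_text), cpu_fields(after_text)
    # Only the first 8 fields (user .. steal) are summed; older kernels have fewer.
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    if total == 0:
        return 0.0
    kernel = deltas[2] + sum(deltas[5:7])  # system, irq, softirq
    return kernel / total


def sample(interval_s=5.0, path="/proc/stat"):
    """Take two live snapshots and return (ctxt/s, kernel share)."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval_s)
    with open(path) as f:
        after = f.read()
    return ctxt_per_second(before, after, interval_s), kernel_share(before, after)
```

Calling sample() from a loop and writing the result to a file gives a baseline from quiet periods, which is one way to judge whether a figure like 20,000+ is unusual for a given box.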
As an aside, does anyone know how to _view_ the resource group size
after file system creation on GFS?
Thanks,
-Andrew
On Tue, April 1, 2008 6:30 pm, David Ayre wrote:
What do you mean by pounded, exactly?
We have a similar ongoing issue... when about a dozen users are
using both smb/nfs, at some seemingly random point in time our
dlm_sendd chews up 100% of the CPU... then dies down on its own
after quite a while. Killing SMB processes and shutting down SMB
didn't seem to have any effect... only a reboot cures it. I've seen
this described (if this is the same issue) as a "soft lockup", as it
does seem to come back to life:
http://lkml.org/lkml/2007/10/4/137
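To pin down how much CPU time the dlm send thread has actually consumed, one option is to read the utime and stime counters from /proc/<pid>/stat. The Python below is only a sketch: the helper names are made up, the "dlm_send" prefix is taken from the thread names mentioned in this thread (dlm_sendd, dlm_send), and the field positions (utime is field 14, stime is field 15) are assumed from the proc man page.

```python
import os

# Sketch only: helper names are hypothetical, field positions assumed from proc(5).

def parse_stat(line):
    """Parse one /proc/<pid>/stat line into (pid, comm, utime, stime).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first '(' and the last ')'.
    """
    lpar = line.index("(")
    rpar = line.rindex(")")
    pid = int(line[:lpar])
    comm = line[lpar + 1:rpar]
    rest = line[rpar + 1:].split()  # rest[0] is field 3 (state)
    utime, stime = int(rest[11]), int(rest[12])  # fields 14 and 15, in clock ticks
    return pid, comm, utime, stime


def dlm_send_threads(proc="/proc", prefix="dlm_send"):
    """Return parsed stat tuples for processes whose name starts with `prefix`."""
    results = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                parsed = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line was not in the expected shape
        if parsed[1].startswith(prefix):
            results.append(parsed)
    return results
```

The counters are in clock ticks, so divide by os.sysconf("SC_CLK_TCK") to get seconds. Sampling dlm_send_threads() twice, a few seconds apart, shows how fast stime is growing during a lockup.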
We've been assuming it's a kernel/dlm version issue, as we are running
2.6.9-55.0.6.ELsmp with dlm-kernel 2.6.9-46.16.0.8.
We were going to try a kernel update this week... but you seem to be
using a later version and still have this problem?
Could you elaborate on "getting pounded by dlm"? I've posted about
this on this list in the past but received no assistance.
On 1-Apr-08, at 5:19 PM, Andrew A. Neuschwander wrote:
I have a GFS cluster with one node serving files via smb and nfs.
Under fairly light usage (5-10 users) the cpu is getting pounded by
dlm. I am using CentOS 5.1 with the included kernel
(2.6.18-53.1.14.el5). This sounds like the dlm issue mentioned back in
March of last year
(https://www.redhat.com/archives/linux-cluster/2007-March/msg00068.html)
that was resolved in 2.6.21.
Has this fix been (or will it be) backported to the current el5
kernel? Will it be in RHEL 5.2? What is the easiest way for me to get
this fix?
Also, if I try a newer kernel on this node, will there be any harm in
the other nodes using their current kernel?
Thanks,
-Andrew
--
Andrew A. Neuschwander, RHCE
Linux Systems Administrator
Numerical Terradynamic Simulation Group
College of Forestry and Conservation
The University of Montana
http://www.ntsg.umt.edu
andrew@xxxxxxxxxxxx - 406.243.6310
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~_~
David Ayre
Programmer/Analyst - Information Technology Services
Emily Carr Institute of Art and Design
Vancouver, B.C. Canada
604-844-3875 / david@xxxxxxxx