Re: 32 nodes limit?

Marc Aurele La France <tsi@xxxxxxxxxxx> · Tue, 23 Sep 2008 10:01:07 -0600 (Mountain Daylight Time)

On Mon, 15 Sep 2008, Christine Caulfield wrote:
Marc Aurele La France wrote:
I'm trying to move a 5TB filespace from NFS to GFS2.  I have a P4 (the
current NFS server) and 33 Opteron nodes, all running a stock 2.6.22
kernel, OpenAIS 0.80.3, and a 2.00.00 cluster suite.  For now, I've
dummied out fencing and set expected_votes to 1.  I can start/stop cman
on all nodes no problem.  With cman running on every node, I've formatted,
mounted and populated the filesystem using the P4.  Proceeding through
the Opterons to mount the filesystem succeeds until the 32nd node, at
which point mount.gfs2 hangs (in "D" according to `ps ax`).  Going back,
the first 16 systems that have mounted the filesystem can still `ls` the
top level directory, but attempts to do so on the remaining systems also
get stuck in "D".  Any attempt to unmount the filesystem throws the
entire setup in "D".
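
[Editor's aside: the "D" that `ps ax` reports is the uninterruptible-sleep state, and on Linux the same state letter is exposed in /proc/<pid>/stat. The following is a minimal Python sketch, not part of the original message, for listing every process stuck in "D" on a node so that hung mount.gfs2 or ls processes can be spotted quickly. The function names are hypothetical, and it assumes a Linux /proc filesystem.]

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split on the LAST ')' rather than on
    # whitespace.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def processes_in_state(wanted="D", proc_root="/proc"):
    """List (pid, comm) for every process whose state letter matches."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited (or entry unreadable) mid-scan
        if state == wanted:
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in processes_in_state("D"):
        print(pid, comm)
```

[Run on each node, this prints one line per process in "D". It only reports the state; it does not say which lock or I/O the process is waiting on.]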

Due to various considerations, moving to more recent versions is not the
preferred option at this point.  Hence my question.

CMAN/openais in RHEL5 seems to be happy up to around 48 nodes (again
this is not a QE figure, it's something we have tested in development
only) with appropriate tuning. If you are seeing problems then it might
be helpful to adjust some of the times used in the openais totem
protocol. man openais.conf will tell you something about them. Before
doing this though it's worth checking the output of "group_tool" command
and syslog to see if there are any openais or other daemon errors that
might be causing your problems. If necessary post them to this list.

It's also worth mentioning that 2.00.00 has had a considerable number of
bugfixes applied since it was released and the current version is
2.03.07. I do strongly recommend you upgrade to this version even though
you say it is not "the preferred option at this point".

I hope this helps,

It most certainly does.  Thanks for the hint.  It turns out I had 
neglected to copy over my openais configuration from a test cluster. 
Everything seems to work now.
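
[Editor's aside: a forgotten configuration copy like this can be caught by comparing the totem settings of two configuration files. The following Python sketch is not from the original thread. It assumes the files use the brace-delimited "key: value" layout described in openais.conf(5), with "#" starting a comment. The function names and sample values are hypothetical, and the thread does not say which settings actually differed.]

```python
def totem_settings(conf_text):
    """Extract top-level 'key: value' pairs from the totem { ... } section."""
    settings = {}
    depth = 0
    in_totem = False
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (assumed '#')
        if not line:
            continue
        if line.endswith("{"):
            depth += 1
            if depth == 1 and line[:-1].strip() == "totem":
                in_totem = True
            continue
        if line == "}":
            if depth == 1:
                in_totem = False
            depth -= 1
            continue
        # Only keep keys directly inside totem { }, not nested sub-blocks.
        if in_totem and depth == 1 and ":" in line:
            key, value = line.split(":", 1)
            settings[key.strip()] = value.strip()
    return settings


def diff_settings(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Hypothetical sample contents, for illustration only.
    test_cluster = "totem {\n    version: 2\n    token: 10000\n}\n"
    production = "totem {\n    version: 2\n}\n"
    changes = diff_settings(totem_settings(test_cluster),
                            totem_settings(production))
    for key, (old, new) in changes.items():
        print(key, old, new)
```

[A key present in only one file shows up with None on the other side, which is the case that matters when a tuned setting was never copied over.]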

FWIW, upgrading to L&G versions is not the preferred option at this point 
primarily due to the PITFA the kernel invariably creates with its 
incompatible changes to internal APIs.  I have a number of external 
additions to deal with, and not all of them are likely to have been ported 
to the latest kernels.

Anyway, sorry for the noise, but thanks for your time.  Much appreciated.

Marc.

+----------------------------------+----------------------------------+
|  Marc Aurele La France           |  work:   1-780-492-9310          |
|  Academic Information and        |  fax:    1-780-492-1729          |
|    Communications Technologies   |  email:  tsi@xxxxxxxxxxx         |
|  352 General Services Building   +----------------------------------+
|  University of Alberta           |                                  |
|  Edmonton, Alberta               |    Standard disclaimers apply    |
|  T6G 2H1                         |                                  |
|  CANADA                          |                                  |
+----------------------------------+----------------------------------+

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster