Re: RHEL3 Cluster network hangup

Gunther Schlegel <schlegel@xxxxxxxxx> · Fri, 08 Jul 2005 08:27:13 +0200

Lon,

> I doubt it, but it's not out of the realm of possibility.  The cluster
> software does three things mostly:
>
> (a) figures out who's online
> (b) shoots nodes
> (c) manages services using shell scripts
>
> The shell scripts call standard utilities (ifconfig, route, etc.).

In theory you are right (and you probably know your software).
But what about this:

The software we run has "background job managers" (BCJ), which are started
by the script I wrote for the cluster. When I run lsof on such a
BCJ process, it looks like this:

[root@tim root]# lsof -p 22993
COMMAND   PID USER   FD   TYPE DEVICE    SIZE     NODE NAME
plb     22993  rsi  cwd    DIR   8,19   24576  2949121 /opt/rsi/de/ham/data
plb     22993  rsi  rtd    DIR   8,10    4096        2 /
plb     22993  rsi  txt    REG   8,19 1043972 10944608 /opt/rsi/plb90f/plb
plb     22993  rsi  mem    REG   8,10 1571824   102450 /lib/tls/libc-2.3.2.so
plb     22993  rsi  mem    REG   8,10   14868    74214 /lib/libdl-2.3.2.so
plb     22993  rsi  mem    REG   8,10   97712   104282 /lib/tls/libpthread-0.60.so
plb     22993  rsi  mem    REG   8,10   23388    73133 /lib/libcrypt-2.3.2.so
plb     22993  rsi  mem    REG    8,6   52584   637864 /usr/lib/libz.so.1.1.4
plb     22993  rsi  mem    REG   8,10  213508   104281 /lib/tls/libm-2.3.2.so
plb     22993  rsi  mem    REG   8,10  106912    73243 /lib/ld-2.3.2.so
plb     22993  rsi    0r   CHR    1,3            60122 /dev/null
plb     22993  rsi    1w   CHR    1,3            60122 /dev/null
plb     22993  rsi    2w   CHR    1,3            60122 /dev/null
plb     22993  rsi    3uw  REG   8,19       0 10944633 /opt/rsi/plb90f/.^A^A^A^A^A^A^A^B
plb     22993  rsi    4u   REG   8,19    2048  2965515 /opt/rsi/de/ham/data/cook.isi
plb     22993  rsi    5u   REG   8,19    4509  2949322 /opt/rsi/de/ham/data/cook.txt
plb     22993  rsi    6u   REG   8,19    2304  3134942 /opt/rsi/de/ham/scra/bcjmgr301100000009640.par
plb     22993  rsi    7u   REG   8,19    2048   868402 /opt/rsi/de/data/bct.isi
plb     22993  rsi    8u   REG   8,19    1512   869372 /opt/rsi/de/data/bct.txt
... data files shortened ...

If the BCJ process has been started by the cluster, lsof also lists
IP sockets for it. The application is so old that it has no
clue about IP, so it will not open IP sockets itself.
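
One way to compare a job manager started by hand with one started by the
cluster is to count the socket descriptors each process holds. The following
is only a minimal sketch under some assumptions: it relies on the Linux /proc
filesystem, the helper name count_sockets is made up for illustration, and it
counts all sockets (Unix domain ones included), not only IP sockets.

```sh
#!/bin/sh
# Sketch: count socket descriptors held by a process, using Linux /proc.
# count_sockets is a hypothetical helper, not part of clumanager or lsof.
count_sockets() {
    pid=$1
    count=0
    for fd in /proc/"$pid"/fd/*; do
        # Each entry is a symlink; sockets read as "socket:[inode]".
        # Entries that cannot be read (or a non-matching glob) are skipped.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            socket:*) count=$((count + 1)) ;;
        esac
    done
    echo "$count"
}

# Default to the current shell when no PID is given.
count_sockets "${1:-$$}"
```

To narrow the listing to internet sockets only, "lsof -a -i -p PID" should do:
-i selects internet files, -p selects the process, and -a ANDs the two
conditions together. A non-zero result for a process that never opens sockets
itself would suggest the descriptors were inherited from whatever started it.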

I would have added another lsof output, but I disabled the cluster
software at the customer's request.

Another weird fact is that the application's index files get corrupted if I
use ext3; ext2 is fine. The application is technically old-fashioned and
stores its data in hundreds of text files with external index files and
concurrent access. Without the cluster, ext3 is fine as well. (The same goes
for LVM, by the way.)

> Now -- here's the thing.  Earlier versions of clumanager (<1.2.22) had a

I have been running 1.2.22.

> status return and restart on the same node.  Also, the most recent
> errata fixed a signal handling problem which broke JVMs from running
> under it.  Either of these may have caused the problems on your cluster,
> I don't know.  The former would have associated log messages; the latter
> wouldn't.

There have not been any log messages.

> I'd try the latest release from RHN (clumanager-1.2.26.1-1).

Hmm, I will probably not start up the cluster again... :(

> If that doesn't work, I'd call Red Hat Support...

While calling support is always an option, I am pretty sure that it
will not lead to a solution. In the end they will not be able to
reproduce it, and I can't test on a customer's production system.

Do not point me to test systems -- they are there, but they do not have 
the problem. Seems to be related to the workload of the machine, which 
is hard to simulate.

regards, Gunther