Lon,
I doubt it, but it's not out of the realm of possibility. The cluster
software does three things mostly:
(a) figures out who's online
(b) shoots nodes
(c) manages services using shell scripts
The shell scripts call standard utilities (ifconfig, route, etc.).
In theory you are right (and you probably know your software).
But what about this:
The software we run has "background job managers" (BCJ), which are
started by the script I wrote for the cluster. When I run lsof on such a
BCJ process, it looks like this:
[root@tim root]# lsof -p 22993
COMMAND   PID USER   FD TYPE DEVICE    SIZE     NODE NAME
plb     22993 rsi   cwd  DIR   8,19   24576  2949121 /opt/rsi/de/ham/data
plb     22993 rsi   rtd  DIR   8,10    4096        2 /
plb     22993 rsi   txt  REG   8,19 1043972 10944608 /opt/rsi/plb90f/plb
plb     22993 rsi   mem  REG   8,10 1571824   102450 /lib/tls/libc-2.3.2.so
plb     22993 rsi   mem  REG   8,10   14868    74214 /lib/libdl-2.3.2.so
plb     22993 rsi   mem  REG   8,10   97712   104282 /lib/tls/libpthread-0.60.so
plb     22993 rsi   mem  REG   8,10   23388    73133 /lib/libcrypt-2.3.2.so
plb     22993 rsi   mem  REG    8,6   52584   637864 /usr/lib/libz.so.1.1.4
plb     22993 rsi   mem  REG   8,10  213508   104281 /lib/tls/libm-2.3.2.so
plb     22993 rsi   mem  REG   8,10  106912    73243 /lib/ld-2.3.2.so
plb     22993 rsi    0r  CHR    1,3            60122 /dev/null
plb     22993 rsi    1w  CHR    1,3            60122 /dev/null
plb     22993 rsi    2w  CHR    1,3            60122 /dev/null
plb     22993 rsi   3uw  REG   8,19       0 10944633 /opt/rsi/plb90f/.^A^A^A^A^A^A^A^B
plb     22993 rsi    4u  REG   8,19    2048  2965515 /opt/rsi/de/ham/data/cook.isi
plb     22993 rsi    5u  REG   8,19    4509  2949322 /opt/rsi/de/ham/data/cook.txt
plb     22993 rsi    6u  REG   8,19    2304  3134942 /opt/rsi/de/ham/scra/bcjmgr301100000009640.par
plb     22993 rsi    7u  REG   8,19    2048   868402 /opt/rsi/de/data/bct.isi
plb     22993 rsi    8u  REG   8,19    1512   869372 /opt/rsi/de/data/bct.txt
... data files shortened ...
If the BCJ process has been started by the cluster, lsof also lists IP
sockets for it. The application is so old that it has no clue about IP,
so it would not open IP sockets itself.
I would have added another lsof output, but I disabled the cluster
software at the customer's request.
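Since I cannot show the lsof output any more, here is roughly how the same
check could be scripted on Linux. Sockets that a process did not open itself
have usually been inherited from its parent across fork/exec, so looking for
socket descriptors in /proc/<pid>/fd should show whether that is what happens
here. This is only a sketch: socket_fds is a helper name I made up, it relies
on the /proc filesystem, and it must run as root or as the user owning the
process.

```python
import os


def socket_fds(pid):
    """Return the sorted descriptor numbers of `pid` that refer to sockets.

    Each entry in /proc/<pid>/fd is a symlink; for sockets the link
    target has the form "socket:[inode]".
    """
    fd_dir = f"/proc/{pid}/fd"
    found = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("socket:"):
            found.append(int(name))
    return sorted(found)
```

Calling socket_fds(22993) on the process above would be expected to return an
empty list when it was started by hand, and a non-empty list if it inherits
sockets when started through the cluster script.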
Another weird fact is that the application's index files get corrupted if
I use ext3; ext2 is fine. The application is technically old-fashioned and
stores its data in hundreds of text files with external index files and
concurrent access. Without the cluster, ext3 is fine as well. (The same
goes for LVM, by the way.)
Now -- here's the thing. Earlier versions of clumanager (<1.2.22) had a
I have been running 1.2.22.
status return and restart on the same node. Also, the most recent
errata fixed a signal handling problem which broke JVMs from running
under it. Either of these may have caused the problems on your cluster,
I don't know. The former would have associated log messages; the latter
wouldn't.
There have not been any log messages.
I'd try the latest release from RHN (clumanager-1.2.26.1-1).
Hmm, I will probably not start up the cluster again... :(
If that doesn't work, I'd call Red Hat Support...
While calling support is always an option, I am pretty sure that it
will not lead to a solution. In the end they will not be able to
reproduce it, and I can't test on a customer's production system.
Do not point me to test systems -- they are there, but they do not have
the problem. It seems to be related to the workload of the machine, which
is hard to simulate.
regards, Gunther
begin:vcard
fn:Gunther Schlegel
n:Schlegel;Gunther
org:Riege Software International GmbH;IT Infrastructure
adr:;;Mollsfels 10;Meerbusch;;40670;Germany
email;internet:schlegel@xxxxxxxxx
title:Manager IT Infrastructure
tel;work:+49-2159-91480
tel;fax:+49-2159-9148-11
x-mozilla-html:FALSE
url:http://riege.com
version:2.1
end:vcard