Re: RHEL3 Cluster network hangup


 



Lon,

I doubt it, but it's not out of the realm of possibility.  The cluster
software does three things mostly:

(a) figures out who's online
(b) shoots nodes
(c) manages services using shell scripts

The shell scripts call standard utilities (ifconfig, route, etc.).

In theory you are right (and you probably know your software). But what about this:

The software we run has "background job managers", which are started by the script I wrote for the cluster. When I run lsof on such a BCJ process, it looks like this:

[root@tim root]# lsof -p 22993
COMMAND   PID USER   FD   TYPE DEVICE    SIZE     NODE NAME
plb     22993  rsi  cwd    DIR   8,19   24576  2949121 /opt/rsi/de/ham/data
plb     22993  rsi  rtd    DIR   8,10    4096        2 /
plb     22993  rsi  txt    REG   8,19 1043972 10944608 /opt/rsi/plb90f/plb
plb     22993  rsi  mem    REG   8,10 1571824   102450 /lib/tls/libc-2.3.2.so
plb     22993  rsi  mem    REG   8,10   14868    74214 /lib/libdl-2.3.2.so
plb     22993  rsi  mem    REG   8,10   97712   104282 /lib/tls/libpthread-0.60.so
plb     22993  rsi  mem    REG   8,10   23388    73133 /lib/libcrypt-2.3.2.so
plb     22993  rsi  mem    REG    8,6   52584   637864 /usr/lib/libz.so.1.1.4
plb     22993  rsi  mem    REG   8,10  213508   104281 /lib/tls/libm-2.3.2.so
plb     22993  rsi  mem    REG   8,10  106912    73243 /lib/ld-2.3.2.so
plb     22993  rsi    0r   CHR    1,3            60122 /dev/null
plb     22993  rsi    1w   CHR    1,3            60122 /dev/null
plb     22993  rsi    2w   CHR    1,3            60122 /dev/null
plb     22993  rsi    3uw  REG   8,19       0 10944633 /opt/rsi/plb90f/.^A^A^A^A^A^A^A^B
plb     22993  rsi    4u   REG   8,19    2048  2965515 /opt/rsi/de/ham/data/cook.isi
plb     22993  rsi    5u   REG   8,19    4509  2949322 /opt/rsi/de/ham/data/cook.txt
plb     22993  rsi    6u   REG   8,19    2304  3134942 /opt/rsi/de/ham/scra/bcjmgr301100000009640.par
plb     22993  rsi    7u   REG   8,19    2048   868402 /opt/rsi/de/data/bct.isi
plb     22993  rsi    8u   REG   8,19    1512   869372 /opt/rsi/de/data/bct.txt
... data files shortened ...

If the BCJ process has been started by the cluster, lsof also lists IP sockets for it. The application is so old that it has no clue about IP, so it will not open IP sockets itself.

I would have added another lsof output, but I disabled the cluster software at the customer's request.
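
For anyone who wants to check the same thing on a running process without reading a full lsof listing, something along these lines might do. It is only a sketch, assuming Linux with /proc mounted; the function name list_socket_fds and the PROC_ROOT override are made up for illustration. It reports every socket descriptor (UNIX domain sockets included), so it is a broader filter than "IP sockets only".

```shell
#!/bin/sh
# Sketch: print the descriptors of a process that refer to sockets.
# Usage: list_socket_fds PID
# PROC_ROOT is a hypothetical override (defaults to /proc) so the function
# can be pointed at a different directory tree.
list_socket_fds() {
    pid=$1
    proc_root=${PROC_ROOT:-/proc}
    for fd in "$proc_root"/"$pid"/fd/*; do
        # Each fd entry is a symlink; sockets appear as "socket:[inode]".
        # If the PID does not exist, readlink fails and the entry is skipped.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            socket:*) echo "${fd##*/} $target" ;;
        esac
    done
}

# With lsof itself, a comparable filter for Internet sockets only is:
#   lsof -a -i -p 22993
# (-i selects Internet sockets, -p the process, -a ANDs the two conditions)
```

Calling list_socket_fds 22993 once for a job manager started by hand and once for one started by the cluster would show which descriptors only appear in the second case.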

Another weird fact is that the application's index files get corrupted if I use ext3; ext2 is fine. The application is technically old-fashioned and stores its data in hundreds of text files with external index files and concurrent access. Without the cluster, ext3 is fine as well. (The same goes for LVM, by the way.)

Now -- here's the thing.  Earlier versions of clumanager (<1.2.22) had a

I have been running 1.2.22.

status return and restart on the same node.  Also, the most recent
errata fixed a signal handling problem which broke JVMs from running
under it.  Either of these may have caused the problems on your cluster,
I don't know.  The former would have associated log messages; the latter
wouldn't.

There have not been any log messages.

I'd try the latest release from RHN (clumanager-1.2.26.1-1).

Hmm, I will probably not start up the cluster again... :(

If that doesn't work, I'd call Red Hat Support...

While calling support is always an option, I am pretty sure that it will not lead to a solution. In the end they will not be able to reproduce it, and I can't test on a customer's production system.

Do not point me to test systems -- they are there, but they do not have the problem. It seems to be related to the workload of the machine, which is hard to simulate.

regards, Gunther
begin:vcard
fn:Gunther Schlegel
n:Schlegel;Gunther
org:Riege Software International GmbH;IT Infrastructure
adr:;;Mollsfels 10;Meerbusch;;40670;Germany
email;internet:schlegel@xxxxxxxxx
title:Manager IT Infrastructure
tel;work:+49-2159-91480
tel;fax:+49-2159-9148-11
x-mozilla-html:FALSE
url:http://riege.com
version:2.1
end:vcard

--

Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster
