On Wed, 2005-07-06 at 08:30 +0200, Gunther Schlegel wrote:

> The clustered application does a lot of printing (lprng),
> faxing(hylafax) and mailing(sendmail). It uses shell scripts to pass the
> jobs to the operating systems daemons.
> The client programs of these daemons, which pass jobs to the daemons
> using network connections to localhost start to behave irregular when
> the cluster is up for about 2 weeks.
> Examples:
> - hylafax faxstat stops listing the transmitted faxes in the middle of
>   the list ( but always at the same job )
> - sendmail opens a connection to the local daemon but does not transfer
>   the message. Both processes sit there and wait, after some time the
>   server closes the connection because of missing input from the clients side.
> - same with lpr.
>
> I assume that something locks up in the ip stack. Not all services are
> affected at the same time.
>
> I guess this is related to the cluster software as we run that
> application on a lot of servers which all do not show this behaviour and
> that are all not clustered.

I doubt it, but it's not out of the realm of possibility.

The cluster software does three things mostly:

(a) figures out who's online
(b) shoots nodes
(c) manages services using shell scripts

The shell scripts call standard utilities (ifconfig, route, etc.).

Now -- here's the thing. Earlier versions of clumanager (<1.2.22) had a
problem where sometimes (and randomly!) services would get a bogus
status return and restart on the same node. Also, the most recent
errata fixed a signal handling problem which prevented JVMs from running
under it.

Either of these may have caused the problems on your cluster; I don't
know. The former would have associated log messages; the latter
wouldn't.

I'd try the latest release from RHN (clumanager-1.2.26.1-1). If that
doesn't work, I'd call Red Hat Support...

-- Lon

--
Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster