Re: Error when starting ccsd and proposed patch

Mathieu Avila <mathieu.avila@xxxxxxxxxxxx> · Wed, 20 Jun 2007 10:09:54 +0200

Sorry to bother you with this; am I the only one who has spotted this
issue?
I reviewed the code in cluster-2 and cluster-1.04, and the patch is
also relevant there.
An easy way to run into this problem is to generate CPU load on a
node and then start and stop ccsd and gulm in a loop. Sometimes gulm
will exit with an error complaining that it was unable to contact
ccsd.

On Fri, 15 Jun 2007 10:52:08 +0200,
Mathieu Avila <mathieu.avila@xxxxxxxxxxxx> wrote:

> Hello all,
> 
> I sometimes have trouble starting ccsd and then gulm under heavy
> CPU load. Ccsd's init script reports that it is running, but it is
> not fully initialized.
> The problem is that ccsd's main process returns before the
> daemonized ccsd process has finished initializing its sockets. The
> "cluster_communicator" thread sends SIGTERM to the parent process
> before the main thread has finished its initialization work.
> 
> With the attached patch, the cluster_communicator thread is
> started after the main thread has finished initializing. It works
> well under any load. Any daemon that needs to connect to ccsd will
> then succeed.
> It was tested with cluster-1.03, but it should work with older
> versions, since the ccsd files do not seem to have changed much.
> 
> --
> Mathieu Avila
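
For readers who want to see the ordering issue in isolation, here is a
minimal sketch in Python. It is not ccsd code and not the attached patch.
It assumes a simplified model in which a forked child stands in for the
daemonized process and the parent stands in for the process the init
script waits on. A pipe is used for the readiness notification instead
of SIGTERM, purely to keep the example safe to run. The function name,
socket path, and delay are hypothetical.

```python
import os
import socket
import tempfile
import time


def start_daemon_and_connect(init_delay=0.2):
    """Fork a child that reports readiness only after its socket listens.

    Returns True if the parent could connect right after being notified.
    """
    tmp_dir = tempfile.mkdtemp()
    sock_path = os.path.join(tmp_dir, "demo.sock")  # hypothetical path
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: stands in for the daemonized process.
        os.close(read_fd)
        try:
            time.sleep(init_delay)  # simulate slow init under CPU load
            server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            server.bind(sock_path)
            server.listen(1)
            server.settimeout(5)  # do not wait forever for the parent
            # Only now, with the socket listening, release the parent.
            os.write(write_fd, b"1")
            conn, _ = server.accept()
            conn.close()
            server.close()
        finally:
            os._exit(0)  # never fall through into the parent's code

    # Parent: returns to its caller only once the child says it is ready.
    os.close(write_fd)
    ready = os.read(read_fd, 1)  # b"" means the child exited early
    os.close(read_fd)

    connected = False
    if ready == b"1":
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            client.connect(sock_path)
            connected = True
        except OSError:
            connected = False
        finally:
            client.close()

    # Reap the child and clean up the temporary socket.
    os.waitpid(pid, 0)
    if os.path.exists(sock_path):
        os.unlink(sock_path)
    os.rmdir(tmp_dir)
    return connected
```

In this model, moving the os.write() call above the bind() and listen()
calls would reproduce the kind of race described in the quoted message:
the parent could be released, and a client could try to connect, before
the socket exists.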

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster