On Mon, Jul 13, 2009 at 12:16:00PM -0700, Rick Stevens wrote:
> jason@xxxxxxxxxxxxxx wrote:
>> On Fri, Jul 10, 2009 at 04:50:12PM -0700, Rick Stevens wrote:
>>> jason@xxxxxxxxxxxxxx wrote:
>>>> hey cluster gurus..
>>>> I have a 2 node cluster thats been running without issue for quite a
>>>> while.. all of a sudden one of the nodes will not completely start the
>>>> apache webserver service.. it looks like this
>>>>
>>>> [root@tf1 ~]# clustat
>>>> Member Status: Quorate
>>>>
>>>>   Member Name                        Status
>>>>   ------ ----                        ------
>>>>   tf1                                Online, Local, rgmanager
>>>>   tf2                                Online, rgmanager
>>>>
>>>>   Service Name         Owner (Last)                   State
>>>>   ------- ----         ----- ------                   -----
>>>>   Apache Service       tf1                            starting
>>>>   postfix service      tf1                            started
>>>> [root@tf1 ~]#
>>>>
>>>> and I see that the httpd is NOT started. although, if I do
>>>> /etc/init.d/httpd start
>>>> the service starts without issue.
>>>> grepping for apache and http in the logs, I see this..
>>>> Jul 10 14:32:13 tf1 httpd: httpd shutdown failed
>>>> Jul 10 14:32:52 tf1 httpd: httpd shutdown failed
>>>> Jul 10 14:33:11 tf1 httpd: httpd shutdown failed
>>>> Jul 10 14:33:57 tf1 httpd: Syntax error on line 117 of /etc/httpd/conf.d/ssl.conf:
>>>> Jul 10 14:33:57 tf1 httpd: SSLCertificateFile: file '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>>>> Jul 10 14:33:57 tf1 httpd: httpd startup failed
>>>> Jul 10 14:34:06 tf1 httpd: Syntax error on line 117 of /etc/httpd/conf.d/ssl.conf:
>>>> Jul 10 14:34:06 tf1 httpd: SSLCertificateFile: file '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>>>> Jul 10 14:34:06 tf1 httpd: httpd startup failed
>>>> Jul 10 14:34:08 tf1 httpd: httpd shutdown failed
>>>> Jul 10 16:23:33 tf1 clurgmgrd: [6168]: <info> Executing /etc/init.d/httpd stop
>>>> Jul 10 16:23:34 tf1 httpd: httpd shutdown failed
>>>> Jul 10 16:24:31 tf1 httpd: httpd shutdown failed
>>>> Jul 10 16:24:36 tf1 httpd: httpd shutdown failed
>>>> Jul 10 16:24:41 tf1 httpd: httpd startup succeeded
>>>> Jul 10 18:10:13 tf1 clurgmgrd: [6231]: <info> Executing /etc/init.d/httpd stop
>>>> Jul 10 18:10:13 tf1 httpd: httpd shutdown failed
>>>> Jul 10 18:22:00 tf1 httpd: httpd startup succeeded
>>>>
>>>> [root@tf1 log]# grep apache messages
>>>> Jul 10 04:40:00 tf1 clurgmgrd[6267]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>>> Jul 10 10:04:33 tf1 clurgmgrd[6149]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>>> Jul 10 14:29:54 tf1 clurgmgrd[6281]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>>> Jul 10 16:23:34 tf1 clurgmgrd[6168]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>>> Jul 10 18:10:13 tf1 clurgmgrd[6231]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>>> [root@tf1 log]#
>>>>
>>>> Im guessing its the stop on script "cluster_apache" returned 1 (generic
>>>> error)
>>>> but I looked at the /etc/init.d/httpd on tf1 and tf2 and they are both
>>>> the same size
>>>> [root@tf2 ~]# ls -al /etc/init.d/httpd
>>>> -rwxr-xr-x 1 root root 3201 Jan 30 2007 /etc/init.d/httpd
>>>> [root@tf1 log]# ls -al /etc/init.d/httpd
>>>> -rwxr-xr-x 1 root root 3201 Jan 30 2007 /etc/init.d/httpd
>>>> and the apache service starts/stops just fine on tf2 when the services
>>>> get failed over to that machine.
>>>> any ideas on what can be wrong?
>>>
>>> tf1 is complaining about a bad SSL cert. The fact that it's complaining
>>> when being started by clurgmgrd but not when started manually indicates
>>> that clurgmgrd is starting it differently (specifying a different
>>> httpd.conf file perhaps?).
>>
>> well, heres the relevant part of my config file
>> <rm>
>>   <failoverdomains>
>>     <failoverdomain name="httpd" ordered="1" restricted="1">
>>       <failoverdomainnode name="tf1" priority="1"/>
>>       <failoverdomainnode name="tf2" priority="2"/>
>>     </failoverdomain>
>>   </failoverdomains>
>>   <resources>
>>     <script file="/etc/init.d/httpd" name="cluster_apache"/>
>>     <ip address="192.168.1.7" monitor_link="1"/>
>>     <script file="/etc/init.d/postfix" name="cluster_posstfix"/>
>>   </resources>
>>   <service autostart="1" domain="httpd" name="Apache Service">
>>     <ip ref="192.168.1.7"/>
>>     <script ref="cluster_apache"/>
>>   </service>
>>   <service autostart="1" domain="httpd" name="postfix service">
>>     <ip ref="192.168.1.7"/>
>>     <script ref="cluster_posstfix"/>
>>   </service>
>> </rm>
>> ive never seen that ssl error when starting the service manually.
>> the other thing that I noticed.. is that when I try to do
>> [root@tf1 cluster]# clusvcadm -d "Apache Service"
>> Member tf1 disabling Apache Service...
>> it just hangs there and never returns.
>
> Sorry about the delay in responding. Was out of town for the weekend.
>
> Does clusvcadm or clurgmgrd run as a different user...one that either
> can't read the SSL certs or the directory containing them? Normally
> the stuff in /etc/init.d runs as root. Running one of those scripts as
> a different user can lead to lots of permissions issues. It's bitten
> me before.

OK, so I think I've found out what's going on. There is another custom
program from a 3rd party vendor that they're trying to get going on this
cluster, and it is somehow interfering with the apache service coming up.

If this 3rd party application is NOT started (from /etc/init.d/rc.local)
when the server boots up, then the apache service comes up fine.
If the 3rd party application IS allowed to start when the server boots
up, it somehow causes the apache service to not come up correctly, and I
see this:

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  Apache Service       tf1                            starting
  postfix service      tf1                            started

But like I said earlier, /etc/init.d/httpd start still works fine either
way. If the 3rd party program is NOT running, the cluster services come
up fine on their own. Funky.

Thanks for the help,
Jason

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
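
A side note for anyone comparing boots with and without the 3rd party program: a service stuck in "starting" is easy to miss by eye, so a small script that reads the clustat service table can help. The following is only a rough sketch, not something from the thread. The function names (parse_services, not_started) are made up for illustration, the sample text is the clustat output quoted above, and it assumes that columns are separated by two or more spaces (service names here contain single spaces). The exact column layout may differ between rgmanager versions.

```python
import re

# Sample copied from the clustat output quoted in this thread.
# Column widths are an assumption; only the 2+ space separation matters.
SAMPLE = """\
Member Status: Quorate

  Member Name                        Status
  ------ ----                        ------
  tf1                                Online, Local, rgmanager
  tf2                                Online, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  Apache Service       tf1                            starting
  postfix service      tf1                            started
"""


def parse_services(clustat_text):
    """Return (name, owner, state) tuples from the service table."""
    services = []
    in_table = False
    for line in clustat_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Service Name"):
            in_table = True  # rows after this header belong to the service table
            continue
        # Skip everything before the table, blank lines, and the dashed rule.
        if not in_table or not stripped or stripped.startswith("-"):
            continue
        # Assumption: two or more spaces separate the columns.
        fields = re.split(r"\s{2,}", stripped)
        if len(fields) == 3:
            services.append(tuple(fields))
    return services


def not_started(clustat_text):
    """Return (name, state) for every service whose state is not 'started'."""
    return [(name, state)
            for name, _owner, state in parse_services(clustat_text)
            if state != "started"]


if __name__ == "__main__":
    for name, state in not_started(SAMPLE):
        print(f"{name}: {state}")
```

To use it against a live node, feed it the text printed by clustat instead of SAMPLE and run it once with the 3rd party program enabled at boot and once without.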