Re: Troubleshooting methods for failed process

Andreas Kurz <andreas@xxxxxxxxxxx> · Thu, 10 Nov 2011 09:54:43 +0100

On 11/10/2011 05:05 AM, Earl Ruby wrote:
> After thinking about it a bit more I noticed that the Apache log shows
> the "caught SIGTERM, shutting down" message 1 second after the start
> message, so I thought maybe Pacemaker wasn't allowing Apache enough time
> to start, so I manually set the timeout for the start operation to 40s
> (by default it should be 40s already) (see bottom of message for my config).
> 
> This did not fix the problem.
> 
> I did find /usr/lib/ocf/resource.d/heartbeat/apache, which is what
> Pacemaker uses to start, stop, and monitor Apache. When I run it
> manually, to start Apache, "waiting for apache /etc/apache2/httpd.conf
> to come up" is followed IMMEDIATELY by the kill attempt. It does not
> wait 40s for the start to timeout:
> 
> # OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/apache start
> apache[27172]: INFO: apache not running
> apache[27172]: INFO: waiting for apache /etc/apache2/httpd.conf to come up
> /usr/lib/ocf/resource.d/heartbeat/apache: line 440: kill: (27389) - No such process
> apache[27172]: INFO: Killing apache PID 27389
> apache[27172]: INFO: apache stopped.
> 
> 
> If I try to monitor Apache while it's off:
> 
> # OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/apache monitor
> apache[30211]: INFO: apache not running
> 
> 
> ... which is correct. If I then manually start Apache and then run
> "monitor" it shows that it's running, so Pacemaker *could* tell that
> Apache is running if it was working right:
> 
> # rcapache2 start
> Starting httpd2 (prefork)
>                        done
> 
> # rcapache2 start
> Apache is already running (/var/run/httpd2.pid)
>                        done
> 
> # OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/apache monitor
> 
> (no error message; "/usr/lib/ocf/resource.d/heartbeat/apache monitor" is
> showing that Apache is running.)
> 
> 
> So the problem seems to be that Pacemaker starts Apache, immediately
> checks to see if it's running and when it's not running a split second
> later Pacemaker (or more precisely
> /usr/lib/ocf/resource.d/heartbeat/apache) then kills the process without
> waiting for it to start.
> 
> Any suggestions?
> 
> 
> 
> node install0
> node install1
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>         params ip="192.168.1.24" cidr_netmask="32" \
>         op monitor interval="30s"
> primitive FileSystemDRBD ocf:heartbeat:Filesystem \
>         params device="/dev/drbd0" directory="/home/src" fstype="ext3" \
>         operations $id="FileSystemDRBD-operations" \
>         op start interval="0" timeout="60" \
>         op stop interval="0" timeout="60" \
>         op monitor interval="20" timeout="40" start-delay="0" \
>         op notify interval="0" timeout="60"
> primitive VolumeDRBD ocf:linbit:drbd \
>         params drbd_resource="install" \
>         operations $id="VolumeDRBD-operations" \
>         op start interval="0" timeout="240" \
>         op promote interval="0" timeout="90" \
>         op demote interval="0" timeout="90" \
>         op stop interval="0" timeout="100" \
>         op monitor interval="10" timeout="20" start-delay="0" \
>         op notify interval="0" timeout="90" \
>         meta target-role="started"
> primitive WebSite ocf:heartbeat:apache \
>         operations $id="WebSite-operations" \
>         op start interval="0" timeout="40s" \
>         op stop interval="0" timeout="60s" \
>         op monitor interval="10" timeout="20" start-delay="0" \
>         meta target-role="started"
> group Cluster ClusterIP FileSystemDRBD WebSite \
>         meta target-role="Started"
> ms MasterDRBD VolumeDRBD \
>         meta clone-max="2" notify="true" target-role="started"
> colocation WebServerWithIP inf: Cluster MasterDRBD:Master
> order StartFileSystemFirst inf: MasterDRBD:promote Cluster:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.5-ecb6baaf7fc091b023d6d4ba7e0fce26d32cf5c8" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore" \
>         last-lrm-refresh="1320896853"
> 
> 
> 
> On 11/09/2011 06:43 PM, Earl Ruby wrote:
>> I've set up a 2-node Corosync cluster with Master/Slave DRBD, ClusterIP,
>> a Filesystem resource, and Apache.
>>
>> Everything works fine except Apache. I can start Apache from the command
>> line just fine, but when I shut it off on both nodes and then run:
>>
>> crm resource cleanup WebSite
>>
>> It fails to start. The Apache error_log on both nodes shows two lines
>> each time I run cleanup:
>>
>> [Thu Nov 10 02:37:33 2011] [notice] Apache/2.2.17 (Linux/SUSE) mod_ssl/2.2.17 OpenSSL/1.0.0c mod_perl/2.0.5 Perl/v5.12.3 configured -- resuming normal operations
>> [Thu Nov 10 02:37:34 2011] [notice] caught SIGTERM, shutting down
>>
>> "grep -i apache /var/log/corosync.log" gives no useful info.
>>
>> Any idea on what command Pacemaker uses to start Apache? As I said, *I*
>> can start it from the command line no problem, but Pacemaker fails.
>>
>> Any suggestions on how I should go about troubleshooting this? What I
>> should be looking at?

The default monitor requests Apache's status URL. Typically mod_status
is not enabled, and therefore the monitoring fails.
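
To see this from the command line, you can reproduce roughly what such a
check does: fetch the status page and look for an expected pattern. This
is only a sketch, not the resource agent's actual code. The function
names are made up for illustration, the default regex is my assumption
about the agent's default, and curl is assumed to be installed.

```shell
#!/bin/sh
# Sketch of a status-URL health check (hypothetical helper names).

# Assumed default pattern: the closing HTML tag of the status page.
DEFAULT_REGEX='</ *html *>'

# Succeeds if the page text on stdin matches the regex given as $1
# (falls back to DEFAULT_REGEX when $1 is missing or empty).
status_page_ok() {
    grep -Eiq -- "${1:-$DEFAULT_REGEX}"
}

# Fetches URL $1 and checks it against the optional regex $2.
# -f makes curl fail on HTTP errors such as 403/404 instead of printing
# the error page, so a missing status page counts as a failure.
check_status_url() {
    curl -fs --max-time 10 "$1" | status_page_ok "$2"
}
```

Running something like "check_status_url http://localhost/server-status"
on the node (the URL is an assumption, use whatever your setup serves)
while Apache is up tells you whether a status-page check can succeed at
all. If it returns non-zero, a monitor relying on that URL will fail too.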

Either enable mod_status for local requests or change the "statusurl"
parameter.
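
For the second option, the WebSite primitive could look something like
the fragment below. It is a sketch based on the configs quoted in this
thread; the statusurl value is an assumed example, so point it at a URL
that Apache on your nodes really answers locally. For the first option,
enabling mod_status usually means loading the module and adding a
server-status location that only allows requests from 127.0.0.1.

```
# statusurl below is an assumed example value, adjust it to your setup
primitive WebSite ocf:heartbeat:apache \
        params configfile="/etc/apache2/httpd.conf" \
               statusurl="http://localhost/server-status" \
        op start interval="0" timeout="40s" \
        op stop interval="0" timeout="60s" \
        op monitor interval="10" timeout="20"
```

After changing the resource, "crm resource cleanup WebSite" clears the
old failures so Pacemaker tries the start again.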

"crm ra info ocf:heartbeat:apache" / "man ocf_heartbeat_apache" ... are your friends ;-)

Regards,
Andreas

-- 
Need help with Pacemaker/Corosync/DRBD?
http://www.hastexo.com/now

>>
>> My config looks like this:
>>
>> node install0
>> node install1
>> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>>         params ip="192.168.1.24" cidr_netmask="32" \
>>         op monitor interval="30s"
>> primitive FileSystemDRBD ocf:heartbeat:Filesystem \
>>         params device="/dev/drbd0" directory="/home/src" fstype="ext3" \
>>         op monitor interval="60" timeout="40" start-delay="10" \
>>         op start interval="0" timeout="60" \
>>         op stop interval="0" timeout="60"
>> primitive VolumeDRBD ocf:linbit:drbd \
>>         params drbd_resource="install" \
>>         operations $id="VolumeDRBD-operations" \
>>         op start interval="0" timeout="240" \
>>         op promote interval="0" timeout="90" \
>>         op demote interval="0" timeout="90" \
>>         op stop interval="0" timeout="100" \
>>         op monitor interval="10" timeout="20" start-delay="0" \
>>         op notify interval="0" timeout="90" \
>>         meta target-role="started"
>> primitive WebSite ocf:heartbeat:apache \
>>         params configfile="/etc/apache2/httpd.conf" \
>>         op monitor interval="1min"
>> group Cluster ClusterIP FileSystemDRBD WebSite \
>>         meta target-role="Started"
>> ms MasterDRBD VolumeDRBD \
>>         meta clone-max="2" notify="true" target-role="started"
>> colocation WebServerWithIP inf: Cluster MasterDRBD:Master
>> order StartFileSystemFirst inf: MasterDRBD:promote Cluster:start
>> property $id="cib-bootstrap-options" \
>>         dc-version="1.1.5-ecb6baaf7fc091b023d6d4ba7e0fce26d32cf5c8" \
>>         cluster-infrastructure="openais" \
>>         expected-quorum-votes="2" \
>>         stonith-enabled="false" \
>>         no-quorum-policy="ignore" \
>>         last-lrm-refresh="1320891100"
>>
> 

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss