Re: Troubleshooting methods for failed process

Earl Ruby <eruby@xxxxxxxxxx> · Wed, 09 Nov 2011 20:05:02 -0800

After thinking about it a bit more, I noticed that the Apache log shows
the "caught SIGTERM, shutting down" message 1 second after the start
message. I thought Pacemaker might not be allowing Apache enough time
to start, so I explicitly set the timeout for the start operation to 40s
(which should already be the default; see the bottom of this message for
my config).

This did not fix the problem.

I did find /usr/lib/ocf/resource.d/heartbeat/apache, which is what
Pacemaker uses to start, stop, and monitor Apache. When I run it
manually to start Apache, "waiting for apache /etc/apache2/httpd.conf
to come up" is followed IMMEDIATELY by the kill attempt. It does not
wait 40s for the start to time out:

# OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/apache start
apache[27172]: INFO: apache not running
apache[27172]: INFO: waiting for apache /etc/apache2/httpd.conf to come up
/usr/lib/ocf/resource.d/heartbeat/apache: line 440: kill: (27389) - No such process
apache[27172]: INFO: Killing apache PID 27389
apache[27172]: INFO: apache stopped.
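
For comparison, this is roughly the behaviour I expected from a start
action: keep re-checking until the timeout expires instead of giving up
after the first failed check. It is only a sketch, not code from the
apache resource agent; the function name wait_until is made up, and the
PID file path in the usage comment is the one rcapache2 reports on this
system.

```shell
#!/bin/sh
# wait_until is a hypothetical helper, NOT taken from the OCF apache agent.
# It runs the given check command once per second until it succeeds or
# until TIMEOUT attempts have failed.
# Usage: wait_until TIMEOUT CHECK_COMMAND [ARGS...]
wait_until() {
    timeout=$1
    shift
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Stop as soon as the check command succeeds.
        if "$@"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # The check never succeeded within the allowed number of attempts.
    return 1
}

# Example (assumes the PID file location shown by rcapache2):
#   wait_until 40 test -f /var/run/httpd2.pid
```

With a timeout of 40 this would retry for about 40 seconds before
reporting failure, which is not what the output above shows.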

If I try to monitor Apache while it's off:

# OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/apache monitor
apache[30211]: INFO: apache not running

... which is correct. If I then manually start Apache and run
"monitor", it shows that Apache is running, so Pacemaker *could* tell
that Apache is running if it were working correctly:

# rcapache2 start
Starting httpd2 (prefork)
                       done

# rcapache2 start
Apache is already running (/var/run/httpd2.pid)
                       done

# OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/apache monitor

(No error message: "/usr/lib/ocf/resource.d/heartbeat/apache monitor" is
reporting that Apache is running.)
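
Since "monitor" prints nothing on success, the exit status is the more
reliable signal. Here is a small sketch for decoding it, assuming the
standard OCF return codes (0 is OCF_SUCCESS, 7 is OCF_NOT_RUNNING). The
helper name ocf_rc_name is made up for illustration.

```shell
#!/bin/sh
# ocf_rc_name is a hypothetical helper: it translates a resource agent's
# exit status into the symbolic name used by the OCF specification.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo "unknown ($1)" ;;
    esac
}

# Example, run immediately after the agent so $? is still its status:
#   OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/apache monitor
#   ocf_rc_name $?
```

I would expect OCF_NOT_RUNNING while Apache is off and OCF_SUCCESS after
starting it by hand, but I have not captured those values here.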

So the problem seems to be that Pacemaker starts Apache and immediately
checks whether it's running. When the check fails a split second later,
Pacemaker (or more precisely /usr/lib/ocf/resource.d/heartbeat/apache)
kills the process without waiting for it to finish starting.
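
One way to see where the agent gives up would be to run it under shell
tracing. This is a sketch that assumes the agent is a plain shell script
that "sh -x" can run; the helper name trace_agent and the default log
path are made up.

```shell
#!/bin/sh
# trace_agent is a hypothetical helper, not part of Pacemaker.
# It runs a shell-script resource agent under "sh -x", so each command is
# echoed (prefixed with "+") before it executes, and writes the agent's
# output plus the trace to a log file.
# Usage: trace_agent AGENT_PATH ACTION [LOGFILE]
trace_agent() {
    agent=$1
    action=$2
    log=${3:-/tmp/ocf-trace.log}
    rc=0
    # OCF_ROOT defaults to /usr/lib/ocf, the value used in the manual runs.
    OCF_ROOT=${OCF_ROOT:-/usr/lib/ocf} sh -x "$agent" "$action" >"$log" 2>&1 || rc=$?
    echo "exit status: $rc (trace in $log)"
    return "$rc"
}

# Example:
#   trace_agent /usr/lib/ocf/resource.d/heartbeat/apache start
```

The last few "+" lines before the kill in the trace should show which
check failed.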

Any suggestions?

node install0
node install1
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="192.168.1.24" cidr_netmask="32" \
        op monitor interval="30s"
primitive FileSystemDRBD ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/home/src" fstype="ext3" \
        operations $id="FileSystemDRBD-operations" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="20" timeout="40" start-delay="0" \
        op notify interval="0" timeout="60"
primitive VolumeDRBD ocf:linbit:drbd \
        params drbd_resource="install" \
        operations $id="VolumeDRBD-operations" \
        op start interval="0" timeout="240" \
        op promote interval="0" timeout="90" \
        op demote interval="0" timeout="90" \
        op stop interval="0" timeout="100" \
        op monitor interval="10" timeout="20" start-delay="0" \
        op notify interval="0" timeout="90" \
        meta target-role="started"
primitive WebSite ocf:heartbeat:apache \
        operations $id="WebSite-operations" \
        op start interval="0" timeout="40s" \
        op stop interval="0" timeout="60s" \
        op monitor interval="10" timeout="20" start-delay="0" \
        meta target-role="started"
group Cluster ClusterIP FileSystemDRBD WebSite \
        meta target-role="Started"
ms MasterDRBD VolumeDRBD \
        meta clone-max="2" notify="true" target-role="started"
colocation WebServerWithIP inf: Cluster MasterDRBD:Master
order StartFileSystemFirst inf: MasterDRBD:promote Cluster:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.5-ecb6baaf7fc091b023d6d4ba7e0fce26d32cf5c8" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1320896853"

On 11/09/2011 06:43 PM, Earl Ruby wrote:
> I've set up a 2-node Corosync cluster with Master/Slave DRBD, ClusterIP,
> a Filesystem resource, and Apache.
> 
> Everything works fine except Apache. I can start Apache from the command
> line just fine, but when I shut it off on both nodes and then run:
> 
> crm resource cleanup WebSite
> 
> It fails to start. The Apache error_log on both nodes shows two lines
> each time I run cleanup:
> 
> [Thu Nov 10 02:37:33 2011] [notice] Apache/2.2.17 (Linux/SUSE)
> mod_ssl/2.2.17 OpenSSL/1.0.0c mod_perl/2.0.5 Perl/v5.12.3 configured --
> resuming normal operations
> [Thu Nov 10 02:37:34 2011] [notice] caught SIGTERM, shutting down
> 
> "grep -i apache /var/log/corosync.log" gives no useful info.
> 
> Any idea on what command Pacemaker uses to start Apache? As I said, *I*
> can start it from the command line no problem, but Pacemaker fails.
> 
> Any suggestions on how I should go about troubleshooting this? What I
> should be looking at?
> 
> My config looks like this:
> 
> node install0
> node install1
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>         params ip="192.168.1.24" cidr_netmask="32" \
>         op monitor interval="30s"
> primitive FileSystemDRBD ocf:heartbeat:Filesystem \
>         params device="/dev/drbd0" directory="/home/src" fstype="ext3" \
>         op monitor interval="60" timeout="40" start-delay="10" \
>         op start interval="0" timeout="60" \
>         op stop interval="0" timeout="60"
> primitive VolumeDRBD ocf:linbit:drbd \
>         params drbd_resource="install" \
>         operations $id="VolumeDRBD-operations" \
>         op start interval="0" timeout="240" \
>         op promote interval="0" timeout="90" \
>         op demote interval="0" timeout="90" \
>         op stop interval="0" timeout="100" \
>         op monitor interval="10" timeout="20" start-delay="0" \
>         op notify interval="0" timeout="90" \
>         meta target-role="started"
> primitive WebSite ocf:heartbeat:apache \
>         params configfile="/etc/apache2/httpd.conf" \
>         op monitor interval="1min"
> group Cluster ClusterIP FileSystemDRBD WebSite \
>         meta target-role="Started"
> ms MasterDRBD VolumeDRBD \
>         meta clone-max="2" notify="true" target-role="started"
> colocation WebServerWithIP inf: Cluster MasterDRBD:Master
> order StartFileSystemFirst inf: MasterDRBD:promote Cluster:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.5-ecb6baaf7fc091b023d6d4ba7e0fce26d32cf5c8" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore" \
>         last-lrm-refresh="1320891100"
> 

-- 
Earl C. Ruby III
Director of Engineering
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss