Re: "corosync-cfgtool -s" hangs for hours

Jan Friesse <jfriesse@xxxxxxxxxx> · Thu, 10 May 2012 09:25:11 +0200

Sebastian,
corosync 1.1.5 is an old version that is no longer supported. There were
many bug fixes between 1.1.x and 1.4.x, so I would recommend upgrading to
1.4.3 and checking whether the problem persists.

Regards,
  Honza

Sebastian Kaps wrote:
Hi,

we are running a two-node cluster on SLES11 SP1 machines with
HA-Extension and the bundled Corosync+Pacemaker packages.
Corosync is version 1.1.5.

The machines have been running for over a year now, and despite some
problems we've had with our setup, we have never before had the
problem we're now facing:

We're running a small monitoring script that checks the status of both
corosync rings every three minutes and submits the result via mail to
our monitoring server.
It basically wraps the output of "corosync-cfgtool -s" in an email.
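
A wrapper of this kind can be guarded so that a hung call does not cause
later runs to pile up. The following is only a minimal sketch, not the
script described above: the function name run_guarded and the lock file
path are hypothetical, and it assumes Linux and Python 3. The lock file
descriptor is deliberately passed on to the child, so the lock stays held
for as long as a hung child is alive and later runs are skipped. Note that
a process in uninterruptible sleep (state D) cannot be killed until the
kernel call returns, so the timeout only frees the wrapper itself.

```python
import fcntl
import os
import subprocess


def run_guarded(cmd, lock_path, timeout=30.0):
    """Run cmd unless an earlier run still holds the lock on lock_path.

    Returns (status, output); status is "ok", "failed", "timeout" or "skipped".
    """
    lock_fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return "skipped", ""
        # The child inherits lock_fd, so the lock outlives this wrapper
        # for as long as the child process exists.
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            pass_fds=(lock_fd,),
        )
        try:
            output, _ = proc.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Best effort only; do not wait, the child may be unkillable.
            proc.kill()
            return "timeout", ""
        return ("ok" if proc.returncode == 0 else "failed"), output
    finally:
        os.close(lock_fd)
```

A cron job could then call, for example,
run_guarded(["/usr/sbin/corosync-cfgtool", "-s"], "/var/run/ringcheck.lock")
and report "timeout" or "skipped" to the monitoring server instead of
starting yet another process (the lock path here is made up).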

For a few days now, we have been seeing the "corosync-cfgtool -s" call
hang for multiple hours, and during that time it blocks all subsequent
calls to "corosync-cfgtool -s" (or "-r" for that matter) on that system:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 17963 0.0 0.0 14308 704 ? D 07:36 0:00 /usr/sbin/corosync-cfgtool -s
root 18587 0.0 0.0 14308 704 ? D 07:39 0:00 /usr/sbin/corosync-cfgtool -s
root 19363 0.0 0.0 14308 708 ? D 07:42 0:00 /usr/sbin/corosync-cfgtool -s
root 20061 0.0 0.0 14308 704 ? D 07:45 0:00 /usr/sbin/corosync-cfgtool -s
root 20751 0.0 0.0 14308 708 ? D 07:48 0:00 /usr/sbin/corosync-cfgtool -s
root 21409 0.0 0.0 14308 708 ? D 07:51 0:00 /usr/sbin/corosync-cfgtool -s
root 22106 0.0 0.0 14308 704 ? D 07:54 0:00 /usr/sbin/corosync-cfgtool -s
root 22854 0.0 0.0 14316 716 ? D 07:57 0:00 /usr/sbin/corosync-cfgtool -s
root 23634 0.0 0.0 14308 704 ? D 08:00 0:00 /usr/sbin/corosync-cfgtool -s
root 24475 0.0 0.0 14308 708 ? D 08:03 0:00 /usr/sbin/corosync-cfgtool -s
root 25250 0.0 0.0 14308 704 ? D 08:06 0:00 /usr/sbin/corosync-cfgtool -s
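
A listing like the one above can also be checked automatically. The sketch
below is an illustration only: the function name stuck_processes is
hypothetical, and it assumes the 11-column "ps aux" layout shown in the
listing (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND).
It returns the processes whose STAT begins with "D" (uninterruptible sleep)
and whose command contains a given string.

```python
def stuck_processes(ps_output, needle="corosync-cfgtool"):
    """Return (pid, start, command) for D-state processes matching needle."""
    stuck = []
    for line in ps_output.splitlines():
        # Split into at most 11 fields so COMMAND keeps its arguments.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] == "USER":
            continue  # skip the header and malformed lines
        pid, stat, start, command = fields[1], fields[7], fields[8], fields[10]
        if stat.startswith("D") and needle in command:
            stuck.append((int(pid), start, command))
    return stuck
```

Fed with the output of "ps aux", a non-empty result means that at least one
earlier call is still blocked, which the monitoring script could report
instead of starting another one.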

After a few hours the piled-up processes vanish and everything works as
expected again, until it happens again. There's no problem executing
"corosync-cfgtool -s" on the other node.

Does anyone have an idea what could cause this? We didn't change anything
in the system's configuration, and the problem just appeared out of the
blue. There's also no hint in the logs.

Our corosync.conf looks like this:

----- snip -----
aisexec {
    group: root
    user: root
}
service {
    use_mgmtd: yes
    ver: 0
    name: pacemaker
}
totem {
    rrp_mode: passive
    join: 100
    max_messages: 20
    vsftype: none
    consensus: 10000
    secauth: on
    token_retransmits_before_loss_const: 10
    threads: 16
    token: 10000
    version: 2
    interface {
        bindnetaddr: 192.168.1.0
        mcastaddr: 239.250.1.1
        mcastport: 5405
        ringnumber: 0
    }
    interface {
        bindnetaddr: 194.55.223.0
        mcastaddr: 239.250.1.2
        mcastport: 5415
        ringnumber: 1
    }
    clear_node_high_bit: yes
}
logging {
    to_logfile: no
    to_syslog: yes
    debug: off
    timestamp: off
    to_stderr: yes
    fileline: off
    syslog_facility: daemon
}
amf {
    mode: disable
}
----- snip -----
