Re: [PATCH 09/13] multipathd: Implement systemd watchdog integration

Benjamin Marzinski <bmarzins@xxxxxxxxxx> · Fri, 22 Nov 2013 16:17:59 -0600

On Fri, Nov 15, 2013 at 11:29:40AM +0100, Hannes Reinecke wrote:
> In the past there have been several instances where multipathd
> would hang with the checkerloop as some path checker might not
> be able to return in time.
> This patch now activates the watchdog feature from systemd
> to shutdown (and possibly restart) multipathd in these
> situations.

This might need more of a systemd fix that a multipathd one, but once
multipathd times out the watchdog timer, even if it starts sending
notifications at an acceptable rate again, the service is still listed
as failed.

# service multipathd status
Redirecting to # /bin/systemctl status multipathd.service
multipathd.service - Device-Mapper Multipath Device Controller
   Loaded: loaded (/usr/lib/systemd/system/multipathd.service; enabled)
   Active: failed (Result: watchdog) since Fri 2013-11-22 09:43:01 CST; 9min ago
 Main PID: 6321
   Status: "running"
   CGroup: name=systemd:/system/multipathd.service
           └─6321 /sbin/multipathd -d -s

More annoying, the logs fills up with messages like

Nov 22 09:46:28 ask-08 systemd[1]: multipathd.service: Got notification
message from PID 6321, but reception only permitted for PID 0
Nov 22 09:46:29 ask-08 systemd[1]: multipathd.service: Got notification
message from PID 6321, but reception only permitted for PID 0
Nov 22 09:46:30 ask-08 systemd[1]: multipathd.service: Got notification
message from PID 6321, but reception only permitted for PID 0
Nov 22 09:46:31 ask-08 systemd[1]: multipathd.service: Got notification
message from PID 6321, but reception only permitted for PID 0

Also

# service multipathd stop

won't kill it. Even worse

# service multipathd start

WILL kill it without successfully restarting another version. A second

# service multipathd start

is necessary to get things back to a functional state again.

I'm not asking for systemd to actually shut down multipathd.  In a
production setup, killing multipathd because it had a temporary stall
seems like bad default behavior.  I haven't looked at the systemd
watchdog code to know if this is possible, but ideally, multipathd would
be able to just start sending watchdog notifications again, and be able
to continue on with just a message in the logs recording the timeout.

I realize that there is a benefit to letting people know that there was
a problem, but the way it's appearing now, it will be pretty confusing to
the sysadmin who sees that, and filling up the logs with notification
rejections is pretty annoying.

And as long as I'm asking for systemd things, the ability to add a rule
to the unit file that kills the service and forces a core dump when
watchdog timer was tripped would help tracking down what's stalling the
checker loop.  Like I said before, I don't think this should be
happening by default, but putting it in there commented out might not be
a bad idea.

-Ben

> 
> Signed-off-by: Hannes Reinecke <hare@xxxxxxx>
> ---
>  multipath/multipath.conf.5    |  7 +++++--
>  multipathd/main.c             | 15 ++++++++++++++-
>  multipathd/multipathd.service |  1 +
>  3 files changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5
> index 0fd3035..cf5bec0 100644
> --- a/multipath/multipath.conf.5
> +++ b/multipath/multipath.conf.5
> @@ -71,8 +71,11 @@ section recognizes the following keywords:
>  .B polling_interval
>  interval between two path checks in seconds. For properly functioning paths,
>  the interval between checks will gradually increase to
> -.B max_polling_interval;
> -default is
> +.B max_polling_interval.
> +This value will be overridden by the
> +.B WatchdogSec
> +setting in the multipathd.service definition if systemd is used.
> +Default is
>  .I 5
>  .TP
>  .B max_polling_interval
> diff --git a/multipathd/main.c b/multipathd/main.c
> index 72b3740..abeebc2 100644
> --- a/multipathd/main.c
> +++ b/multipathd/main.c
> @@ -1286,6 +1286,7 @@ checkerloop (void *ap)
>  		lock(vecs->lock);
>  		pthread_testcancel();
>  		condlog(4, "tick");
> +		sd_notify(0, "WATCHDOG=1");
>  
>  		if (vecs->pathvec) {
>  			vector_foreach_slot (vecs->pathvec, pp, i) {
> @@ -1585,7 +1586,8 @@ child (void * param)
>  	pthread_attr_t log_attr, misc_attr, uevent_attr;
>  	struct vectors * vecs;
>  	struct multipath * mpp;
> -	int i;
> +	char *envp;
> +	int i, checkint;
>  	int rc, pid_rc;
>  
>  	mlockall(MCL_CURRENT | MCL_FUTURE);
> @@ -1658,6 +1660,17 @@ child (void * param)
>  
>  	conf->daemon = 1;
>  	udev_set_sync_support(0);
> +	envp = getenv("WATCHDOG_USEC");
> +	if (envp && sscanf(envp, "%d", &checkint) == 1) {
> +		/* Value is in microseconds */
> +		checkint = checkint / 1000000;
> +		if (checkint > conf->max_checkint)
> +			conf->max_checkint = checkint;
> +		conf->checkint = checkint;
> +		condlog(3, "set checkint to %d max %d",
> +			conf->checkint, conf->max_checkint);
> +	}
> +
>  	/*
>  	 * Start uevent listener early to catch events
>  	 */
> diff --git a/multipathd/multipathd.service b/multipathd/multipathd.service
> index fb84025..848a231 100644
> --- a/multipathd/multipathd.service
> +++ b/multipathd/multipathd.service
> @@ -10,6 +10,7 @@ Type=notify
>  NotifyAccess=main
>  ExecStart=/sbin/multipathd -d -s
>  ExecReload=/sbin/multipathd reconfigure
> +WatchdogSec=5s
>  
>  [Install]
>  WantedBy=sysinit.target
> -- 
> 1.8.1.4
> 
> --
> dm-devel mailing list
> dm-devel@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel