On 11/25/2013 08:50 AM, Hannes Reinecke wrote:
On 11/22/2013 11:17 PM, Benjamin Marzinski wrote:
[ .. ]
I'm not asking for systemd to actually shut down multipathd. In a
production setup, killing multipathd because it had a temporary stall
seems like bad default behavior. I haven't looked at the systemd
watchdog code to know if this is possible, but ideally, multipathd would
be able to just start sending watchdog notifications again, and be able
to continue on with just a message in the logs recording the timeout.
Not stopping. Restarting.
The whole point of the watchdog code is to take some action if the
watchdog messages fail.
We should aim for
a) make the watchdog interval the longest time we're prepared to
wait for checkerloop to complete (hence the patch to measure the
elapsed time per loop iteration)
b) have systemd restart multipathd whenever the watchdog triggers,
as then we're sure we can't recover from this.
That should cover your sentiment, right?
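
As a rough illustration of point a), here is a minimal sketch in C of timing one loop iteration against a watchdog interval. It is not the actual patch: the names elapsed_ms(), iteration_overran() and time_iteration() are hypothetical, and the interval is assumed to be passed in as milliseconds.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Milliseconds elapsed between two CLOCK_MONOTONIC readings. */
static int64_t elapsed_ms(const struct timespec *start,
                          const struct timespec *end)
{
    int64_t ns = ((int64_t)end->tv_sec - (int64_t)start->tv_sec) * 1000000000LL
               + ((int64_t)end->tv_nsec - (int64_t)start->tv_nsec);
    return ns / 1000000LL;
}

/* True if one iteration took longer than the watchdog interval. */
static bool iteration_overran(int64_t took_ms, int64_t watchdog_ms)
{
    return took_ms > watchdog_ms;
}

/* Hypothetical helper: run one checker pass and return its duration in ms.
 * The pass callback stands in for one iteration of the checker loop. */
static int64_t time_iteration(void (*pass)(void))
{
    struct timespec start, end;

    assert(pass != NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);
    pass();
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ms(&start, &end);
}
```

Logging the largest value returned by time_iteration() over a long run would give a data point for choosing the watchdog interval.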
I realize that there is a benefit to letting people know that there was
a problem, but the way it's appearing now, it will be pretty confusing to
the sysadmin who sees that, and filling up the logs with notification
rejections is pretty annoying.
Yeah, correct. We should be using the 'restart' flag in the service
file. I did not do this as the patch went into systemd only
recently, and one would need to figure out how to treat
installations where an older systemd version is running.
And it also looks as if we'd be tripping over RH bug#982379, where
the watchdog fails to shut down a process properly. That is
apparently fixed in systemd 206, so we'd need a recent systemd for
this to work properly.
I'm _quite_ sure there are errors in earlier versions, where the
watchdog feature just causes a new process to be started, without
terminating the old one. _Very_ annoying.
I'll retest with the latest systemd and make the watchdog feature
conditional on the systemd version.
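
One way such a version check might look, as a sketch only: the helper names below are hypothetical, the input is assumed to be the first line printed by "systemd --version" (for example "systemd 208"), and the threshold of 206 is taken from the bug report mentioned above, not verified.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption from this thread: the watchdog bug is apparently fixed in 206. */
#define MIN_WATCHDOG_SYSTEMD_VERSION 206

/* Parse a line such as "systemd 208"; returns -1 if it does not match. */
static int parse_systemd_version(const char *line)
{
    int version;

    if (line == NULL || sscanf(line, "systemd %d", &version) != 1)
        return -1;
    return version;
}

/* True if the watchdog feature should be enabled for this version. */
static bool watchdog_usable(int version)
{
    return version >= MIN_WATCHDOG_SYSTEMD_VERSION;
}
```

An unparsable version string yields -1, so the watchdog stays disabled when the version is unknown.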
Cheers,
Hannes