On 11/22/2013 11:17 PM, Benjamin Marzinski wrote:
> On Fri, Nov 15, 2013 at 11:29:40AM +0100, Hannes Reinecke wrote:
>> In the past there have been several instances where multipathd
>> would hang with the checkerloop as some path checker might not
>> be able to return in time.
>> This patch now activates the watchdog feature from systemd
>> to shutdown (and possibly restart) multipathd in these
>> situations.
>
> This might need more of a systemd fix than a multipathd one, but once
> multipathd times out the watchdog timer, even if it starts sending
> notifications at an acceptable rate again, the service is still listed
> as failed.
>
> # service multipathd status
> Redirecting to # /bin/systemctl status multipathd.service
> multipathd.service - Device-Mapper Multipath Device Controller
>    Loaded: loaded (/usr/lib/systemd/system/multipathd.service; enabled)
>    Active: failed (Result: watchdog) since Fri 2013-11-22 09:43:01 CST; 9min ago
>  Main PID: 6321
>    Status: "running"
>    CGroup: name=systemd:/system/multipathd.service
>            └─6321 /sbin/multipathd -d -s
>
> More annoying, the logs fill up with messages like
>
> Nov 22 09:46:28 ask-08 systemd[1]: multipathd.service: Got notification
> message from PID 6321, but reception only permitted for PID 0
> Nov 22 09:46:29 ask-08 systemd[1]: multipathd.service: Got notification
> message from PID 6321, but reception only permitted for PID 0
> Nov 22 09:46:30 ask-08 systemd[1]: multipathd.service: Got notification
> message from PID 6321, but reception only permitted for PID 0
> Nov 22 09:46:31 ask-08 systemd[1]: multipathd.service: Got notification
> message from PID 6321, but reception only permitted for PID 0
>
> Also
>
> # service multipathd stop
>
> won't kill it. Even worse
>
> # service multipathd start
>
> WILL kill it without successfully restarting another version. A second
>
> # service multipathd start
>
> is necessary to get things back to a functional state again.
>
Actually, upstream systemd (>= v207) now has a new flag
Restart=on-watchdog
With that, systemd should be restarting multipathd after a watchdog
timeout. That should solve your immediate problem here.

> I'm not asking for systemd to actually shut down multipathd. In a
> production setup, killing multipathd because it had a temporary stall
> seems like bad default behavior. I haven't looked at the systemd
> watchdog code to know if this is possible, but ideally, multipathd would
> be able to just start sending watchdog notifications again, and be able
> to continue on with just a message in the logs recording the timeout.
>
Not stopping. Restarting.
The whole point of the watchdog code is to take some action if the
watchdog messages fail.

We should aim to
a) make the watchdog interval the longest interval we're prepared
   to allow checkerloop to take to complete (hence the patch to
   measure the elapsed time per loop iteration)
b) have systemd restart multipathd whenever the watchdog triggers,
   as then we're sure we can't recover from this.

That should cover your sentiment, right?

> I realize that there is a benefit to letting people know that there was
> a problem, but the way it's appearing now, it will be pretty confusing to
> the sysadmin who sees that, and filling up the logs with notification
> rejections is pretty annoying.
>
Yeah, correct. We should be using the 'restart' flag in the service
file. I did not do this as the patch went into systemd only
recently, and one would need to figure out how to treat
installations where an older systemd version is running.

> And as long as I'm asking for systemd things, the ability to add a rule
> to the unit file that kills the service and forces a core dump when
> watchdog timer was tripped would help tracking down what's stalling the
> checker loop. Like I said before, I don't think this should be
> happening by default, but putting it in there commented out might not be
> a bad idea.
>
Yeah, that would be preferable.
Sadly there is no 'force coredump' option.
What I would like to have is an 'on-watchdog' option in systemd,
where one can configure the action which needs to be taken when the
watchdog triggers.
Only, adding a new option touches systemd in tons of places, so my
initial attempt here failed.
So I went for the easier option of just adding a new flag to an
existing setting.

Cheers,

Hannes

P.S.: But hey, at least someone is actually testing this stuff.
Cool.
-- 
Dr. Hannes Reinecke                   zSeries & Storage
hare@xxxxxxx                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
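
For reference, the keep-alive scheme discussed in this thread can be
sketched roughly as follows. This is only an illustration in Python, not
the multipathd patch (multipathd is written in C and would go through
libsystemd). It assumes the service runs with a watchdog timeout
configured, so that systemd exports NOTIFY_SOCKET and WATCHDOG_USEC to
the process; the names sd_notify (a local reimplementation),
watchdog_interval, checker_loop and check_paths are made up for the
example, and the "half the limit" warning threshold is an arbitrary
choice.

```python
import os
import socket
import time


def sd_notify(state, env=os.environ):
    """Send a state string to the datagram socket named by NOTIFY_SOCKET.

    Returns False when no notify socket is configured, e.g. when the
    process was not started by systemd.
    """
    path = env.get("NOTIFY_SOCKET")
    if not path:
        return False
    addr = os.fsencode(path)
    if addr.startswith(b"@"):
        # A leading '@' denotes a Linux abstract-namespace socket.
        addr = b"\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), addr)
    return True


def watchdog_interval(env=os.environ):
    """Return the watchdog timeout in seconds, or None if it is unset."""
    usec = env.get("WATCHDOG_USEC")
    return int(usec) / 1_000_000 if usec else None


def checker_loop(check_paths, iterations, env=os.environ):
    """Run check_paths() repeatedly and ping the watchdog after each pass.

    Returns the longest observed pass duration in seconds, which is the
    number one would compare against the configured watchdog interval.
    """
    limit = watchdog_interval(env)
    slowest = 0.0
    for _ in range(iterations):
        start = time.monotonic()
        check_paths()  # hypothetical stand-in for one path checker pass
        elapsed = time.monotonic() - start
        slowest = max(slowest, elapsed)
        if limit is not None and elapsed > limit / 2:
            # Assumed threshold: warn well before the timeout is reached.
            print(f"checker pass took {elapsed:.3f}s (limit {limit:.3f}s)")
        sd_notify("WATCHDOG=1", env)
    return slowest
```

If a single pass takes longer than the configured timeout, no
"WATCHDOG=1" message arrives in time and systemd treats the service as
failed with Result: watchdog, which is the state shown in the status
output quoted above.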