Later activation of the HW watchdog

jan.kundrat@xxxxxxxxx (Jan Kundrát) · Thu, 14 Jun 2018 19:10:36 +0200

On Tuesday, 24 October 2017 17:10:39 CEST, Jan Kundrát wrote:
> Hi,
> is it possible to change systemd's global settings for 
> RuntimeWatchdogSec at runtime? I would like to have the early 
> boot "guarded" by the HW watchdog started by my platform code, 
> and for systemd to take over only after a certain target has 
> been reached. I was thinking about an extra unit which simply 
> writes an appropriate config file, but the docs for `systemctl 
> daemon-reload` or `daemon-reexec` do not talk about these 
> top-level settings. How do I tell systemd to notice a new value?
>
> Context: I'm using systemd on an embedded ARM box with reliable 
> network connectivity. The system has two fully separate 
> rootfs/kernel/devicetree instances, A and B. The bootloader 
> starts a HW watchdog timer and keeps a counter tracking how
> many times a particular A/B "boot slot" has attempted to boot.
> The kernel ignores the watchdog, and once systemd gets launched
> and checks its system.conf file, it
> proceeds to re-start the WD timer periodically. Finally, a unit 
> which is pulled in by my default target updates the bootloader's 
> environment, resetting the boot counter.
>
> My goal is to be able to boot a possibly broken image (but not 
> a malicious one, of course) without fearing that it's going to 
> lock me out of my device. If the new image "fails" for some 
> reason, I expect the HW watchdog to reset the system, the boot
> attempt counter to eventually reach zero, and the whole system
> to roll back to the previous image. In my scenario,
> rebooting automatically is preferable to waiting for a human
> to step in and solve the actual problem.
> The once-failed slot can be re-flashed very cheaply, and an
> updated version can be re-tried during the next update attempt.
>
> During my testing, I was able to unplug the system's SD card at 
> a "wrong" moment which resulted in systemd trying to boot into 
> emergency.target and ultimately failing due to a missing rootfs. 
> I ended up with an unusable system which did not reboot 
> automatically because systemd was periodically pinging the HW 
> watchdog timer. [1]
>
> I got a suggestion to adjust the important units so that they 
> specify a FailureAction. I do not like that solution because it 
> is additional work (identifying which units might fail, coming 
> up with various possible failing scenarios, being hard to test 
> and get "right" in face of systemd updates in future, etc). It 
> also feels like I am attacking a wrong problem. I already *have* 
> a watchdog which will shoot the system in the head if
> something goes wrong. Wouldn't it make more sense to rely on
> this piece of infrastructure and start telling the watchdog
> "hey, I'm OK" only after the system has fully booted and my
> ultimate target has been *reached*?
>
> Suggestions which offer additional possibilities are welcome. I
> like systemd's feature set, and I won't pretend that I know all
> of its features :).
>
> With kind regards,
> Jan
>
> [1] https://github.com/systemd/systemd/issues/7063

I more or less solved this by *not* configuring systemd to start pinging 
the watchdog on its own. Then I added another unit depending on and being 
wanted by multi-user.target which checks whether everything is OK so far:

  [Unit]
  Description=Pinging the HW watchdog
  Requires=multi-user.target
  After=multi-user.target

  [Service]
  Type=oneshot
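  # Fail (and leave the watchdog unarmed) if any unit is in the failed state
  ExecStartPre=/bin/sh -c '[ -z "$(/bin/systemctl list-units --failed --all --no-legend --no-pager)" ]'
  # Ask PID 1 to start pinging the HW watchdog with a 30 s timeout
  ExecStart=/bin/busctl set-property org.freedesktop.systemd1 \
      /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager \
      RuntimeWatchdogUSec t 30000000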
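
If you would rather keep this logic in a script than inline in the unit, the same two steps can be sketched as a small shell helper. This is only a sketch: the function names and the WATCHDOG_USEC variable are made up for illustration, and only the systemctl and busctl invocations are taken from the unit above.

```shell
#!/bin/sh
# Sketch: arm systemd's runtime watchdog only when no unit has failed.
# WATCHDOG_USEC and all function names are illustrative assumptions.

WATCHDOG_USEC=30000000  # 30 s, the same value as in the unit above

# Succeeds (exit status 0) only when systemctl lists no failed units.
no_failed_units() {
    [ -z "$(systemctl list-units --failed --all --no-legend --no-pager)" ]
}

# Ask PID 1 over D-Bus to start pinging the hardware watchdog.
arm_watchdog() {
    busctl set-property org.freedesktop.systemd1 \
        /org/freedesktop/systemd1 \
        org.freedesktop.systemd1.Manager \
        RuntimeWatchdogUSec t "$WATCHDOG_USEC"
}

# Run the health check first; leave the watchdog alone if it fails.
arm_watchdog_if_healthy() {
    if no_failed_units; then
        arm_watchdog
    else
        echo "failed units present, not arming the watchdog" >&2
        return 1
    fi
}
```

A script would call arm_watchdog_if_healthy as its last step. Whether the new value took effect can then be checked by hand with `busctl get-property` on the same object, interface and property.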

For more details, see the original bug report at 
https://github.com/systemd/systemd/issues/7063 .

Cheers,
Jan