On Mon, Oct 21, 2019 at 05:50:44PM +0000, Zbigniew Jędrzejewski-Szmek wrote:
> In principle, the watchdog for services is nice. But in practice it seems
> be bring only grief. The Fedora bugtracker is full of automated reports of ABRTs,
> and of those that were fired by the watchdog, pretty much 100% are bogus, in
> the sense that the machine was resource starved and the watchdog fired.
>
> There a few downsides to the watchdog killing the service:
> 1. if it is something like logind, it is possible that it will cause user-visible
>    failure of other services
> 2. restarting of the service causes additional load on the machine
> 3. coredump handling causes additional load on the machine, quite significant
> 4. those failures are reported in bugtrackers and waste everyone's time.
>
> I had the following ideas:
> 1. disable coredumps for watchdog abrts: systemd could set some flag
>    on the unit or otherwise notify systemd-coredump about this, and it could just
>    log the occurence but not dump the core file.
> 2. generally disable watchdogs and make them opt in. We have 'systemd-analyze service-watchdogs',
>    and we could make the default configurable to "yes|no".
>
> What do you think?
> Zbyszek

I think the main issue is that the watchdog timeout hasn't been tuned
appropriately for the environment it's being applied to.

It's as if the timeouts sit somewhere near the hard real-time end of
the spectrum, while being applied to normal-priority userspace
processes that are scheduled and delayed non-deterministically. It's a
sort of impedance mismatch.

I /think/ the purpose of the watchdog is to detect when processes are
permanently wedged, capture their state for debugging, and forcefully
unwedge them. That seems perfectly reasonable, but given our
non-deterministic scheduling, the timeout used should be incredibly
long by default. It's not the kind of thing you want false positives
on; folks can always shrink the timeout if they find that desirable.

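
As a rough illustration of tuning the timeout per service: units take
a WatchdogSec= setting in their [Service] section, and a drop-in file
can override it. The sketch below assumes a hypothetical drop-in path
for systemd-logind.service and an arbitrary one-hour value; neither
comes from this thread.

```ini
# /etc/systemd/system/systemd-logind.service.d/watchdog.conf
# (hypothetical drop-in path; unit and value chosen only for illustration)
[Service]
# Fire the watchdog only after a full hour without a keep-alive ping.
WatchdogSec=1h
```

A drop-in like this would take effect after "systemctl daemon-reload"
and a restart of the unit. Someone who wants tighter detection could
set a shorter value the same way.
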
Without having spent much time thinking about this, I'd lean towards
retaining the watchdogs but making their default timeouts so long that
a program would have to be wedged for an hour or more before one
triggered.

At least that way we preserve the passive information gathering for
serious bugs which might otherwise go unnoticed in background/idle
services, improving debugging substantially, but eliminate the
problems you describe resulting from false positives.

Regards,
Vito Caputo
_______________________________________________
systemd-devel mailing list
systemd-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/systemd-devel