On Mon, Oct 21, 2019 at 02:32:08PM -0700, Vito Caputo wrote:
> On Mon, Oct 21, 2019 at 05:50:44PM +0000, Zbigniew Jędrzejewski-Szmek wrote:
> > In principle, the watchdog for services is nice. But in practice it seems
> > to bring only grief. The Fedora bugtracker is full of automated reports of ABRTs,
> > and of those that were fired by the watchdog, pretty much 100% are bogus, in
> > the sense that the machine was resource starved and the watchdog fired.
> >
> > There are a few downsides to the watchdog killing the service:
> > 1. if it is something like logind, it is possible that it will cause user-visible
> >    failure of other services
> > 2. restarting of the service causes additional load on the machine
> > 3. coredump handling causes additional load on the machine, quite significant
> > 4. those failures are reported in bugtrackers and waste everyone's time.
> >
> > I had the following ideas:
> > 1. disable coredumps for watchdog abrts: systemd could set some flag
> >    on the unit or otherwise notify systemd-coredump about this, and it could just
> >    log the occurrence but not dump the core file.
> > 2. generally disable watchdogs and make them opt in. We have 'systemd-analyze service-watchdogs',
> >    and we could make the default configurable to "yes|no".
> >
> > What do you think?
> > Zbyszek
> >
> I think the main issue is the watchdog timeout hasn't been tuned
> appropriately for the environment it's being applied to.
>
> It's as if the timeouts are somewhere near the hard real-time
> expectations end of the spectrum, while being applied to
> non-deterministically delayed and scheduled normal priority userspace
> processes. It's a sort of impedance mismatch.
>
> I /think/ the purpose of the watchdog is to detect when processes are
> permanently wedged, capture their state for debugging, and forcefully
> unwedge them.
>
> That seems perfectly reasonable, but the timeout heuristic being used,
> given our non-deterministic scheduling, should be incredibly long by
> default. It's not the kind of thing you want false positives on, folks
> can always shrink the timeout if they find it's desirable.

It is now 3 minutes in all systemd units. Dunno, maybe we should make
that 30 minutes.

Zbyszek

> Without having spent much time thinking about this, I'd lean towards
> retaining the watchdogs but making their default timeouts so long a
> program would have to be wedged for an hour+ before it triggered.
>
> At least that way we preserve the passive information gathering of
> serious bugs which might otherwise go unnoticed with background/idle
> services, improving debugging substantially, but eliminate the problems
> you describe resulting from false positives.

_______________________________________________
systemd-devel mailing list
systemd-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/systemd-devel