On Tue, Feb 13, 2024 at 04:23:41PM +0100, Thomas Gleixner wrote:
> From: Jiri Wiesner <jwiesner@xxxxxxx>
>
> commit 644649553508b9bacf0fc7a5bdc4f9e0165576a5 upstream.
>
> There have been reports of the watchdog marking clocksources unstable on
> machines with 8 NUMA nodes:
>
>   clocksource: timekeeping watchdog on CPU373:
>   Marking clocksource 'tsc' as unstable because the skew is too large:
>   clocksource: 'hpet' wd_nsec: 14523447520
>   clocksource: 'tsc' cs_nsec: 14524115132
>
> The measured clocksource skew - the absolute difference between cs_nsec
> and wd_nsec - was 668 microseconds:
>
>   cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
>
> The kernel used 200 microseconds for the uncertainty_margin of both the
> clocksource and watchdog, resulting in a threshold of 400 microseconds (the
> md variable). Both the cs_nsec and the wd_nsec values indicate that the
> readout interval was circa 14.5 seconds. The observed behaviour is that
> watchdog checks failed for large readout intervals on 8 NUMA node
> machines. This indicates that the size of the skew was directly proportional
> to the length of the readout interval on those machines. The measured
> clocksource skew, 668 microseconds, was evaluated against a threshold (the
> md variable) that is suited for readout intervals of roughly
> WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
>
> The intention of 2e27e793e280 ("clocksource: Reduce clocksource-skew
> threshold") was to tighten the threshold for evaluating skew and set the
> lower bound for the uncertainty_margin of clocksources to twice
> WATCHDOG_MAX_SKEW. Later in c37e85c135ce ("clocksource: Loosen clocksource
> watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to
> 125 microseconds to fit the limit of NTP, which is able to use a
> clocksource that suffers from up to 500 microseconds of skew per second.
> Both the TSC and the HPET use default uncertainty_margin. When the
> readout interval gets stretched the default uncertainty_margin is no
> longer a suitable lower bound for evaluating skew - it imposes a limit
> that is far stricter than the skew with which NTP can deal.
>
> The root causes of the skew being directly proportional to the length of
> the readout interval are:
>
>   * the inaccuracy of the shift/mult pairs of clocksources and the watchdog
>   * the conversion to nanoseconds is imprecise for large readout intervals
>
> Prevent this by skipping the current watchdog check if the readout
> interval exceeds 2 * WATCHDOG_INTERVAL. Considering the maximum readout
> interval of 2 * WATCHDOG_INTERVAL, the current default uncertainty margin
> (of the TSC and HPET) corresponds to a limit on clocksource skew of 250
> ppm (microseconds of skew per second). To keep the limit imposed by NTP
> (500 microseconds of skew per second) for all possible readout intervals,
> the margins would have to be scaled so that the threshold value is
> proportional to the length of the actual readout interval.
>
> As for why the readout interval may get stretched: Since the watchdog is
> executed in softirq context the expiration of the watchdog timer can get
> severely delayed on account of a ksoftirqd thread not getting to run in a
> timely manner. Surely, a system with such belated softirq execution is not
> working well and the scheduling issue should be looked into but the
> clocksource watchdog should be able to deal with it accordingly.
>
> Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
> Suggested-by: Feng Tang <feng.tang@xxxxxxxxx>
> Signed-off-by: Jiri Wiesner <jwiesner@xxxxxxx>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Tested-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> Reviewed-by: Feng Tang <feng.tang@xxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Link: https://lore.kernel.org/r/20240122172350.GA740@incl
> ---
>
> Backport to 6.1, 5.15, 5.10 because tglx has too much spare time

Hey, I'll take it, thanks!  Now queued up.

greg k-h
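
For anyone following the arithmetic in the commit message, here is a rough Python sketch of it. This is not the kernel code: the helper names are hypothetical, the 200 microsecond margins and the 0.5 second WATCHDOG_INTERVAL are taken from the text above, and using the larger of the two readouts as "the readout interval" is an assumption made for illustration.

```python
# Illustrative arithmetic only, not kernel code. Helper names are hypothetical.

NSEC_PER_SEC = 1_000_000_000
NSEC_PER_USEC = 1_000

# Values quoted in the watchdog report.
wd_nsec = 14_523_447_520  # 'hpet' (watchdog) readout interval
cs_nsec = 14_524_115_132  # 'tsc' (clocksource) readout interval

# Figures from the commit text: 200 us uncertainty_margin each for the
# clocksource and the watchdog, and WATCHDOG_INTERVAL of 0.5 second.
UNCERTAINTY_MARGIN_NS = 200 * NSEC_PER_USEC
WATCHDOG_INTERVAL_NS = NSEC_PER_SEC // 2

# Threshold (the md variable): sum of the two margins, 400 us here.
md = 2 * UNCERTAINTY_MARGIN_NS


def skew_ns(cs, wd):
    """Absolute difference between the two measured intervals."""
    return abs(cs - wd)


def skew_per_second_us(cs, wd):
    """Skew normalised to microseconds per second of watchdog interval."""
    return skew_ns(cs, wd) / NSEC_PER_USEC / (wd / NSEC_PER_SEC)


def should_skip_check(cs, wd):
    """Sketch of the fix: skip when the interval exceeds 2 * WATCHDOG_INTERVAL.

    Assumption: the interval is taken as the larger of the two readouts.
    """
    return max(cs, wd) > 2 * WATCHDOG_INTERVAL_NS


print(skew_ns(cs_nsec, wd_nsec))             # 667612
print(skew_ns(cs_nsec, wd_nsec) > md)        # True: fails the fixed threshold
print(skew_per_second_us(cs_nsec, wd_nsec))  # roughly 46 us of skew per second
print(should_skip_check(cs_nsec, wd_nsec))   # True: check would be skipped
```

Normalised to the length of the interval, the reported skew is roughly 46 microseconds per second, well under the 500 microseconds per second that NTP tolerates, which is why comparing it against a fixed 400 microsecond threshold over a 14.5 second interval is too strict.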