Re: [PATCH v4 09/17] watchdog/hardlockup: Add a "cpu" param to watchdog_hardlockup_check()

Doug Anderson <dianders@xxxxxxxxxxxx> · Fri, 19 May 2023 10:21:50 -0700

Hi,

On Thu, May 11, 2023 at 7:14 AM Petr Mladek <pmladek@xxxxxxxx> wrote:
>
> On Thu 2023-05-04 15:13:41, Douglas Anderson wrote:
> > In preparation for the buddy hardlockup detector where the CPU
> > checking for lockup might not be the currently running CPU, add a
> > "cpu" parameter to watchdog_hardlockup_check().
> >
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -92,14 +92,14 @@ static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
> >  static DEFINE_PER_CPU(bool, watchdog_hardlockup_processed);
> >  static unsigned long watchdog_hardlockup_dumped_stacks;
> >
> > -static bool watchdog_hardlockup_is_lockedup(void)
> > +static bool watchdog_hardlockup_is_lockedup(unsigned int cpu)
> >  {
> > -     unsigned long hrint = __this_cpu_read(hrtimer_interrupts);
> > +     unsigned long hrint = per_cpu(hrtimer_interrupts, cpu);
>
> My radar tells me that this should be
> READ_ONCE(per_cpu(hrtimer_interrupts, cpu)) when the value might
> be modified on another CPU. Otherwise, the compiler is allowed
> to split the read into more instructions.
>
> It will be needed for the buddy detector. And it will require
> also incrementing the value in watchdog_hardlockup_interrupt_count()
> an atomic way.
>
> Note that __this_cpu_inc_return() does not guarantee atomicity
> according to my understanding. In theory, the following should
> work because counter will never be incremented in parallel:
>
> static unsigned long watchdog_hardlockup_interrupt_count(void)
> {
>         unsigned long count;
>
>         count = __this_cpu_read(hrtimer_interrupts);
>         count++;
>         WRITE_ONCE(*raw_cpu_ptr(hrtimer_interrupts), count);
> }
>
> but it is nasty. A more elegant solution might be using atomic_t
> for hrtimer_interrupts counter.

I switched it over to atomic_t.

> > -     if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
> > +     if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
> >               return true;
> >
> > -     __this_cpu_write(hrtimer_interrupts_saved, hrint);
> > +     per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
>
> IMHO, hrtimer_interrupts_saved might be handled this way.
> The value is read/written only by this function.
>
> The buddy watchdog should see consistent values even when
> the buddy CPU goes offline. This check should never race
> because this CPU should get touched when another buddy
> gets assigned.
>
> Well, it would deserve a comment.

I spent a bunch of time thinking about this too and I agree that for
hrtimer_interrupts_saved we don't need atomic_t nor even
READ_ONCE/WRITE_ONCE. I've add a comment and a note in the commit
message in v5.