Re: Find root of the stall: was: Re: [PATCH 2/3] livepatch: Avoid blocking tasklist_lock too long

Yafang Shao <laoar.shao@xxxxxxxxx> · Fri, 14 Feb 2025 10:44:59 +0800

On Thu, Feb 13, 2025 at 8:39 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> On Thu, Feb 13, 2025 at 8:32 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> >
> > On Thu, Feb 13, 2025 at 7:19 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
> > >
> > > On Tue 2025-02-11 14:24:36, Yafang Shao wrote:
> > > > I encountered a hard lockup when attempting to reproduce the panic issue
> > > > occurred on our production servers [0]. The hard lockup is as follows,
> > > >
> > > > [15852778.150191] livepatch: klp_try_switch_task: grpc_executor:421106 is sleeping on function do_exit
> > > > [15852778.169471] livepatch: klp_try_switch_task: grpc_executor:421244 is sleeping on function do_exit
> > > > [15852778.188746] livepatch: klp_try_switch_task: grpc_executor:421457 is sleeping on function do_exit
> > > > [15852778.208021] livepatch: klp_try_switch_task: grpc_executor:422407 is sleeping on function do_exit
> > > > [15852778.227292] livepatch: klp_try_switch_task: grpc_executor:423184 is sleeping on function do_exit
> > > > [15852778.246576] livepatch: klp_try_switch_task: grpc_executor:423582 is sleeping on function do_exit
> > > > [15852778.265863] livepatch: klp_try_switch_task: grpc_executor:423738 is sleeping on function do_exit
> > > > [15852778.285149] livepatch: klp_try_switch_task: grpc_executor:423739 is sleeping on function do_exit
> > > > [15852778.304446] livepatch: klp_try_switch_task: grpc_executor:423833 is sleeping on function do_exit
> > > > [15852778.323738] livepatch: klp_try_switch_task: grpc_executor:423893 is sleeping on function do_exit
> > > > [15852778.343017] livepatch: klp_try_switch_task: grpc_executor:423894 is sleeping on function do_exit
> > > > [15852778.362292] livepatch: klp_try_switch_task: grpc_executor:423976 is sleeping on function do_exit
> > > > [15852778.381565] livepatch: klp_try_switch_task: grpc_executor:423977 is sleeping on function do_exit
> > > > [15852778.400847] livepatch: klp_try_switch_task: grpc_executor:424610 is sleeping on function do_exit
> > >
> > > This message does not exist in vanilla kernel. It looks like an extra
> > > debug message. And many extra messages might create stalls on its own.
> >
> > Right, the dynamic_debug is enabled:
> >
> >   $ echo 'file kernel/* +p' > /sys/kernel/debug/dynamic_debug/control
> >
> > >
> > > AFAIK, your reproduced the problem even without these extra messages.
> >
> > There are two issues during the KLP transition:
> > 1. RCU warnings
> > 2. hard lockup
> >
> > RCU stalls can be easily reproduced without the extra messages.
> > However, hard lockups cannot be reproduced unless dynamic debugging is
> > enabled.
> >
> > > At least, I see the following at
> > > https://lore.kernel.org/r/CALOAHbB8j6RrpJAyRkzPx2U6YhjWEipRspoQQ_7cvQ+M0zgdXg@xxxxxxxxxxxxxx
> >
> > That's correct, this is related to the RCU warnings issue.
> >
> > >
> > > <paste>
> > > [20329703.332453] livepatch: enabling patch 'livepatch_61_release6'
> > > [20329703.340417] livepatch: 'livepatch_61_release6': starting
> > > patching transition
> > > [20329715.314215] rcu_tasks_wait_gp: rcu_tasks grace period 1109765 is
> > > 10166 jiffies old.
> > > [20329737.126207] rcu_tasks_wait_gp: rcu_tasks grace period 1109769 is
> > > 10219 jiffies old.
> > > [20329752.018236] rcu_tasks_wait_gp: rcu_tasks grace period 1109773 is
> > > 10199 jiffies old.
> > > [20329754.848036] livepatch: 'livepatch_61_release6': patching complete
> > > </paste>
> > >
> > > Could you please confirm that this the original _non-filtered_ log?
> >
> > Right.
> >
> > > I mean that the debug messages were _not_ printed and later filtered?
> >
> > Right.
> >
> > >
> > > I would like to know more about the system where RCU reported the
> > > stall. How many processes are running there in average?
> > > A rough number is enough. I wonder if it is about 1000, 10000, or
> > > 50000?
> >
> > Most of the servers have between 5,000 and 10,000 threads.
> >
> > >
> > > Also what is the CONFIG_HZ value, please?
> >
> > CONFIG_HZ_PERIODIC=y
> > CONFIG_HZ_1000=y
> > CONFIG_HZ=1000
> >
> > >
> > > Also we should get some statistics how long klp_try_switch_task()
> > > lasts in average. I have never did it but I guess that
> > > it should be rather easy with perf. Or maybe just by looking
> > > at function_graph trace.
> >
> > Currently, on one of my production servers, CPU usage is around 40%,
> > and the number of threads is approximately 9,100. The livepatch is
> > applied every 5 seconds. We can adjust the frequency, but the
> > difference won't be significant, as RCU stalls are easy to reproduce
> > even at very low frequencies.
> > The duration of klp_try_switch_task() is as follows:
> >
> > $ /usr/share/bcc/tools/funclatency klp_try_switch_task.part.0 -d 60
> > Tracing 1 functions for "klp_try_switch_task.part.0"... Hit Ctrl-C to end.
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 0        |                                        |
> >        256 -> 511        : 0        |                                        |
> >        512 -> 1023       : 1        |                                        |
> >       1024 -> 2047       : 26665    |***********                             |
> >       2048 -> 4095       : 93834    |****************************************|
> >       4096 -> 8191       : 2695     |*                                       |
> >       8192 -> 16383      : 268      |                                        |
> >      16384 -> 32767      : 24       |                                        |
> >      32768 -> 65535      : 2        |                                        |
> >
> > avg = 2475 nsecs, total: 305745369 nsecs, count: 123489
> >
> > >
> > > I would like to be more sure that klp_try_complete_transition() really
> > > could block RCU for that long. I would like to confirm that
> > > the following is the reality:
> > >
> > >   num_processes * average_klp_try_switch_task > 10second
> >
> > 9100 * 2.475 / 1000 is around 22 second
>
> It looks like I misread the nsec as usec, so the total average
> duration is actually around 22ms. ;)
> I’ll double-check it to confirm.

I've confirmed that during RCU stalls, the average_klp_try_switch_task
value isn't unusually large. However, the real issue lies in the
extended duration of the klp_try_complete_transition().

- The RCU stall

[Fri Feb 14 10:14:56 2025] livepatch: enabling patch 'livepatch_61_release6'
[Fri Feb 14 10:14:56 2025] livepatch: 'livepatch_61_release6':
starting patching transition
[Fri Feb 14 10:15:10 2025] rcu_tasks_wait_gp: rcu_tasks grace period
32001 is 10073 jiffies old.
[Fri Feb 14 10:15:14 2025] livepatch: 'livepatch_61_release6': patching complete

- klp_try_switch_task
10:17:50
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 2        |                                        |
       256 -> 511        : 5        |                                        |
       512 -> 1023       : 3        |                                        |
      1024 -> 2047       : 806      |*                                       |
      2048 -> 4095       : 30382    |****************************************|
      4096 -> 8191       : 10806    |**************                          |
      8192 -> 16383      : 65       |                                        |
     16384 -> 32767      : 9        |                                        |

avg = 3562 nsecs, total: 149912461 nsecs, count: 42078

The avg of klp_try_switch_task is only 3.5us.

- number of threads

Fri Feb 14 10:14:48
12106
Fri Feb 14 10:14:59
11900
Fri Feb 14 10:15:10
12004
Fri Feb 14 10:15:20
11899
Fri Feb 14 10:15:31
12293
Fri Feb 14 10:15:42
11831

At that point, the system has around 12,000 threads, so the total
duration comes out to approximately 12,000 * 3.5µs, or about 42ms—well
below the 10s threshold.
However, the duration of klp_try_complete_transition() exceeds 10 seconds.

10:15:41
               nsecs                         : count     distribution
                   0 -> 1                    : 0        |                    |
                   2 -> 3                    : 0        |                    |
                   4 -> 7                    : 0        |                    |
                   8 -> 15                   : 0        |                    |
                  16 -> 31                   : 0        |                    |
                  32 -> 63                   : 0        |                    |
                  64 -> 127                  : 0        |                    |
                 128 -> 255                  : 0        |                    |
                 256 -> 511                  : 0        |                    |
                 512 -> 1023                 : 0        |                    |
                1024 -> 2047                 : 0        |                    |
                2048 -> 4095                 : 0        |                    |
                4096 -> 8191                 : 0        |                    |
                8192 -> 16383                : 0        |                    |
               16384 -> 32767                : 0        |                    |
               32768 -> 65535                : 0        |                    |
               65536 -> 131071               : 0        |                    |
              131072 -> 262143               : 0        |                    |
              262144 -> 524287               : 0        |                    |
              524288 -> 1048575              : 0        |                    |
             1048576 -> 2097151              : 0        |                    |
             2097152 -> 4194303              : 0        |                    |
             4194304 -> 8388607              : 0        |                    |
             8388608 -> 16777215             : 0        |                    |
            16777216 -> 33554431             : 0        |                    |
            33554432 -> 67108863             : 1        |********************|
            67108864 -> 134217727            : 1        |********************|
           134217728 -> 268435455            : 0        |                    |
           268435456 -> 536870911            : 0        |                    |
           536870912 -> 1073741823           : 0        |                    |
          1073741824 -> 2147483647           : 0        |                    |
          2147483648 -> 4294967295           : 0        |                    |
          4294967296 -> 8589934591           : 1        |********************|
          8589934592 -> 17179869183          : 1        |********************|

avg = 5630258522 nsecs, total: 22521034091 nsecs, count: 4

The longest duration of klp_try_complete_transition() ranges from 8.5
to 17.2 seconds.

It appears that the RCU stall is not only driven by num_processes *
average_klp_try_switch_task, but also by contention within
klp_try_complete_transition(), particularly around the tasklist_lock.
Interestingly, even after replacing "read_lock(&tasklist_lock)" with
"rcu_read_lock()", the RCU stall persists. My verification shows that
the only way to prevent the stall is by checking need_resched() during
each iteration of the loop.

--
Regards
Yafang