On Fri, Feb 7, 2025 at 12:43 AM Petr Mladek <pmladek@xxxxxxxx> wrote:
>
> On Wed 2025-02-05 16:39:11, Yafang Shao wrote:
> > On Fri, Jan 31, 2025 at 9:39 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
> > >
> > > On Fri 2025-01-31 21:22:13, zhang warden wrote:
> > > >
> > > >
> > > > > On Jan 22, 2025, at 20:50, Petr Mladek <pmladek@xxxxxxxx> wrote:
> > > > >
> > > > > With this patch, any operation which takes the tasklist_lock might
> > > > > break klp_try_complete_transition(). I am afraid that this might
> > > > > block the transition for a long time on huge systems with some
> > > > > specific loads.
> > > > >
> > > > > And the problem is caused by a printk() added just for debugging.
> > > > > I wonder if you even use a slow serial port.
> > > > >
> > > > > You might try to use printk_deferred() instead. Also you might need
> > > > > to disable interrupts around the read_lock()/read_unlock() to
> > > > > make sure that the console handling will be deferred until after
> > > > > the tasklist_lock gets released.
> > > > >
> > > > > Anyway, I am against this patch.
> > > > >
> > > > > Best Regards,
> > > > > Petr
> > > >
> > > > Hi, Petr.
> > > >
> > > > I am unfamiliar with the function `rwlock_is_contended`, but it seems
> > > > that it does not block and only checks the status of the rwlock.
> > > >
> > > > If I understand it right, the problem would arise from the `break`,
> > > > which stops the `for_each_process_thread` iteration, right?
> > >
> > > You got it right. I am afraid that it might create a livelock
> > > situation for the livepatch transition. I mean that the check
> > > might almost always break on systems with thousands of processes
> > > and frequently created/exited processes. It always has
> > > to start from the beginning.
> >
> > It doesn't start from the beginning, as tasks that have already
> > switched over will be skipped.
>
> To make it clear: the next klp_try_complete_transition() will start
> from the beginning, but it should be faster because it will quickly
> skip already migrated processes. Right?

Exactly, that's what I meant.

> It makes some sense. I agree that checking the stack is a relatively
> slow operation.
>
> That said, beware that the full stack is checked only when the process
> is in kernel code: a kthread or a userspace process calling a syscall.
> Other processes should be handled much faster. The ratio of these
> processes depends on the type of the load. And I could imagine that
> even checking the TIF_PATCH_PENDING flag might take a long time when
> there are thousands of processes.
>
> OK, let's take a step from theory back to practice:
>
> You say that this patch helped and worked fine with your
> workload.
>
> It might be the best approach after all. It looks easier than
> the hybrid model. And it might be needed even with the hybrid
> model.
>
> If I get it correctly, the email
> https://lore.kernel.org/all/CALOAHbBZc6ORGzXwBRwe+rD2=YGf1jub5TEr989_GpK54P2o1A@xxxxxxxxxxxxxx/
> says that you saw the hardlockup even with a relatively simple
> livepatch. It modified "only" about 15 functions.

That's correct. We've only modified 15 functions so far.

> My main concern is how to guarantee forward progress. I would like
> to make sure that klp_try_complete_transition() will eventually
> check all processes.
>
> I would modify the check to something like:
>
> read_lock(&tasklist_lock);
>
> timeout = jiffies + HZ;
> processed_pending_processes = 0;
>
> for_each_process_thread(g, task) {
> 	/* check if this task has already switched over */
> 	if (task->patch_state == klp_target_state)
> 		continue;
>
> 	processed_pending_processes++;
>
> 	if (!klp_try_switch_task(task))
> 		complete = false;
>
> 	/*
> 	 * Prevent a hardlockup by not blocking tasklist_lock for too
> 	 * long. But guarantee forward progress by making sure that at
> 	 * least some pending processes were checked.
> 	 */
> 	if (rwlock_is_contended(&tasklist_lock) &&
> 	    time_after(jiffies, timeout) &&
> 	    processed_pending_processes > 100) {
> 		complete = false;
> 		break;
> 	}
> }
>
> read_unlock(&tasklist_lock);

Thanks for your suggestion. I'll deploy this change to our production
servers and verify its effectiveness.
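For reference, a rough sketch of how that bounded loop might look once
typed out with declarations, as a fragment inside
kernel/livepatch/transition.c where klp_target_state and
klp_try_switch_task() are visible. MIN_PENDING_CHECKED is a name
invented for this sketch; the 100-task floor and the one-second budget
come from the pseudocode above:

/* Sketch only: bounded variant of the transition loop. */
#define MIN_PENDING_CHECKED	100	/* illustrative, not an upstream constant */

	struct task_struct *g, *task;
	bool complete = true;
	unsigned long timeout = jiffies + HZ;	/* one-second budget */
	unsigned int pending_checked = 0;

	read_lock(&tasklist_lock);
	for_each_process_thread(g, task) {
		/* Tasks that already switched over are skipped cheaply. */
		if (task->patch_state == klp_target_state)
			continue;

		pending_checked++;

		if (!klp_try_switch_task(task))
			complete = false;

		/*
		 * Back off when a writer is waiting for tasklist_lock,
		 * but only after the time budget expired and a minimum
		 * number of pending tasks was examined, so that every
		 * retry still makes forward progress.
		 */
		if (rwlock_is_contended(&tasklist_lock) &&
		    time_after(jiffies, timeout) &&
		    pending_checked > MIN_PENDING_CHECKED) {
			complete = false;
			break;
		}
	}
	read_unlock(&tasklist_lock);

Note that when the loop breaks out early, complete stays false, so the
function falls through to its existing retry path (the delayed
klp_transition_work). An early break therefore only delays completion
rather than aborting the transition.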
> > Since the task->patch_state is set before the task is added to the
> > task list and the child's patch_state is inherited from the parent, I
> > believe we can remove the tasklist_lock and use RCU instead, as
> > follows:
> >
> > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> > index 30187b1d8275..1d022f983bbc 100644
> > --- a/kernel/livepatch/transition.c
> > +++ b/kernel/livepatch/transition.c
> > @@ -399,11 +399,11 @@ void klp_try_complete_transition(void)
> >  	 * Usually this will transition most (or all) of the tasks on a system
> >  	 * unless the patch includes changes to a very common function.
> >  	 */
> > -	read_lock(&tasklist_lock);
> > +	rcu_read_lock();
> >  	for_each_process_thread(g, task)
> >  		if (!klp_try_switch_task(task))
> >  			complete = false;
> > -	read_unlock(&tasklist_lock);
> > +	rcu_read_unlock();
>
> IMHO, this does not guarantee that we checked all processes in the
> cycle.
>
> I mean:
>
> We already have a problem that tasklist_lock is not enough to
> serialize livepatches modifying do_exit(). It creates a race window
> where the process might still be scheduled but is no longer visible
> in the for_each_process_thread() cycle.
>
> And using rcu_read_lock() will make the race window even bigger.
> I mean:
>
>   + with read_lock(&tasklist_lock), the race window is limited by:
>
>     + read_lock(&tasklist_lock) in klp_try_complete_transition()
>     + write_lock_irq(&tasklist_lock) in the middle of do_exit()
>
>   + with rcu_read_lock(), the race window is unlimited
>
> I mean that more processes might get removed from the list
> while klp_try_complete_transition() is running because they
> are not serialized via the tasklist_lock. As a result, more
> processes might be scheduled without being seen by the
> for_each_process_thread() cycle.
>
> Does it make sense?

That makes sense. Thanks for the detailed explanation.

--
Regards
Yafang