On Wed 2025-02-05 16:39:11, Yafang Shao wrote:
> On Fri, Jan 31, 2025 at 9:39 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
> >
> > On Fri 2025-01-31 21:22:13, zhang warden wrote:
> > >
> > >
> > > > On Jan 22, 2025, at 20:50, Petr Mladek <pmladek@xxxxxxxx> wrote:
> > > >
> > > > With this patch, any operation which takes the tasklist_lock might
> > > > break klp_try_complete_transition(). I am afraid that this might
> > > > block the transition for a long time on huge systems with some
> > > > specific loads.
> > > >
> > > > And the problem is caused by a printk() added just for debugging.
> > > > I wonder if you even use a slow serial port.
> > > >
> > > > You might try to use printk_deferred() instead. Also you might need
> > > > to disable interrupts around the read_lock()/read_unlock() to
> > > > make sure that the console handling will be deferred until after
> > > > the tasklist_lock gets released.
> > > >
> > > > Anyway, I am against this patch.
> > > >
> > > > Best Regards,
> > > > Petr
> > >
> > > Hi, Petr.
> > >
> > > I am unfamiliar with the function `rwlock_is_contended`, but it seems
> > > that this function does not block and only checks the status of the
> > > rwlock.
> > >
> > > If I understand it right, the problem would arise from the `break`,
> > > which stops the `for_each_process_thread` iteration, right?
> >
> > You got it right. I am afraid that it might create a livelock
> > situation for the livepatch transition. I mean that the check
> > might almost always break on systems with thousands of processes
> > and frequently created/exited processes. It always has
> > to start from the beginning.
>
> It doesn't start from the beginning, as tasks that have already
> switched over will be skipped.

To make it clear: the next klp_try_complete_transition() will start
from the beginning, but it should be faster because it will quickly
skip already migrated processes. Right?

It makes some sense. I agree that checking the stack is a relatively
slow operation.
That said, beware that the full stack is checked only when the process
is in kernel code: a kthread, or a userspace process calling a syscall.
Other processes should be handled much faster. The ratio of these
processes depends on the type of the load. And I could imagine that
even checking TIF_PATCH_PENDING might take a long time when there are
thousands of processes.

OK, let's make a step from theory back to practice:

You say that this patch helped and worked fine with your workload. It
might be the best approach after all. It looks easier than the hybrid
model. And it might be needed even with the hybrid model.

If I get it correctly, the email
https://lore.kernel.org/all/CALOAHbBZc6ORGzXwBRwe+rD2=YGf1jub5TEr989_GpK54P2o1A@xxxxxxxxxxxxxx/
says that you saw the hardlockup even with a relatively simple
livepatch. It modified "only" about 15 functions.

My main concern is how to guarantee forward progress. I would like to
make sure that klp_try_complete_transition() will eventually check all
processes. I would modify the check to something like:

	read_lock(&tasklist_lock);
	timeout = jiffies + HZ;
	processed_pending_processes = 0;
	for_each_process_thread(g, task) {
		/* check if this task has already switched over */
		if (task->patch_state == klp_target_state)
			continue;

		processed_pending_processes++;

		if (!klp_try_switch_task(task))
			complete = false;

		/*
		 * Prevent a hardlockup by not blocking tasklist_lock for
		 * too long. But guarantee forward progress by making sure
		 * that at least some pending processes were checked.
		 */
		if (rwlock_is_contended(&tasklist_lock) &&
		    time_after(jiffies, timeout) &&
		    processed_pending_processes > 100) {
			complete = false;
			break;
		}
	}
	read_unlock(&tasklist_lock);

> Since the task->patch_state is set before the task is added to the
> task list and the child's patch_state is inherited from the parent, I
> believe we can remove the tasklist_lock and use RCU instead, as
> follows:
>
> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> index 30187b1d8275..1d022f983bbc 100644
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -399,11 +399,11 @@ void klp_try_complete_transition(void)
>          * Usually this will transition most (or all) of the tasks on a system
>          * unless the patch includes changes to a very common function.
>          */
> -       read_lock(&tasklist_lock);
> +       rcu_read_lock();
>         for_each_process_thread(g, task)
>                 if (!klp_try_switch_task(task))
>                         complete = false;
> -       read_unlock(&tasklist_lock);
> +       rcu_read_unlock();

IMHO, this does not guarantee that we checked all processes in the
cycle. I mean:

We already have a problem that tasklist_lock is not enough to
serialize livepatches modifying do_exit(). It creates a race window
when the process still might be scheduled but is no longer visible
in the for_each_process_thread() cycle.

And using rcu_read_lock() will make the race window even bigger.
I mean:

  + with read_lock(&tasklist_lock), the race window is limited by:

    + read_lock(&tasklist_lock) in klp_try_complete_transition()
    + write_lock_irq(&tasklist_lock) in the middle of do_exit()

  + with rcu_read_lock(), the race window is unlimited

I mean that more processes might get removed from the list while
klp_try_complete_transition() is running when they are not serialized
via the tasklist_lock. As a result, more processes might be scheduled
without being seen by the for_each_process_thread() cycle.

Does it make sense?

Best Regards,
Petr