On Fri, Feb 7, 2025 at 12:43 AM Petr Mladek <pmladek@xxxxxxxx> wrote:
>
> On Wed 2025-02-05 16:39:11, Yafang Shao wrote:
> > On Fri, Jan 31, 2025 at 9:39 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
> > >
> > > On Fri 2025-01-31 21:22:13, zhang warden wrote:
> > > >
> > > >
> > > > > On Jan 22, 2025, at 20:50, Petr Mladek <pmladek@xxxxxxxx> wrote:
> > > > >
> > > > > With this patch, any operation which takes the tasklist_lock might
> > > > > break klp_try_complete_transition(). I am afraid that this might
> > > > > block the transition for a long time on huge systems with some
> > > > > specific loads.
> > > > >
> > > > > And the problem is caused by a printk() added just for debugging.
> > > > > I wonder if you even use a slow serial port.
> > > > >
> > > > > You might try to use printk_deferred() instead. Also you might need
> > > > > to disable interrupts around the read_lock()/read_unlock() to
> > > > > make sure that the console handling will be deferred until after
> > > > > the tasklist_lock gets released.
> > > > >
> > > > > Anyway, I am against this patch.
> > > > >
> > > > > Best Regards,
> > > > > Petr
> > > >
> > > > Hi, Petr.
> > > >
> > > > I am unfamiliar with the function `rwlock_is_contended`, but it seems
> > > > that it does not block and only checks the status of the rwlock.
> > > >
> > > > If I understand it right, the problem would arise from the `break`,
> > > > which stops the `for_each_process_thread` iteration, right?
> > >
> > > You got it right. I am afraid that it might create a livelock
> > > situation for the livepatch transition. I mean that the check
> > > might almost always break on systems with thousands of processes
> > > and frequently created/exited processes. It always has
> > > to start from the beginning.
> >
> > It doesn't start from the beginning, as tasks that have already
> > switched over will be skipped.
>
> To make it clear: the next klp_try_complete_transition() will start
> from the beginning, but it should be faster because it will quickly
> skip already migrated processes. Right?

Exactly, that's what I meant.

> It makes some sense. I agree that checking the stack is a relatively
> slow operation.
>
> That said, beware that the full stack is checked only when the process
> is in kernel code: a kthread or a userspace process calling a syscall.
> Other processes should be handled much faster. The ratio of these
> processes depends on the type of the load. And I could imagine that
> even checking the TIF_PATCH_PENDING flag might take a long time when
> there are thousands of processes.
>
> OK, let's take a step from theory back to practice:
>
> You say that this patch helped and worked fine with your
> workload.
>
> It might be the best approach after all. It looks easier than
> the hybrid model. And it might be needed even with the hybrid
> model.
>
> If I get it correctly, the email
> https://lore.kernel.org/all/CALOAHbBZc6ORGzXwBRwe+rD2=YGf1jub5TEr989_GpK54P2o1A@xxxxxxxxxxxxxx/
> says that you saw the hardlockup even with a relatively simple
> livepatch. It modified "only" about 15 functions.

That's correct. We've only modified 15 functions so far.

> My main concern is how to guarantee forward progress. I would like
> to make sure that klp_try_complete_transition() will eventually
> check all processes.
>
> I would modify the check to something like:
>
> read_lock(&tasklist_lock);
>
> timeout = jiffies + HZ;
> processed_pending_processes = 0;
>
> for_each_process_thread(g, task) {
> 	/* check if this task has already switched over */
> 	if (task->patch_state == klp_target_state)
> 		continue;
>
> 	processed_pending_processes++;
>
> 	if (!klp_try_switch_task(task))
> 		complete = false;
>
> 	/*
> 	 * Prevent a hardlockup by not blocking tasklist_lock for too
> 	 * long. But guarantee forward progress by making sure that at
> 	 * least some pending processes were checked.
> 	 */
> 	if (rwlock_is_contended(&tasklist_lock) &&
> 	    time_after(jiffies, timeout) &&
> 	    processed_pending_processes > 100) {
> 		complete = false;
> 		break;
> 	}
> }
>
> read_unlock(&tasklist_lock);

Thanks for your suggestion. I'll deploy this change to our production
servers and verify its effectiveness.
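For reference, a rough sketch of how that bounded loop might look once
typed out with declarations, as a fragment inside
kernel/livepatch/transition.c where klp_target_state and
klp_try_switch_task() are visible. MIN_PENDING_CHECKED is a name
invented for this sketch; the 100-task floor and the one-second budget
come from the pseudocode above:

/* Sketch only: bounded variant of the transition loop. */
#define MIN_PENDING_CHECKED	100	/* illustrative, not an upstream constant */

	struct task_struct *g, *task;
	bool complete = true;
	unsigned long timeout = jiffies + HZ;	/* one-second budget */
	unsigned int pending_checked = 0;

	read_lock(&tasklist_lock);
	for_each_process_thread(g, task) {
		/* Tasks that already switched over are skipped cheaply. */
		if (task->patch_state == klp_target_state)
			continue;

		pending_checked++;

		if (!klp_try_switch_task(task))
			complete = false;

		/*
		 * Back off when a writer is waiting for tasklist_lock,
		 * but only after the time budget expired and a minimum
		 * number of pending tasks was examined, so that every
		 * retry still makes forward progress.
		 */
		if (rwlock_is_contended(&tasklist_lock) &&
		    time_after(jiffies, timeout) &&
		    pending_checked > MIN_PENDING_CHECKED) {
			complete = false;
			break;
		}
	}
	read_unlock(&tasklist_lock);

Note that when the loop breaks out early, complete stays false, so the
function falls through to its existing retry path (the delayed
klp_transition_work). An early break therefore only delays completion
rather than aborting the transition.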
> > Since the task->patch_state is set before the task is added to the
> > task list and the child's patch_state is inherited from the parent, I
> > believe we can remove the tasklist_lock and use RCU instead, as
> > follows:
> >
> > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> > index 30187b1d8275..1d022f983bbc 100644
> > --- a/kernel/livepatch/transition.c
> > +++ b/kernel/livepatch/transition.c
> > @@ -399,11 +399,11 @@ void klp_try_complete_transition(void)
> >  	 * Usually this will transition most (or all) of the tasks on a system
> >  	 * unless the patch includes changes to a very common function.
> >  	 */
> > -	read_lock(&tasklist_lock);
> > +	rcu_read_lock();
> >  	for_each_process_thread(g, task)
> >  		if (!klp_try_switch_task(task))
> >  			complete = false;
> > -	read_unlock(&tasklist_lock);
> > +	rcu_read_unlock();
>
> IMHO, this does not guarantee that we checked all processes in the
> cycle.
>
> I mean:
>
> We already have a problem that tasklist_lock is not enough to
> serialize livepatches modifying do_exit(). It creates a race window
> where the process might still be scheduled but is no longer visible
> in the for_each_process_thread() cycle.
>
> And using rcu_read_lock() will make the race window even bigger.
> I mean:
>
>   + with read_lock(&tasklist_lock), the race window is limited by:
>
>     + read_lock(&tasklist_lock) in klp_try_complete_transition()
>     + write_lock_irq(&tasklist_lock) in the middle of do_exit()
>
>   + with rcu_read_lock(), the race window is unlimited
>
> I mean that more processes might get removed from the list
> while klp_try_complete_transition() is running because they
> are not serialized via the tasklist_lock. As a result, more
> processes might be scheduled without being seen by the
> for_each_process_thread() cycle.
>
> Does it make sense?

That makes sense. Thanks for the detailed explanation.

--
Regards
Yafang