Re: [PATCH 2/3] livepatch: Avoid blocking tasklist_lock too long

Yafang Shao <laoar.shao@xxxxxxxxx> · Tue, 18 Feb 2025 10:37:04 +0800

On Sat, Feb 15, 2025 at 2:12 AM Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>
> On Fri, Feb 14, 2025 at 03:44:10PM +0100, Petr Mladek wrote:
> > I guess that we really could consider the new task as migrated
> > and clear TIF_PATCH_PENDING.
> >
> > But we can't set child->patch_state to KLP_TRANSITION_IDLE. It won't
> > work when the transition gets reverted. [*]
>
> Hm, why not?  I don't see any difference between patching or unpatching?
>
> klp_init_transition() has a barrier to enforce the order of the
> klp_target_state and func->transition writes, as read by the
> klp_ftrace_handler().
>
> So in the ftrace handler, if func->transition is set and the task is
> KLP_TRANSITION_IDLE, it can use klp_target_state to decide which
> function to use.  Its patch state would effectively be the same as any
> other already-transitioned task, whether it's patching or unpatching.
>
> Then in klp_complete_transition(), after func->transition gets set to
> false, klp_synchronize_transition() flushes out any running ftrace
> handlers.  From that point on, func->transition is false, so the ftrace
> handler would no longer read klp_target_state.
>
> > [*] I gave this few brain cycles but I did not find any elegant
> >     way how to set this a safe way and allow using rcu_read_lock()
> >     in klp_try_complete_transition().
> >
> >     It might be because it is Friday evening and I am leaving for
> >     a trip tomorrow. Also I not motivated enough to think about it
> >     because Yafang saw the RCU stall even with that rcu_read_lock().
> >     So I send this just for record.
>
> Even if it doesn't fix the RCU stalls, I think we should still try to
> avoid holding the tasklist_lock.  It's a global lock which can be
> contended, and we want the livepatch transition to be as unobtrusive as
> possible.

I agree. Since it's feasible to eliminate this global lock, I think
it's worth pursuing this optimization.

>
> If the system is doing a lot of forking across many CPUs, holding the
> lock could block all the forking tasks and trigger system-wide
> scheduling latencies.  And that could be compounded by the unnecessary
> transitioning of new tasks every time the delayed work runs.

--
Regards

Yafang