Re: Find root of the stall: was: Re: [PATCH 2/3] livepatch: Avoid blocking tasklist_lock too long

Josh Poimboeuf <jpoimboe@xxxxxxxxxx> · Fri, 14 Feb 2025 00:36:03 -0800

On Fri, Feb 14, 2025 at 10:44:59AM +0800, Yafang Shao wrote:
> The longest duration of klp_try_complete_transition() ranges from 8.5
> to 17.2 seconds.
> 
> It appears that the RCU stall is not only driven by num_processes *
> average_klp_try_switch_task, but also by contention within
> klp_try_complete_transition(), particularly around the tasklist_lock.
> Interestingly, even after replacing "read_lock(&tasklist_lock)" with
> "rcu_read_lock()", the RCU stall persists. My verification shows that
> the only way to prevent the stall is by checking need_resched() during
> each iteration of the loop.

I'm confused... rcu_read_lock() shouldn't cause any contention, right?
So if klp_try_switch_task() isn't the problem, then what is?

I wonder if those function timings might be misleading.  If
klp_try_complete_transition() gets preempted immediately when it
releases the lock, it could take a while before it eventually returns,
and funclatency measures entry-to-return wall time, so it would count
that off-CPU time as well.  So those funclatency numbers might not be
telling the whole story.

Though 8.5 - 17.2 seconds is a bit excessive...
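
To illustrate the timing concern, here is a rough user-space sketch (not
kernel code).  It times one region with both a wall clock and the
per-thread CPU clock; a nanosleep() stands in for the task being
scheduled out before it returns.  The names now_ns(), struct timing and
time_with_offcpu_gap() are made up for this example, and the sleep is
only an assumed stand-in for preemption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read the given clock and return its value in nanoseconds. */
static int64_t now_ns(clockid_t clk)
{
	struct timespec ts;
	int rc = clock_gettime(clk, &ts);

	assert(rc == 0);
	(void)rc;
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Entry-to-return wall time vs. time actually spent on a CPU. */
struct timing {
	int64_t wall_ns;
	int64_t cpu_ns;
};

/*
 * Hypothetical stand-in for a function that does some work and then
 * gets scheduled out before it returns.  The sleep models the off-CPU
 * gap; it is not how preemption actually happens in the kernel.
 */
static struct timing time_with_offcpu_gap(long spin_iters, long offcpu_ms)
{
	volatile unsigned long sink = 0;
	struct timespec gap = {
		.tv_sec = offcpu_ms / 1000,
		.tv_nsec = (offcpu_ms % 1000) * 1000000L,
	};
	struct timing t;
	int64_t wall0 = now_ns(CLOCK_MONOTONIC);
	int64_t cpu0 = now_ns(CLOCK_THREAD_CPUTIME_ID);
	long i;

	/* The "real work": burns CPU, counted by both clocks. */
	for (i = 0; i < spin_iters; i++)
		sink += (unsigned long)i;

	/* Off-CPU gap: counted by the wall clock only. */
	nanosleep(&gap, NULL);

	t.wall_ns = now_ns(CLOCK_MONOTONIC) - wall0;
	t.cpu_ns = now_ns(CLOCK_THREAD_CPUTIME_ID) - cpu0;
	return t;
}
```

Calling time_with_offcpu_gap(1000000, 200) should report a wall time
that exceeds the CPU time by roughly the length of the sleep.  An
entry-to-return latency histogram only sees the wall-time side of that.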

-- 
Josh