Re: Find root of the stall: was: Re: [PATCH 2/3] livepatch: Avoid blocking tasklist_lock too long

Petr Mladek <pmladek@xxxxxxxx> · Fri, 14 Feb 2025 12:37:13 +0100

On Fri 2025-02-14 00:36:03, Josh Poimboeuf wrote:
> On Fri, Feb 14, 2025 at 10:44:59AM +0800, Yafang Shao wrote:
> > The longest duration of klp_try_complete_transition() ranges from 8.5
> > to 17.2 seconds.
> > 
> > It appears that the RCU stall is not only driven by num_processes *
> > average_klp_try_switch_task, but also by contention within
> > klp_try_complete_transition(), particularly around the tasklist_lock.
> > Interestingly, even after replacing "read_lock(&tasklist_lock)" with
> > "rcu_read_lock()", the RCU stall persists. My verification shows that
> > the only way to prevent the stall is by checking need_resched() during
> > each iteration of the loop.
> 
> I'm confused... rcu_read_lock() shouldn't cause any contention, right?
> So if klp_try_switch_task() isn't the problem, then what is?

I agree that it does not make much sense.

> I wonder if those function timings might be misleading.  If
> klp_try_complete_transition() gets preempted immediately when it
> releases the lock, it could take a while before it eventually returns.
> So that funclatency might not be telling the whole story.

The scheduling might be an explanation.
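
To illustrate why a wall-clock latency number can mislead here, below
is a minimal userspace sketch (not kernel code). It assumes the tool
reports the elapsed time between function entry and return. All names
in it (now_ns, struct timing, time_descheduled_call) are hypothetical,
and the sleep only stands in for a task that is scheduled out; it does
not model klp_try_complete_transition() itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Read the given clock and return its value in nanoseconds. */
static long long now_ns(clockid_t clk)
{
	struct timespec ts;
	int ret = clock_gettime(clk, &ts);

	assert(ret == 0);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Elapsed wall-clock time vs. CPU time consumed by this thread. */
struct timing {
	long long wall_ns;
	long long cpu_ns;
};

/*
 * Hypothetical stand-in for a function that loses the CPU part-way
 * through: nanosleep() simulates the time spent scheduled out.
 */
static struct timing time_descheduled_call(long sleep_ms)
{
	struct timespec req = {
		.tv_sec = sleep_ms / 1000,
		.tv_nsec = (sleep_ms % 1000) * 1000000L,
	};
	struct timing t;
	long long wall0 = now_ns(CLOCK_MONOTONIC);
	long long cpu0 = now_ns(CLOCK_THREAD_CPUTIME_ID);

	nanosleep(&req, NULL);	/* off-CPU: counts as wall time only */

	t.wall_ns = now_ns(CLOCK_MONOTONIC) - wall0;
	t.cpu_ns = now_ns(CLOCK_THREAD_CPUTIME_ID) - cpu0;
	return t;
}
```

Calling time_descheduled_call(200) should report roughly 200 ms of
wall time but only a tiny amount of CPU time. An entry-to-return
measurement sees the first number, so a long duration alone does not
prove that the function itself was doing work for that long.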

> Though 8.5 - 17.2 seconds is a bit excessive...

If klp_try_complete_transition() was scheduled out and we see such
a delay, then the system was likely under pretty high load at that
moment. Is that possible?

Yafang, just to be sure. Have you seen these numbers with
the original klp_try_complete_transition() code and with debug
messages disabled?

Or did you see them with some extra debugging code or other
modifications?

Also just to be sure. Is this on bare metal?

Finally, what preemption mode are you using? Which CONFIG_PREEMPT*?

Best regards,
Petr

PS: JFYI, I am on vacation next week and won't have
    access to email...