Re: Wakes of the rcuc/ thread on isolated CPUs.

Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> · Fri, 5 Jul 2024 20:10:35 +0200

On 2024-07-05 10:39:25 [-0700], Paul E. McKenney wrote:
> As a workaround, the following commit in -rcu that is slated for
> the upcoming merge window addresses a similar case involving KVM and
> nohz_full:
> 
> 68d124b09999 ("rcu: Add rcutree.nohz_full_patience_delay to reduce
> nohz_full OS jitter")
> 
> The KVM guys found that setting rcutree.nohz_full_patience_delay to 1000
> (AKA one second) made things work better for them.  Does this help your
> use case?

My problem is that I have a task stuck in percpu_down_write()/
__wait_rcu_gp() and I think this is because the RCU machinery is stuck
and there is no grace period.
I have seen an rcuc/ thread with a wakeup, but it won't be scheduled
because its priority is lower than that of the thread currently on the
CPU, and that thread uses the CPU at 100%.
I *think* this explains it because the rcuc moves the grace period
forward.
Looking at the patch, there would be a delay of up to 5 secs, which would
mean that if the task consumes 100% of the CPU, it doesn't change a
thing.
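
To spell out the scheduling argument: under SCHED_FIFO a woken thread only
gets the CPU if its priority is strictly higher than that of the thread
currently running. A minimal sketch in Python; the helper names
(can_preempt, rt_info) and any priority values passed in are made up for
illustration and are not taken from the trace:

```python
import os


def can_preempt(woken_rt_prio, running_rt_prio):
    """Model of the SCHED_FIFO rule: a woken thread preempts the running
    one only if its RT priority is strictly higher. With equal priority
    it waits until the running thread blocks or yields."""
    return woken_rt_prio > running_rt_prio


def rt_info(pid):
    """Return (policy, rt_priority) for a PID (0 means the caller).
    Linux only; uses the sched_getscheduler/sched_getparam wrappers."""
    return os.sched_getscheduler(pid), os.sched_getparam(pid).sched_priority
```

With a hypothetical rcuc/ thread at priority 1 and a CPU hog at priority
50, can_preempt(1, 50) is False, so the wakeup stays pending for as long as
the hog keeps running.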

Thank you Paul for the pointers.

> This is again a workaround.  Clearly, it would be better if we could
> eliminate that second rcuc wakeup.  I tried something similar some time
> back, and there was a problem with it.  I will see if I can reconstitute
> the corresponding brain cells.

Is my assumption correct that the rcuc/ thread has to run in order to push
the grace period forward, and that otherwise the whole thing is stuck?

> But in the meantime, one advantage of the workaround is that in the
> common case, it would reduce the number of rcuc wakeups to zero, rather
> than to just one.
> 
> Thoughts?

I *think* if what I just wrote is correct, I will either have to raise
the priority of rcuc/ or make the thread that consumes 100% of the CPU
lose its RT priority. Then with the limited number of wakeups it should
be doable.
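
A rough sketch of the first option, raising the priority of the rcuc/
threads from user space. This is only an illustration: the helper names
(find_rcuc_threads, raise_fifo_priority) are made up, and the priority to
use is an assumption that has to be chosen above the priority of the
CPU-bound thread.

```python
import os


def find_rcuc_threads(proc_root="/proc"):
    """Return {pid: comm} for threads whose name starts with 'rcuc/'."""
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as 'self' or 'sys'
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # thread went away while scanning
        if comm.startswith("rcuc/"):
            found[int(entry)] = comm
    return found


def raise_fifo_priority(pid, prio):
    """Switch a thread to SCHED_FIFO at the given priority.
    Needs CAP_SYS_NICE (typically root); raises OSError otherwise."""
    os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(prio))


if __name__ == "__main__":
    TARGET_PRIO = 51  # hypothetical: one above an assumed prio-50 CPU hog
    for pid, comm in sorted(find_rcuc_threads().items()):
        print(f"{comm} (pid {pid}) -> SCHED_FIFO {TARGET_PRIO}")
        raise_fifo_priority(pid, TARGET_PRIO)
```

The same effect should be achievable at boot time with the
rcutree.kthread_prio= parameter, which sets the SCHED_FIFO priority of the
RCU kthreads.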

PS: I do remember the RCU-task thread we had. I did have an idea but I
need to check if this is feasible first. So I did not forget, just slow…

> 							Thanx, Paul

Sebastian