Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Paul,
Thanks! I merged replies to 3 threads into this one to make it easier to follow.

On Fri, Sep 15, 2023 at 07:53:15AM -0700, Paul E. McKenney wrote:
> On Fri, Sep 15, 2023 at 11:33:13AM +0000, Joel Fernandes wrote:
> > On Fri, Sep 15, 2023 at 12:13:31AM +0000, Joel Fernandes wrote:
[...]
> > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > This is to fix a live lock between RT push and CPU hotplug's
> > select_fallback_rq()-induced push. I am not sure if the fix works but I have
> > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > the real fix is needed to prevent these issues even if we're able to hide it
> > by halving the total rcutorture boost threads.
> 
> This don't-schedule-on-dying CPUs approach does quite look promising
> to me!
> 
> Then again, I cannot claim to be a scheduler expert.  And I am a bit
> surprised that this does not already happen.  Which makes me wonder
> (admittedly without evidence either way) whether there is some CPU-hotplug
> race that it might induce.  But then again, figuring this sort of thing
> out is what part of the scheduler guys are there for, right?  ;-)

Yes it looks promising. Actually this sort of thing also seems to be already
done in CFS, it just not there in RT. So maybe it is OK. Testing so far
showed me pretty good results even with hotplug testing. Here is hoping!

> > > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > > This is to fix a live lock between RT push and CPU hotplug's
> > > select_fallback_rq()-induced push. I am not sure if the fix works but I have
> > > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > > the real fix is needed to prevent these issues even if we're able to hide it
> > > by halving the total rcutorture boost threads.
> > 
> > So that fixed it without any changes to RCU. Below is the updated patch also
> > for the archives. Though I'm rewriting it slightly differently and testing
> > that more. The main thing I am doing in the new patch is I find that RT
> > should not select !cpu_active() CPUs since those have the scheduler turned
> > off. Though checking for cpu_dying() also works. I could not find any
> > instance where cpu_dying() != cpu_active() but there could be a tiny window
> > where that is true. Anyway, I'll make some noise with scheduler folks once I
> > have the new version of the patch tested.
> > 
> > Also halving the number of RT boost threads makes it less likely to occur but
> > does not work. Not too surprising since the issue actually may not be related
> > to too many RT threads but rather a lockup between hotplug and RT..
> 
> Again, looks promising!  When I get the non-RCU -rcu stuff moved to
> v6.6-rc1 and appropriately branched and tested, I will give it a go on
> the test setup here.

Thanks a lot, and I have enclosed a simpler updated patch below which also
similarly shows very good results. This is the one I would like to test
more and send to scheduler folks. I'll send it out once I have it tested more
and also possibly after seeing your results (I am on vacation next week so
there's time).

> > We could run them on just the odd, or even ones and still be able to get
> > sufficient boost testing. This may be especially important without RT
> > throttling. I'll go ahead and queue a test like that.
> > 
> > Thoughts?
> 
> The problem with this is that it will often render RCU priority boosting
> unnecessary.  Any kthread preempted within an RCU read-side critical
> section will with high probability quickly be resumed on one of the
> even-numbered CPUs.
> 
> Or were you also planning to bind the rcu_torture_reader() kthreads to
> a specific CPU, preventing such migration?  Or am I missing something
> here?

You are right, I see now why you were running them on all CPUs. One note
though, af the RCU reader threads are CFS, they will not be immediately
pushed out so maybe it may work (unlike if the RCU reader threads being
preempted were RT). However testing shows thread-halving does not fix the
issue anyway so we can scratch that. Instead, below is the updated patch for
don't schedule-on-dying/inactive CPUs approach which is showing really good
results!

And I'll most likely see you the week after, after entering into a 4-day
quiescent state. ;-)

thanks,

 - Joel

---8<-----------------------

From: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
Subject: [PATCH] RT: Alternative fix for livelock with hotplug

Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
---
 kernel/sched/cpupri.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index a286e726eb4b..42c40cfdf836 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -101,6 +101,7 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
 
 	if (lowest_mask) {
 		cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
+		cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
 
 		/*
 		 * We have to ensure that we have at least one bit
-- 
2.42.0.459.ge4e396fd5e-goog




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux