On Fri, Sep 15, 2023 at 12:57 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> [...]
> > > > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > > > This is to fix a live lock between RT push and CPU hotplug's
> > > > select_fallback_rq()-induced push. I am not sure if the fix works but I have
> > > > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > > > the real fix is needed to prevent these issues even if we're able to hide it
> > > > by halving the total rcutorture boost threads.
> > >
> > > So that fixed it without any changes to RCU. Below is the updated patch also
> > > for the archives. Though I'm rewriting it slightly differently and testing
> > > that more. The main thing I am doing in the new patch is I find that RT
> > > should not select !cpu_active() CPUs since those have the scheduler turned
> > > off. Though checking for cpu_dying() also works. I could not find any
> > > instance where cpu_dying() != cpu_active() but there could be a tiny window
> > > where that is true. Anyway, I'll make some noise with scheduler folks once I
> > > have the new version of the patch tested.
> > >
> > > Also halving the number of RT boost threads makes it less likely to occur but
> > > does not work. Not too surprising since the issue actually may not be related
> > > to too many RT threads but rather a lockup between hotplug and RT..
> >
> > Again, looks promising! When I get the non-RCU -rcu stuff moved to
> > v6.6-rc1 and appropriately branched and tested, I will give it a go on
> > the test setup here.
>
> Thanks a lot, and I have enclosed a simpler updated patch below which also
> similarly shows very good results. This is the one I would like to test
> more and send to scheduler folks. I'll send it out once I have it tested more
> and also possibly after seeing your results (I am on vacation next week so
> there's time).

Much nicer! This is just on current mainline, correct?

Yes, correct. I also applied it cleanly to all stable kernels for my test rigs.
Only 5.10 had a little merge conflict but it was trivially fixed.

thanks,

 - Joel