On Tue, Sep 20, 2022 at 12:13:39PM -0700, Paul E. McKenney wrote:
> On Tue, Sep 20, 2022 at 11:46:45AM +0200, Frederic Weisbecker wrote:
> > On Tue, Sep 20, 2022 at 03:26:28PM +0800, Pingfan Liu wrote:
> > > On Fri, Sep 16, 2022 at 03:42:58PM +0200, Frederic Weisbecker wrote:
> > > > Note this is only locking the rdp's node, not the root node.
> > > > Therefore if CPU 0 and CPU 256 are going off at the same time and they
> > > > don't belong to the same node, the above won't protect against concurrent
> > > > TICK_DEP_BIT_RCU set/clear.
> > > >
> > >
> > > Nice, thanks for the careful thoughts. How about moving the counting
> > > place to the root node?
> >
> > You could but then you'd need to lock the root node.
> >
> > > > My suspicion is that we don't need this TICK_DEP_BIT_RCU tick dependency
> > > > anymore. I believe it was there because of issues that were fixed with:
> > > >
> > > > 53e87e3cdc15 (timers/nohz: Last resort update jiffies on nohz_full IRQ entry)
> > > > and:
> > > >
> > > > a1ff03cd6fb9 (tick: Detect and fix jiffies update stall)
> > > >
> > > > It's unfortunately just suspicion because the reason for that tick dependency
> > > > is unclear but I believe it should be safe to remove now.
> > > >
> > >
> > > I have gone through this tick dependency again, but got less.
> > >
> > > I think at least from the RCU's viewpoint, it is useless since
> > > multi_cpu_stop()->rcu_momentary_dyntick_idle() has eliminate the
> > > requirement for tick interrupt.
> >
> > Partly yes.
> >
> > > Is there a way to have a convincing test so that these code can be removed?
> > > Or this code will be got along with?
> >
> > Hmm, Paul might remember which rcutorture scenario would trigger it?
>
> TREE04 on multisocket systems, preferably with faster CPU-hotplug
> operations.
> This can be accomplished by adding this to the kvm.sh
> command line:
>
> rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30
>

Is it OK to run it like this?

  sh kvm.sh --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs TREE04

> It does take some time to run. I did 4,000 hours worth of TREE04
                                       ^^^ Can '--duration=4000h' serve this purpose?
                                           Is it related to the CPU's frequency?
> to confirm lack of bug. But an 80-CPU dual-socket system can run
> 10 concurrent instances of TREE04, which gets things down to a more

So the total wall-clock hours needed would be H = 4000/(system_cpu_num/8)?

> manageable 400 hours. Please let me know if you don't have access
> to a few such systems.
>

I am happy to give it a try if needed. I will try to get a powerful
machine to shrink the test time.

> I will let Frederic identify which commit(s) should be reverted in
> order to test the test.
>

My understanding is that we first remove the tick dependency with:

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 79aea7df4345..cbfc884f04a4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2171,8 +2171,6 @@ int rcutree_dead_cpu(unsigned int cpu)
 	WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus - 1);
 	/* Adjust any no-longer-needed kthreads. */
 	rcu_boost_kthread_setaffinity(rnp, -1);
-	// Stop-machine done, so allow nohz_full to disable tick.
-	tick_dep_clear(TICK_DEP_BIT_RCU);
 	return 0;
 }
@@ -4008,8 +4006,6 @@ int rcutree_online_cpu(unsigned int cpu)
 	sync_sched_exp_online_cleanup(cpu);
 	rcutree_affinity_setting(cpu, -1);
-	// Stop-machine done, so allow nohz_full to disable tick.
-	tick_dep_clear(TICK_DEP_BIT_RCU);
 	return 0;
 }
@@ -4031,8 +4027,6 @@ int rcutree_offline_cpu(unsigned int cpu)
 	rcutree_affinity_setting(cpu, cpu);
-	// nohz_full CPUs need the tick for stop-machine to work quickly
-	tick_dep_set(TICK_DEP_BIT_RCU);
 	return 0;
 }

If TREE04 passes with this change, we then revert the commit(s)
identified by Frederic and test again.
At this time, a TREE04 failure is expected. If the above two results are
observed, TICK_DEP_BIT_RCU can be removed.

Is my understanding right?

Thanks,

	Pingfan
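
P.S. To make the wall-clock estimate H = 4000/(system_cpu_num/8) concrete, here is a rough shell sketch of the arithmetic. It assumes each TREE04 instance occupies 8 CPUs, which is only inferred from the 80-CPU / 10-instance figure quoted above. The helper name wall_hours is hypothetical and is not part of kvm.sh.

```shell
# Hypothetical helper, not part of kvm.sh: estimate wall-clock hours needed
# to accumulate a given number of TREE04 test hours when several instances
# run concurrently on one machine.
wall_hours() {
    total_hours=$1        # total test time wanted, e.g. 4000
    system_cpus=$2        # CPUs on the test machine, e.g. 80
    cpus_per_instance=$3  # assumed 8 (inferred from 80 CPUs / 10 instances)

    # Number of instances that fit on the machine at once.
    instances=$((system_cpus / cpus_per_instance))
    # A machine smaller than one instance still runs a single instance.
    [ "$instances" -ge 1 ] || instances=1

    # Round up so that a partial hour still counts as a full hour.
    echo $(( (total_hours + instances - 1) / instances ))
}

wall_hours 4000 80 8    # prints 400
```

This only models the division of test hours across concurrent instances. It says nothing about whether CPU frequency changes the hours needed to hit the bug.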