Re: [PATCHv2 3/3] rcu: coordinate tick dependency during concurrent offlining

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Sep 20, 2022 at 12:13:39PM -0700, Paul E. McKenney wrote:
> On Tue, Sep 20, 2022 at 11:46:45AM +0200, Frederic Weisbecker wrote:
> > On Tue, Sep 20, 2022 at 03:26:28PM +0800, Pingfan Liu wrote:
> > > On Fri, Sep 16, 2022 at 03:42:58PM +0200, Frederic Weisbecker wrote:
> > > > Note this is only locking the rdp's node, not the root node.
> > > > Therefore if CPU 0 and CPU 256 are going off at the same time and they
> > > > don't belong to the same node, the above won't protect against concurrent
> > > > TICK_DEP_BIT_RCU set/clear.
> > > > 
> > > 
> > > Nice, thanks for the careful thoughts. How about moving the counting
> > > place to the root node?
> > 
> > You could but then you'd need to lock the root node.
> > 
> > > > My suspicion is that we don't need this TICK_DEP_BIT_RCU tick dependency
> > > > anymore. I believe it was there because of issues that were fixed with:
> > > > 
> > > > 	53e87e3cdc15 (timers/nohz: Last resort update jiffies on nohz_full IRQ entry)
> > > > and:
> > > > 
> > > > 	a1ff03cd6fb9 (tick: Detect and fix jiffies update stall)
> > > > 
> > > > It's unfortunately just suspicion because the reason for that tick dependency
> > > > is unclear but I believe it should be safe to remove now.
> > > > 
> > > 
> > > I have gone through this tick dependency again, but got less.
> > > 
> > > I think at least from the RCU's viewpoint, it is useless since
> > > multi_cpu_stop()->rcu_momentary_dyntick_idle() has eliminate the
> > > requirement for tick interrupt.
> > 
> > Partly yes.
> > 
> > > Is there a way to have a convincing test so that these code can be removed?
> > > Or this code will be got along with?
> > 
> > Hmm, Paul might remember which rcutorture scenario would trigger it?
> 
> TREE04 on multisocket systems, preferably with faster CPU-hotplug
> operations.  This can be accomplished by adding this to the kvm.sh
> command line:
> 
> 	rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30
> 

Is it ok with "sh kvm.sh  --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs TREE04"

> It does take some time to run.  I did 4,000 hours worth of TREE04
                                        ^^^ '--duration=4000h' can serve this purpose?
Is it related with the cpu's freq?

> to confirm lack of bug.  But an 80-CPU dual-socket system can run
> 10 concurrent instances of TREE04, which gets things down to a more

The total demanded hours H = 4000/(system_cpu_num/8)?

> manageable 400 hours.  Please let me know if you don't have access
> to a few such systems.
>

I am happy to have a try if needed. I will try to get a powerful
machine, which can shrink the test time.

> I will let Frederic identify which commit(s) should be reverted in
> order to test the test.
> 

My understanding is after removing the tick dep by
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 79aea7df4345..cbfc884f04a4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2171,8 +2171,6 @@ int rcutree_dead_cpu(unsigned int cpu)
        WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus - 1);
        /* Adjust any no-longer-needed kthreads. */
        rcu_boost_kthread_setaffinity(rnp, -1);
-       // Stop-machine done, so allow nohz_full to disable tick.
-       tick_dep_clear(TICK_DEP_BIT_RCU);
        return 0;
 }

@@ -4008,8 +4006,6 @@ int rcutree_online_cpu(unsigned int cpu)
        sync_sched_exp_online_cleanup(cpu);
        rcutree_affinity_setting(cpu, -1);

-       // Stop-machine done, so allow nohz_full to disable tick.
-       tick_dep_clear(TICK_DEP_BIT_RCU);
        return 0;
 }

@@ -4031,8 +4027,6 @@ int rcutree_offline_cpu(unsigned int cpu)

        rcutree_affinity_setting(cpu, cpu);

-       // nohz_full CPUs need the tick for stop-machine to work quickly
-       tick_dep_set(TICK_DEP_BIT_RCU);
        return 0;
 }

If the TREE04 can success, then move on to revert the commit(s)
identified by Frederic, and do test again.

At this time, a TREE04 failure is expected.

If the above two results are observed, TICK_DEP_BIT_RCU can be
removed.

Is my understanding right?


Thanks,

	Pingfan



[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux