Re: [PATCHv2 3/3] rcu: coordinate tick dependency during concurrent offlining

Sorry for the late reply. I just realized that this e-mail was missing from my Gmail inbox.

On Thu, Sep 22, 2022 at 06:54:42AM -0700, Paul E. McKenney wrote:
[...]
> 
> If you have tools/.../rcutorture/bin on your path, yes.  This would default
> to a 30-minute run.  If you have at least 16 CPUs, you should add
                                            ^^^ TREE04 has CONFIG_NR_CPUS=8, so I think the number here should be 8

> "--allcpus" to do concurrent runs.  For example, given 64 CPUs you could
> do this:
> 
> tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 10h --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "4*TREE04"
> 

I have been trying to find a two-socket system with 128 CPUs on which to run
  sh kvm.sh --allcpus --duration 250h --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "16*TREE04"

where 250 * 16 = 4000 hours of total test time.
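
As a sanity check of the arithmetic in this thread, here is a minimal Python sketch. It assumes each TREE04 instance occupies 8 CPUs (CONFIG_NR_CPUS=8) and that 4,000 instance-hours are wanted in total. The function name and its parameters are illustrative only and are not part of the rcutorture scripts.

```python
def rcutorture_plan(total_hours, cpus_per_system, systems=1, cpus_per_instance=8):
    """Return (concurrent instances, wall-clock hours per instance).

    Hypothetical helper: it only mirrors H = total / (cpus / 8) from the
    thread and does not invoke kvm.sh or kvm-remote.sh.
    """
    instances = systems * (cpus_per_system // cpus_per_instance)
    if instances == 0:
        raise ValueError("not enough CPUs for a single instance")
    return instances, total_hours / instances


if __name__ == "__main__":
    # One 128-CPU system: the "16*TREE04" with "--duration 250h" case.
    print(rcutorture_plan(4000, 128))
    # One 80-CPU dual-socket system: 10 instances for 400 hours.
    print(rcutorture_plan(4000, 80))
    # Twenty 80-CPU systems via kvm-remote.sh: "200*TREE04" for 20 hours.
    print(rcutorture_plan(4000, 80, systems=20))
```

The instance count is what would go in front of "*TREE04" in --configs, and the hours are what would go after --duration.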


> This would run four concurrent instances of the TREE04 scenario, each for
> 10 hours, for a total of 40 hours of test time.
> 
> > > It does take some time to run.  I did 4,000 hours worth of TREE04
> >                                         ^^^ '--duration=4000h' can serve this purpose?
> 
> You could, at least if you replace the "=" with a space character, but
> that really would run a six-month test, which is probably not what you
> want to do.  There being 8,760 hours in a year and all that.
> 
> > Is it related to the CPU frequency?
> 
> Not at all.  '--duration 10h' would run ten hours of wall-clock time
> regardless of the CPU frequencies.
> 
> > > to confirm lack of bug.  But an 80-CPU dual-socket system can run
> > > 10 concurrent instances of TREE04, which gets things down to a more
> > 
> > So the total required hours are H = 4000/(system_cpu_num/8)?
> 
> Yes.  You can also use multiple systems, which is what kvm-remote.sh is
> intended for, again assuming 80 CPUs per system to keep the arithmetic
> simple:
> 
> tools/testing/selftests/rcutorture/bin/kvm-remote.sh "sys1 sys2 ... sys20" --duration 20h --cpus 80 --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "200*TREE04"
> 

That is appealing.

I will see whether I can get hold of a batch of machines to run the
test.


Thanks,

	Pingfan
> Here "sys1" is the name of the system, on which you must have an account
> so that "ssh sys1 date" runs the date command on the first remote system.
> You really do have to use the "--cpus 80" because kvm-remote.sh does not
> assume that the system that it is running on is one of the test systems.
> 
> > > manageable 400 hours.  Please let me know if you don't have access
> > > to a few such systems.
> > 
> > I am happy to give it a try if needed. I will try to get a more
> > powerful machine, which would shrink the test time.
> 
> Larger numbers of little systems work, also, but in my experience you need
> a dual-socket system to have a reasonable chance of reproducing this bug.
> Each socket can be small, though, if that helps.
> 
> If you work for a cloud provider or some such, you can probably get a
> large number of systems.  If you can only get a few, you can do initial
> testing, and then we can work out what to do about heavier-duty testing.
> 
> > > I will let Frederic identify which commit(s) should be reverted in
> > > order to test the test.
> > > 
> > 
> > My understanding is that we first remove the tick dependency with:
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 79aea7df4345..cbfc884f04a4 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -2171,8 +2171,6 @@ int rcutree_dead_cpu(unsigned int cpu)
> >         WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus - 1);
> >         /* Adjust any no-longer-needed kthreads. */
> >         rcu_boost_kthread_setaffinity(rnp, -1);
> > -       // Stop-machine done, so allow nohz_full to disable tick.
> > -       tick_dep_clear(TICK_DEP_BIT_RCU);
> >         return 0;
> >  }
> > 
> > @@ -4008,8 +4006,6 @@ int rcutree_online_cpu(unsigned int cpu)
> >         sync_sched_exp_online_cleanup(cpu);
> >         rcutree_affinity_setting(cpu, -1);
> > 
> > -       // Stop-machine done, so allow nohz_full to disable tick.
> > -       tick_dep_clear(TICK_DEP_BIT_RCU);
> >         return 0;
> >  }
> > 
> > @@ -4031,8 +4027,6 @@ int rcutree_offline_cpu(unsigned int cpu)
> > 
> >         rcutree_affinity_setting(cpu, cpu);
> > 
> > -       // nohz_full CPUs need the tick for stop-machine to work quickly
> > -       tick_dep_set(TICK_DEP_BIT_RCU);
> >         return 0;
> >  }
> > 
> > If TREE04 then passes, move on to reverting the commit(s)
> > identified by Frederic, and test again.
> > 
> > At that point, a TREE04 failure is expected.
> > 
> > If both of these results are observed, TICK_DEP_BIT_RCU can be
> > removed.
> > 
> > Is my understanding right?
> 
> Seems plausible to me, but I again defer to Frederic.
> 
> 							Thanx, Paul
