Re: [PATCHv2 3/3] rcu: coordinate tick dependency during concurrent offlining

On Mon, Sep 26, 2022 at 02:34:17PM +0800, Pingfan Liu wrote:
> Sorry for the late reply. I just realized that this e-mail was missing from my Gmail.
> 
> On Thu, Sep 22, 2022 at 06:54:42AM -0700, Paul E. McKenney wrote:
> [...]
> > 
> > If you have tools/.../rcutorture/bin on your path, yes.  This would default
> > to a 30-minute run.  If you have at least 16 CPUs, you should add
>                                             ^^^ TREE04 has CONFIG_NR_CPUS=8, so I think here the num is 8

Yes, you will get some benefit from --allcpus on systems with 9-15 CPUs,
as well as on those with 16 or more.  At 8 CPUs, it wouldn't matter.

> > "--allcpus" to do concurrent runs.  For example, given 64 CPUs you could
> > do this:
> > 
> > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 10h --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "4*TREE04"
> > 
> 
> I have tried to find a two-socket system with 128 CPUs and run
>   sh kvm.sh --allcpus --duration 250h --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "16*TREE04"
> 
> Where 250*16=4000

That would work.
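
The arithmetic used throughout this thread (H = 4000/(system_cpu_num/8))
can be sketched in shell as follows.  This is only an illustration: the
function names are hypothetical and are not part of the rcutorture
scripts, and it assumes that each TREE04 instance occupies 8 CPUs
(CONFIG_NR_CPUS=8) and that all systems have the same number of CPUs.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical, not part of rcutorture.
# Assumption: one TREE04 instance occupies 8 CPUs (CONFIG_NR_CPUS=8).
CPUS_PER_INSTANCE=8

# Number of concurrent TREE04 instances that fit on a system with $1 CPUs.
tree04_instances() {
	echo $(( $1 / CPUS_PER_INSTANCE ))
}

# Wall-clock hours needed to accumulate $1 instance-hours of testing
# on $2 systems with $3 CPUs each, rounded up to a whole hour.
wall_clock_hours() {
	total=$1
	nsys=$2
	cpus=$3
	n=$(( nsys * (cpus / CPUS_PER_INSTANCE) ))
	if [ "$n" -le 0 ]; then
		echo "need at least $CPUS_PER_INSTANCE CPUs per system" >&2
		return 1
	fi
	echo $(( (total + n - 1) / n ))
}

wall_clock_hours 4000 1 80    # one 80-CPU system:    400
wall_clock_hours 4000 1 128   # one 128-CPU system:   250
wall_clock_hours 4000 20 80   # twenty 80-CPU systems: 20
```

The three results match the figures quoted in this thread: 400 hours on
one 80-CPU system, 250 hours on one 128-CPU system, and 20 hours across
twenty 80-CPU systems.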

> > This would run four concurrent instances of the TREE04 scenario, each for
> > 10 hours, for a total of 40 hours of test time.
> > 
> > > > It does take some time to run.  I did 4,000 hours worth of TREE04
> > >                                         ^^^ '--duration=4000h' can serve this purpose?
> > 
> > You could, at least if you replace the "=" with a space character, but
> > that really would run a six-month test, which is probably not what you
> > want to do.  There being 8,760 hours in a year and all that.
> > 
> > > Is it related to the CPU frequency?
> > 
> > Not at all.  '--duration 10h' would run ten hours of wall-clock time
> > regardless of the CPU frequencies.
> > 
> > > > to confirm lack of bug.  But an 80-CPU dual-socket system can run
> > > > 10 concurrent instances of TREE04, which gets things down to a more
> > > 
> > > The total demanded hours H = 4000/(system_cpu_num/8)?
> > 
> > Yes.  You can also use multiple systems, which is what kvm-remote.sh is
> > intended for, again assuming 80 CPUs per system to keep the arithmetic
> > simple:
> > 
> > tools/testing/selftests/rcutorture/bin/kvm-remote.sh "sys1 sys2 ... sys20" --duration 20h --cpus 80 --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "200*TREE04"
> > 
> 
> That is appealing.
> 
> I will see whether I can get hold of a batch of machines to run the
> test.

Initial tests with smaller numbers of CPUs are also useful, for example,
in case reversion causes some bug due to bad interaction with a later
commit.

Please let me know how it goes!

							Thanx, Paul

> Thanks,
> 
> 	Pingfan
> > Here "sys1" is the name of the system, on which you must have an account
> > so that "ssh sys1 date" runs the date command on the first remote system.
> > You really do have to use the "--cpus 80" because kvm-remote.sh does not
> > assume that the system that it is running on is one of the test systems.
> > 
> > > > manageable 400 hours.  Please let me know if you don't have access
> > > > to a few such systems.
> > > 
> > > I am happy to have a try if needed. I will try to get a powerful
> > > machine, which can shrink the test time.
> > 
> > Larger numbers of little systems work, also, but in my experience you need
> > a dual-socket system to have a reasonable chance of reproducing this bug.
> > Each socket can be small, though, if that helps.
> > 
> > If you work for a cloud provider or some such, you can probably get a
> > large number of systems.  If you can only get a few, you can do initial
> > testing, and then we can work out what to do about heavier-duty testing.
> > 
> > > > I will let Frederic identify which commit(s) should be reverted in
> > > > order to test the test.
> > > > 
> > > 
> > > My understanding is that the first step is to remove the tick dependency with:
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 79aea7df4345..cbfc884f04a4 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -2171,8 +2171,6 @@ int rcutree_dead_cpu(unsigned int cpu)
> > >         WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus - 1);
> > >         /* Adjust any no-longer-needed kthreads. */
> > >         rcu_boost_kthread_setaffinity(rnp, -1);
> > > -       // Stop-machine done, so allow nohz_full to disable tick.
> > > -       tick_dep_clear(TICK_DEP_BIT_RCU);
> > >         return 0;
> > >  }
> > > 
> > > @@ -4008,8 +4006,6 @@ int rcutree_online_cpu(unsigned int cpu)
> > >         sync_sched_exp_online_cleanup(cpu);
> > >         rcutree_affinity_setting(cpu, -1);
> > > 
> > > -       // Stop-machine done, so allow nohz_full to disable tick.
> > > -       tick_dep_clear(TICK_DEP_BIT_RCU);
> > >         return 0;
> > >  }
> > > 
> > > @@ -4031,8 +4027,6 @@ int rcutree_offline_cpu(unsigned int cpu)
> > > 
> > >         rcutree_affinity_setting(cpu, cpu);
> > > 
> > > -       // nohz_full CPUs need the tick for stop-machine to work quickly
> > > -       tick_dep_set(TICK_DEP_BIT_RCU);
> > >         return 0;
> > >  }
> > > 
> > > If TREE04 then succeeds, move on to reverting the commit(s)
> > > identified by Frederic, and test again.
> > > 
> > > At that point, a TREE04 failure is expected.
> > > 
> > > If the above two results are observed, TICK_DEP_BIT_RCU can be
> > > removed.
> > > 
> > > Is my understanding right?
> > 
> > Seems plausible to me, but I again defer to Frederic.
> > 
> > 							Thanx, Paul


