On Mon, Sep 26, 2022 at 02:34:17PM +0800, Pingfan Liu wrote:
> Sorry to reply late. I just realize this e-mail misses in my gmail.
>
> On Thu, Sep 22, 2022 at 06:54:42AM -0700, Paul E. McKenney wrote:
> [...]
> >
> > If you have tools/.../rcutorture/bin on your path, yes. This would default
> > to a 30-minute run. If you have at least 16 CPUs, you should add
>                                             ^^^ TREE04 has CONFIG_NR_CPUS=8, so I think here the num is 8

Yes, you will get some benefit from --allcpus on systems with from 9-15
CPUs as well as for 16 and more. At 8 CPUs, it wouldn't matter.

> > "--allcpus" to do concurrrent runs. For example, given 64 CPUs you could
> > do this:
> >
> > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 10h --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "4*TREE04"
> >
>
> I have tried to find a two socket system with 128 cpus and run
>   sh kvm.sh --allcpus --duration 250h --bootargs rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30 --configs 16*TREE04
>
> Where 250*16=4000

That would work.

> > This would run four concurrent instances of the TREE04 scenario, each for
> > 10 hours, for a total of 40 hours of test time.
> >
> > > > It does take some time to run. I did 4,000 hours worth of TREE04
> > >                                         ^^^ '--duration=4000h' can serve this purpose?
> >
> > You could, at least if you replace the "=" with a space character, but
> > that really would run a six-month test, which is probably not what you
> > want to do. There being 8,760 hours in a year and all that.
> >
> > > Is it related with the cpu's freq?
> >
> > Not at all. '--duration 10h' would run ten hours of wall-clock time
> > regardless of the CPU frequencies.
> >
> > > > to confirm lack of bug. But an 80-CPU dual-socket system can run
> > > > 10 concurrent instances of TREE04, which gets things down to a more
> > >
> > > The total demanded hours H = 4000/(system_cpu_num/8)?
> >
> > Yes.
> > You can also use multiple systems, which is what kvm-remote.sh is
> > intended for, again assuming 80 CPUs per system to keep the arithmetic
> > simple:
> >
> > tools/testing/selftests/rcutorture/bin/kvm-remote.sh "sys1 sys2 ... sys20" --duration 20h --cpus 80 --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "200*TREE04"
> >
>
> That is appealing.
>
> I will see if any opportunity to grasp a batch of machines to run the
> test.

Initial tests with smaller numbers of CPUs are also useful, for example,
in case reversion causes some bug due to bad interaction with a later
commit.

Please let me know how it goes!

                                                        Thanx, Paul

> Thanks,
>
>         Pingfan
> > Here "sys1" is the name of the system, on which you must have an account
> > so that "ssh sys1 date" runs the date command on the first remote system.
> > You really do have to use the "--cpus 80" because kvm-remote.sh does not
> > assume that the system that it is running on is one of the test systems.
> >
> > > > manageable 400 hours. Please let me know if you don't have access
> > > > to a few such systems.
> > >
> > > I am happy to have a try if needed. I will try to get a powerful
> > > machine, which can shrink the test time.
> >
> > Larger numbers of little systems work, also, but in my experience you need
> > a dual-socket system to have a reasonable chance of reproducing this bug.
> > Each socket can be small, though, if that helps.
> >
> > If you work for a cloud provider or some such, you can probably get a
> > large number of systems. If you can only get a few, you can do initial
> > testing, and then we can work out what to do about heavier-duty testing.
> >
> > > > I will let Frederic identify which commit(s) should be reverted in
> > > > order to test the test.
> > > >
> > >
> > > My understanding is after removing the tick dep by
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 79aea7df4345..cbfc884f04a4 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -2171,8 +2171,6 @@ int rcutree_dead_cpu(unsigned int cpu)
> > >         WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus - 1);
> > >         /* Adjust any no-longer-needed kthreads. */
> > >         rcu_boost_kthread_setaffinity(rnp, -1);
> > > -       // Stop-machine done, so allow nohz_full to disable tick.
> > > -       tick_dep_clear(TICK_DEP_BIT_RCU);
> > >         return 0;
> > >  }
> > >
> > > @@ -4008,8 +4006,6 @@ int rcutree_online_cpu(unsigned int cpu)
> > >         sync_sched_exp_online_cleanup(cpu);
> > >         rcutree_affinity_setting(cpu, -1);
> > >
> > > -       // Stop-machine done, so allow nohz_full to disable tick.
> > > -       tick_dep_clear(TICK_DEP_BIT_RCU);
> > >         return 0;
> > >  }
> > >
> > > @@ -4031,8 +4027,6 @@ int rcutree_offline_cpu(unsigned int cpu)
> > >
> > >         rcutree_affinity_setting(cpu, cpu);
> > >
> > > -       // nohz_full CPUs need the tick for stop-machine to work quickly
> > > -       tick_dep_set(TICK_DEP_BIT_RCU);
> > >         return 0;
> > >  }
> > >
> > > If the TREE04 can success, then move on to revert the commit(s)
> > > identified by Frederic, and do test again.
> > >
> > > At this time, a TREE04 failure is expected.
> > >
> > > If the above two results are observed, TICK_DEP_BIT_RCU can be
> > > removed.
> > >
> > > Is my understanding right?
> >
> > Seems plausible to me, but I again defer to Frederic.
> >
> >                                                         Thanx, Paul
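
For reference, the run-time arithmetic discussed in this thread (target
test hours divided by the number of concurrent TREE04 instances) can be
sketched as below. This is only an illustrative Python sketch: the
function names are hypothetical and are not part of the rcutorture
scripts, and it assumes each TREE04 instance occupies 8 CPUs, as implied
by CONFIG_NR_CPUS=8.

```python
# Illustrative sketch of the test-time arithmetic from this thread.
# The helper names are hypothetical; they are not rcutorture APIs.

CPUS_PER_TREE04 = 8  # assumption: TREE04 is built with CONFIG_NR_CPUS=8


def concurrent_instances(total_cpus: int,
                         cpus_per_instance: int = CPUS_PER_TREE04) -> int:
    """Number of scenario instances that fit on total_cpus at once."""
    return total_cpus // cpus_per_instance


def wall_clock_hours(target_hours: float, total_cpus: int) -> float:
    """Wall-clock duration needed to accumulate target_hours of testing."""
    instances = concurrent_instances(total_cpus)
    if instances < 1:
        raise ValueError("need at least one instance's worth of CPUs")
    return target_hours / instances


if __name__ == "__main__":
    # CPU counts taken from the examples in this thread.
    for label, cpus in [("one 80-CPU system", 80),
                        ("one 128-CPU system", 128),
                        ("20 systems of 80 CPUs", 20 * 80)]:
        print(f"{label}: {concurrent_instances(cpus)} instances, "
              f"{wall_clock_hours(4000, cpus):g} hours")
```

With the numbers quoted above, this gives 10 instances and 400 hours for
one 80-CPU system, 16 instances and 250 hours for the 128-CPU system, and
200 instances and 20 hours for twenty 80-CPU systems.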