On Thu, Sep 29, 2022 at 04:19:28PM +0800, Pingfan Liu wrote:
> On Tue, Sep 27, 2022 at 5:59 PM Pingfan Liu <kernelfans@xxxxxxxxx> wrote:
> >
> > On Mon, Sep 26, 2022 at 03:23:52PM -0700, Paul E. McKenney wrote:
> > > On Mon, Sep 26, 2022 at 02:34:17PM +0800, Pingfan Liu wrote:
> > > > Sorry for the late reply. I just realized that this e-mail was missed in my gmail.
> > > >
> > > > On Thu, Sep 22, 2022 at 06:54:42AM -0700, Paul E. McKenney wrote:
> > > > [...]
> > > > >
> > > > > If you have tools/.../rcutorture/bin on your path, yes. This would default
> > > > > to a 30-minute run. If you have at least 16 CPUs, you should add
> > > >
> > > > ^^^ TREE04 has CONFIG_NR_CPUS=8, so I think the number here is 8.
> > >
> > > Yes, you will get some benefit from --allcpus on systems with 9-15
> > > CPUs as well as with 16 and more. At 8 CPUs, it wouldn't matter.
> > >
> > > > > "--allcpus" to do concurrent runs. For example, given 64 CPUs you could
> > > > > do this:
> > > > >
> > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 10h --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "4*TREE04"
> > > >
> > > > I have found a two-socket system with 128 CPUs and run:
> > > > sh kvm.sh --allcpus --duration 250h --bootargs rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30 --configs 16*TREE04
> > > >
> > > > where 250*16=4000.
> > >
> > > That would work.
> >
> > This job has successfully run for 24+ hours. (But maybe I can only keep it
> > for about 180 hours.)
> >
> > > > > This would run four concurrent instances of the TREE04 scenario, each for
> > > > > 10 hours, for a total of 40 hours of test time.
> > > > >
> > > > > > > It does take some time to run. I did 4,000 hours worth of TREE04
> > > > > >
> > > > > > ^^^ Can '--duration=4000h' serve this purpose?
> > > > >
> > > > > You could, at least if you replace the "=" with a space character, but
> > > > > that really would run a six-month test, which is probably not what you
> > > > > want to do. There being 8,760 hours in a year and all that.
> > > > >
> > > > > > Is it related to the CPU's frequency?
> > > > >
> > > > > Not at all. '--duration 10h' would run ten hours of wall-clock time
> > > > > regardless of the CPU frequencies.
> > > > >
> > > > > > > to confirm lack of bug. But an 80-CPU dual-socket system can run
> > > > > > > 10 concurrent instances of TREE04, which gets things down to a more
> > > > > >
> > > > > > So the total required hours H = 4000/(system_cpu_num/8)?
> > > > >
> > > > > Yes. You can also use multiple systems, which is what kvm-remote.sh is
> > > > > intended for, again assuming 80 CPUs per system to keep the arithmetic
> > > > > simple:
> > > > >
> > > > > tools/testing/selftests/rcutorture/bin/kvm-remote.sh "sys1 sys2 ... sys20" --duration 20h --cpus 80 --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "200*TREE04"
> > > >
> > > > That is appealing.
> > > >
> > > > I will see if there is any opportunity to grab a batch of machines to run
> > > > the test.
> > >
> > > Initial tests with smaller numbers of CPUs are also useful, for example,
> > > in case the reversion causes some bug due to a bad interaction with a later
> > > commit.
> > >
> > > Please let me know how it goes!
> >
> > I have managed to grab three two-socket machines, each with 256 CPUs.
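
(As a quick sanity check of the H = 4000/(system_cpu_num/8) arithmetic for
a setup like this one, here is a small sketch. It only reuses numbers from
earlier in this thread, namely 8 CPUs per TREE04 guest and the 4,000
instance-hour target, so treat it as an estimate rather than tool output.)

  # Sketch: wall-clock hours needed for 4,000 TREE04 instance-hours
  # on three 256-CPU systems, assuming 8 CPUs per TREE04 guest.
  systems=3
  cpus_per_system=256
  instances=$(( systems * cpus_per_system / 8 ))    # 96 concurrent instances
  hours=$(( (4000 + instances - 1) / instances ))   # integer division, rounds up to 42
  echo "${instances} concurrent TREE04 instances, about ${hours}h of wall-clock time"

That works out to 96 concurrent instances and roughly 42 hours, so a run
like the 45-hour "96*TREE04" invocation below should more than cover it.
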
> > The test has run for about 7 hours so far without any problem, using the
> > following command:
> >
> > tools/testing/selftests/rcutorture/bin/kvm-remote.sh "sys1 sys2 sys3" \
> >     --duration 45h --cpus 256 --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "96*TREE04"
> >
> > It seems promising.
>
> The test is against the v6.0-rc7 kernel, with only 96926686deab ("rcu:
> Make CPU-hotplug removal operations enable tick") reverted. It is
> close to the end, but unfortunately it has failed.
>
> Quote from the remote-log:
> "
> TREE04.57 ------- 4410955 GPs (27.2281/s) [rcu: g36045577 f0x0
> total-gps=9011687] n_max_cbs: 4111392
> TREE04.58 ------- 4368391 GPs (26.9654/s) [rcu: g35630093 f0x0
> total-gps=8907816] n_max_cbs: 2411104
> TREE04.59 ------- 800516 GPs (4.94146/s) n_max_cbs: 3634471
> QEMU killed
> TREE04.59 no success message, 10547 successful version messages
> WARNING: TREE04.59 GP HANG at 800516 torture stat 1925
> WARNING: Assertion failure in
> /home/linux/tools/testing/selftests/rcutorture/res/2022.09.26-23.33.34-remote/TREE04.59/console.log
> TREE04.59
> WARNING: Summary: Call Traces: 1 Stalls: 8615
> TREE04.6 ------- 4348443 GPs (26.8422/s) [rcu: g35341129 f0x0
> total-gps=8835575] n_max_cbs: 2329432

First, thank you for running this!

This is not the typical failure that we were seeing, which would show up as
a 2.199.0-second RCU CPU stall during which time there would be no console
messages.

But please do let me know how continuing tests go!

							Thanx, Paul

> ...
> ...
> TREE04.91 ------- 4895716 GPs (30.2205/s) [rcu: g39322065 f0x0
> total-gps=9830808] n_max_cbs: 2208839
> TREE04.92 ------- 4902696 GPs (30.2636/s) [rcu: g39113441 f0x0
> total-gps=9778652] n_max_cbs: 1412377
> TREE04.93 ------- 4891393 GPs (30.1938/s) [rcu: g39244749 f0x0
> total-gps=9811481] n_max_cbs: 1772653
> TREE04.94 ------- 4921510 GPs (30.3797/s) [rcu: g39187349 f0x0
> total-gps=9797129] n_max_cbs: 1120534
> TREE04.95 ------- 4885795 GPs (30.1592/s) [rcu: g39020985 f0x0
> total-gps=9755538] n_max_cbs: 1178416
> TREE04.96 ------- 4889097 GPs (30.1796/s) [rcu: g39097057 f0x0
> total-gps=9774556] n_max_cbs: 1861434
> 1 runs with runtime errors.
> --- Done at Wed Sep 28 08:40:31 PM EDT 2022 (1d 21:06:57) exitcode 2
> "
>
> Quote from the console.log of TREE04.59:
> "
> .....
> [162001.696486] rcu-torture: rcu_torture_barrier_cbs is stopping
> [162001.697004] rcu-torture: Stopping rcu_torture_fwd_prog task
> [162001.697662] rcu_torture_fwd_prog n_max_cbs: 0
> [162001.698195] rcu_torture_fwd_prog: Starting forward-progress test 0
> [162001.698782] rcu_torture_fwd_prog_cr: Starting forward-progress test 0
> [162001.707571] rcu_torture_fwd_prog_cr: Waiting for CBs:
> rcu_barrier+0x0/0x3b0() 0
> [162002.738504] rcu_torture_fwd_prog_nr: Starting forward-progress test 0
> [162002.746491] rcu_torture_fwd_prog_nr: Waiting for CBs:
> rcu_barrier+0x0/0x3b0() 0
> [162002.850483] rcu_torture_fwd_prog: tested 2105 tested_tries 2107
> [162002.851008] rcu-torture: rcu_torture_fwd_prog is stopping
> [162002.851542] rcu-torture: Stopping rcu_torture_writer task
> [162004.530463] rcu-torture: rtc: 00000000ac003c99 ver: 800516 tfle: 0
> rta: 800517 rtaf: 0 rtf: 800507 rtmbe: 0 rtmbkf: 0/142699 rtbe: 0
> rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 205710931 onoff:
> 185194/185194:185196/185196 1,1860:1,3263 25610601:47601063 (HZ=1000)
> barrier: 773783/773783:0 read-exits: 184960 nocb-toggles: 0:0
> [162004.532583] rcu-torture: Reader Pipe: 343007605654 1359216 0 0 0
> 0 0 0 0 0 0
> [162004.533113] rcu-torture: Reader Batch: 342996212546 12752324 0 0
> 0 0 0 0 0 0 0
> [162004.533648] rcu-torture: Free-Block Circulation: 800516 800515
> 800514 800513 800512 800511 800510 800509 800508 800507 0
> [162004.534442] ??? Writer stall state RTWS_EXP_SYNC(4) g30755544 f0x0
> ->state 0x2 cpu 0
> [162004.535057] rcu: rcu_sched: wait state: RCU_GP_WAIT_GPS(1)
> ->state: 0x402 ->rt_priority 0 delta ->gp_start 1674 ->gp_activity
> 1670 ->gp_req_activity 1674 ->gp_wake_time 1674 ->gp_wake_seq 30755540
> ->gp_seq 30755544 ->gp_seq_needed 30755544 ->gp_max 989 ->gp_flags 0x0
> [162004.536805] rcu: CB 1^0->2 KbclSW F2838 L2838 C0 ..... q0 S CPU 0
> [162004.537277] rcu: CB 2^0->3 KbclSW F2911 L2911 C7 ..... q0 S CPU 0
> [162004.537742] rcu: CB 3^0->-1 KbclSW F1686 L1686 C2 ..... q0 S CPU 0
> [162004.538217] rcu: nocb GP 4 KldtS W[..] ..:0 rnp 4:7 2176869 S CPU 0
> [162004.538729] rcu: CB 4^4->5 KbclSW F2912 L2912 C7 ..... q0 S CPU 0
> [162004.539202] rcu: CB 5^4->6 KbclSW F2871 L2872 C1 ..... q0 S CPU 0
> [162004.539667] rcu: CB 6^4->7 KbclSW F4060 L4060 C0 ..... q0 S CPU 0
> [162004.540136] rcu: CB 7^4->-1 KbclSW F5763 L5763 C1 ..... q0 S CPU 0
> [162004.540653] rcu: RCU callbacks invoked since boot: 1431149091
> [162004.541076] rcu-torture: rcu_torture_stats is stopping
> "
>
> I have no idea whether this is related to the reverted commit.
>
> Thanks,
>
> Pingfan
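
If it helps while digging through that console.log, here is a minimal
sketch of how one might pull out the relevant evidence with standard
tools. The path is simply the one from the summary above, and the
patterns are strings visible in the quoted output, so adjust both as
needed:

  # Sketch: count stall reports and locate stall/call-trace lines
  # in the failing scenario's console log.
  log=/home/linux/tools/testing/selftests/rcutorture/res/2022.09.26-23.33.34-remote/TREE04.59/console.log
  grep -c -i 'stall' "$log"
  grep -n -i -E 'stall|call trace' "$log" | head -n 40

Nothing above is specific to the reverted commit; it is just a starting
point for comparing this failure against the stall behavior mentioned
earlier.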