Re: [PATCHv2 3/3] rcu: coordinate tick dependency during concurrent offlining

On Tue, Sep 27, 2022 at 5:59 PM Pingfan Liu <kernelfans@xxxxxxxxx> wrote:
>
> On Mon, Sep 26, 2022 at 03:23:52PM -0700, Paul E. McKenney wrote:
> > On Mon, Sep 26, 2022 at 02:34:17PM +0800, Pingfan Liu wrote:
> > > Sorry for the late reply. I just realized this e-mail went missing in my Gmail.
> > >
> > > On Thu, Sep 22, 2022 at 06:54:42AM -0700, Paul E. McKenney wrote:
> > > [...]
> > > >
> > > > If you have tools/.../rcutorture/bin on your path, yes.  This would default
> > > > to a 30-minute run.  If you have at least 16 CPUs, you should add
> > >                                             ^^^ TREE04 has CONFIG_NR_CPUS=8, so I think here the num is 8
> >
> > Yes, you will get some benefit from --allcpus on systems with from 9-15
> > CPUs as well as for 16 and more.  At 8 CPUs, it wouldn't matter.
> >
> > > > "--allcpus" to do concurrent runs.  For example, given 64 CPUs you could
> > > > do this:
> > > >
> > > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 10h --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "4*TREE04"
> > > >
> > >
> > > I managed to find a two-socket system with 128 CPUs and ran
> > >   sh kvm.sh --allcpus --duration 250h --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "16*TREE04"
> > >
> > > Where 250*16=4000
> >
> > That would work.
> >
>
> This job has run successfully for 24+ hours. (But I may only be able to
> keep it running for about 180 hours.)
>
> > > > This would run four concurrent instances of the TREE04 scenario, each for
> > > > 10 hours, for a total of 40 hours of test time.
> > > >
> > > > > > It does take some time to run.  I did 4,000 hours worth of TREE04
> > > > >                                         ^^^ Can '--duration=4000h' serve this purpose?
> > > >
> > > > You could, at least if you replace the "=" with a space character, but
> > > > that really would run a six-month test, which is probably not what you
> > > > want to do.  There being 8,760 hours in a year and all that.
> > > >
> > > > > Is it related to the CPU's frequency?
> > > >
> > > > Not at all.  '--duration 10h' would run ten hours of wall-clock time
> > > > regardless of the CPU frequencies.
> > > >
> > > > > > to confirm lack of bug.  But an 80-CPU dual-socket system can run
> > > > > > 10 concurrent instances of TREE04, which gets things down to a more
> > > > >
> > > > > So the total required hours are H = 4000/(system_cpu_num/8)?
> > > >
> > > > Yes.  You can also use multiple systems, which is what kvm-remote.sh is
> > > > intended for, again assuming 80 CPUs per system to keep the arithmetic
> > > > simple:
> > > >
> > > > tools/testing/selftests/rcutorture/bin/kvm-remote.sh "sys1 sys2 ... sys20" --duration 20h --cpus 80 --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "200*TREE04"
> > > >
> > >
> > > That is appealing.
> > >
> > > I will see whether I can get hold of a batch of machines to run the
> > > test.
> >
> > Initial tests with smaller numbers of CPUs are also useful, for example,
> > in case reversion causes some bug due to bad interaction with a later
> > commit.
> >
> > Please let me know how it goes!
> >
>
> I have managed to get hold of three two-socket machines, each with 256 CPUs.
> The test has run for about 7 hours so far without any problem, using the following command:
> tools/testing/selftests/rcutorture/bin/kvm-remote.sh "sys1 sys2 sys3" \
> --duration 45h --cpus 256 --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "96*TREE04"
>
> It seems promising.
>
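
The sizing arithmetic used in this thread (one TREE04 instance per 8 CPUs, and aggregate test hours = instances x duration) can be sketched as a small shell helper. The function name tree04_hours and its argument order are hypothetical and not part of the rcutorture scripts; it only reproduces the H = 4000/(system_cpu_num/8) calculation, rounded up.

```shell
# Hypothetical helper (not part of rcutorture): wall-clock hours each
# instance must run to reach a target number of aggregate test hours.
# usage: tree04_hours TARGET_HOURS TOTAL_CPUS [CPUS_PER_INSTANCE]
tree04_hours() {
    target=$1
    total_cpus=$2
    per_instance=${3:-8}    # TREE04 has CONFIG_NR_CPUS=8
    instances=$((total_cpus / per_instance))
    if [ "$instances" -lt 1 ]; then
        echo "not enough CPUs for one instance" >&2
        return 1
    fi
    # Integer division rounded up, so the aggregate never falls short.
    echo $(( (target + instances - 1) / instances ))
}

tree04_hours 4000 128    # 16 instances, prints 250
tree04_hours 4000 768    # 3 systems x 256 CPUs = 96 instances, prints 42
```

With 96 instances, the "--duration 45h" used above gives 96 x 45 = 4320 aggregate hours, slightly more than the 4,000-hour target.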

The test ran against a v6.0-rc7 kernel with only 96926686deab ("rcu:
Make CPU-hotplug removal operations enable tick") reverted. It got
close to the end, but unfortunately one of the runs failed.
Quote from remote-log:
"
TREE04.57 ------- 4410955 GPs (27.2281/s) [rcu: g36045577 f0x0 total-gps=9011687] n_max_cbs: 4111392
TREE04.58 ------- 4368391 GPs (26.9654/s) [rcu: g35630093 f0x0 total-gps=8907816] n_max_cbs: 2411104
TREE04.59 ------- 800516 GPs (4.94146/s) n_max_cbs: 3634471
QEMU killed
TREE04.59 no success message, 10547 successful version messages
WARNING: TREE04.59 GP HANG at 800516 torture stat 1925
WARNING: Assertion failure in /home/linux/tools/testing/selftests/rcutorture/res/2022.09.26-23.33.34-remote/TREE04.59/console.log TREE04.59
WARNING: Summary: Call Traces: 1 Stalls: 8615
TREE04.6 ------- 4348443 GPs (26.8422/s) [rcu: g35341129 f0x0 total-gps=8835575] n_max_cbs: 2329432
...
...
TREE04.91 ------- 4895716 GPs (30.2205/s) [rcu: g39322065 f0x0 total-gps=9830808] n_max_cbs: 2208839
TREE04.92 ------- 4902696 GPs (30.2636/s) [rcu: g39113441 f0x0 total-gps=9778652] n_max_cbs: 1412377
TREE04.93 ------- 4891393 GPs (30.1938/s) [rcu: g39244749 f0x0 total-gps=9811481] n_max_cbs: 1772653
TREE04.94 ------- 4921510 GPs (30.3797/s) [rcu: g39187349 f0x0 total-gps=9797129] n_max_cbs: 1120534
TREE04.95 ------- 4885795 GPs (30.1592/s) [rcu: g39020985 f0x0 total-gps=9755538] n_max_cbs: 1178416
TREE04.96 ------- 4889097 GPs (30.1796/s) [rcu: g39097057 f0x0 total-gps=9774556] n_max_cbs: 1861434
1 runs with runtime errors.
 --- Done at Wed Sep 28 08:40:31 PM EDT 2022 (1d 21:06:57) exitcode 2
"
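
With 96 scenarios in the summary, the odd one out is easy to miss by eye. A minimal sketch for scanning the per-scenario lines quoted above for unusually low grace-period rates follows. The function name low_gp_rate and the threshold are hypothetical; it assumes the "NAME ------- N GPs (R/s)" line layout shown in this log and is not an rcutorture tool.

```shell
# Hypothetical helper: print scenarios whose grace-period rate (GPs/s)
# is below a given threshold. Reads the summary on stdin.
# usage: low_gp_rate MIN_RATE < remote-log
low_gp_rate() {
    awk -v min="$1" '
        # Match lines such as: TREE04.59 ------- 800516 GPs (4.94146/s) ...
        $2 ~ /^-+$/ && $4 == "GPs" {
            rate = $5
            gsub("[()]|/s", "", rate)    # "(4.94146/s)" becomes "4.94146"
            if (rate + 0 < min) print $1, rate
        }'
}
```

Run against the summary above with a threshold of 10, this would list only TREE04.59, whose rate of 4.94146/s stands out against the 26 to 30 GPs/s of the other scenarios shown. A low rate is only a hint; the WARNING lines and the console.log remain the actual evidence.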

Quote from the console.log of TREE04.59:
"
.....
[162001.696486] rcu-torture: rcu_torture_barrier_cbs is stopping
[162001.697004] rcu-torture: Stopping rcu_torture_fwd_prog task
[162001.697662] rcu_torture_fwd_prog n_max_cbs: 0
[162001.698195] rcu_torture_fwd_prog: Starting forward-progress test 0
[162001.698782] rcu_torture_fwd_prog_cr: Starting forward-progress test 0
[162001.707571] rcu_torture_fwd_prog_cr: Waiting for CBs: rcu_barrier+0x0/0x3b0() 0
[162002.738504] rcu_torture_fwd_prog_nr: Starting forward-progress test 0
[162002.746491] rcu_torture_fwd_prog_nr: Waiting for CBs: rcu_barrier+0x0/0x3b0() 0
[162002.850483] rcu_torture_fwd_prog: tested 2105 tested_tries 2107
[162002.851008] rcu-torture: rcu_torture_fwd_prog is stopping
[162002.851542] rcu-torture: Stopping rcu_torture_writer task
[162004.530463] rcu-torture: rtc: 00000000ac003c99 ver: 800516 tfle: 0 rta: 800517 rtaf: 0 rtf: 800507 rtmbe: 0 rtmbkf: 0/142699 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 205710931 onoff: 185194/185194:185196/185196 1,1860:1,3263 25610601:47601063 (HZ=1000) barrier: 773783/773783:0 read-exits: 184960 nocb-toggles: 0:0
[162004.532583] rcu-torture: Reader Pipe:  343007605654 1359216 0 0 0 0 0 0 0 0 0
[162004.533113] rcu-torture: Reader Batch:  342996212546 12752324 0 0 0 0 0 0 0 0 0
[162004.533648] rcu-torture: Free-Block Circulation:  800516 800515 800514 800513 800512 800511 800510 800509 800508 800507 0
[162004.534442] ??? Writer stall state RTWS_EXP_SYNC(4) g30755544 f0x0 ->state 0x2 cpu 0
[162004.535057] rcu: rcu_sched: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 ->rt_priority 0 delta ->gp_start 1674 ->gp_activity 1670 ->gp_req_activity 1674 ->gp_wake_time 1674 ->gp_wake_seq 30755540 ->gp_seq 30755544 ->gp_seq_needed 30755544 ->gp_max 989 ->gp_flags 0x0
[162004.536805] rcu:    CB 1^0->2 KbclSW F2838 L2838 C0 ..... q0 S CPU 0
[162004.537277] rcu:    CB 2^0->3 KbclSW F2911 L2911 C7 ..... q0 S CPU 0
[162004.537742] rcu:    CB 3^0->-1 KbclSW F1686 L1686 C2 ..... q0 S CPU 0
[162004.538217] rcu: nocb GP 4 KldtS W[..] ..:0 rnp 4:7 2176869 S CPU 0
[162004.538729] rcu:    CB 4^4->5 KbclSW F2912 L2912 C7 ..... q0 S CPU 0
[162004.539202] rcu:    CB 5^4->6 KbclSW F2871 L2872 C1 ..... q0 S CPU 0
[162004.539667] rcu:    CB 6^4->7 KbclSW F4060 L4060 C0 ..... q0 S CPU 0
[162004.540136] rcu:    CB 7^4->-1 KbclSW F5763 L5763 C1 ..... q0 S CPU 0
[162004.540653] rcu: RCU callbacks invoked since boot: 1431149091
[162004.541076] rcu-torture: rcu_torture_stats is stopping
"

I have no idea whether this is related to the reverted commit.


Thanks,

Pingfan

Attachment: remote-log
Description: Binary data

