Re: [PATCHv2 3/3] rcu: coordinate tick dependency during concurrent offlining

On Thu, Nov 10, 2022 at 2:55 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, Nov 07, 2022 at 08:07:26AM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 03, 2022 at 09:51:43AM -0700, Paul E. McKenney wrote:
> > > On Mon, Oct 31, 2022 at 11:24:37AM +0800, Pingfan Liu wrote:
> > > > On Fri, Oct 28, 2022 at 1:46 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Mon, Oct 10, 2022 at 09:55:26AM +0800, Pingfan Liu wrote:
> > > > > > On Mon, Oct 3, 2022 at 12:20 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > [...]
> > > > > >
> > > > > > > >
> > > > > > > > But unfortunately, I did not keep the data. I will run it again and
> > > > > > > > submit the data.
> > > > > > >
> > > > > >
> > > > > > I have finished the test on a machine with two sockets and 256 cpus.
> > > > > > The test runs against the kernel with three commits reverted.
> > > > > >   96926686deab ("rcu: Make CPU-hotplug removal operations enable tick")
> > > > > > >   53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
> > > > > >   a1ff03cd6fb9c5 ("tick: Detect and fix jiffies update stall")
> > > > > >
> > > > > > Summary from console.log
> > > > > > "
> > > > > >  --- Sat Oct  8 11:34:02 AM EDT 2022 Test summary:
> > > > > > > Results directory: /home/linux/tools/testing/selftests/rcutorture/res/2022.10.07-23.10.54
> > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 125h --bootargs rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30 --configs 32*TREE04
> > > > > > TREE04 ------- 1365444 GPs (3.03432/s) n_max_cbs: 850290
> > > > > > TREE04 no success message, 2897 successful version messages
> > > > > > Completed in 44512 vs. 450000
> > > > > > TREE04.10 ------- 1331565 GPs (2.95903/s) n_max_cbs: 909075
> > > > > > TREE04.10 no success message, 2897 successful version messages
> > > > > > Completed in 44511 vs. 450000
> > > > > > TREE04.11 ------- 1331535 GPs (2.95897/s) n_max_cbs: 1213974
> > > > > > TREE04.11 no success message, 2897 successful version messages
> > > > > > Completed in 44511 vs. 450000
> > > > > > TREE04.12 ------- 1322160 GPs (2.93813/s) n_max_cbs: 2615313
> > > > > > TREE04.12 no success message, 2897 successful version messages
> > > > > > Completed in 44511 vs. 450000
> > > > > > TREE04.13 ------- 1320032 GPs (2.9334/s) n_max_cbs: 914751
> > > > > > TREE04.13 no success message, 2897 successful version messages
> > > > > > Completed in 44511 vs. 450000
> > > > > > TREE04.14 ------- 1339969 GPs (2.97771/s) n_max_cbs: 1560203
> > > > > > TREE04.14 no success message, 2897 successful version messages
> > > > > > Completed in 44511 vs. 450000
> > > > > > TREE04.15 ------- 1318805 GPs (2.93068/s) n_max_cbs: 1757478
> > > > > > TREE04.15 no success message, 2897 successful version messages
> > > > > > Completed in 44510 vs. 450000
> > > > > > TREE04.16 ------- 1340633 GPs (2.97918/s) n_max_cbs: 1377647
> > > > > > TREE04.16 no success message, 2897 successful version messages
> > > > > > Completed in 44510 vs. 450000
> > > > > > TREE04.17 ------- 1322798 GPs (2.93955/s) n_max_cbs: 1266344
> > > > > > TREE04.17 no success message, 2897 successful version messages
> > > > > > Completed in 44511 vs. 450000
> > > > > > TREE04.18 ------- 1346302 GPs (2.99178/s) n_max_cbs: 1030713
> > > > > > TREE04.18 no success message, 2897 successful version messages
> > > > > > Completed in 44511 vs. 450000
> > > > > > TREE04.19 ------- 1322499 GPs (2.93889/s) n_max_cbs: 917118
> > > > > > TREE04.19 no success message, 2897 successful version messages
> > > > > > Completed in 44511 vs. 450000
> > > > > > ...
> > > > > > TREE04.4 ------- 1310283 GPs (2.91174/s) n_max_cbs: 2146905
> > > > > > TREE04.4 no success message, 2897 successful version messages
> > > > > > Completed in 44511 vs. 450000
> > > > > > TREE04.5 ------- 1333238 GPs (2.96275/s) n_max_cbs: 1027172
> > > > > > TREE04.5 no success message, 2897 successful version messages
> > > > > > Completed in 44511 vs. 450000
> > > > > > TREE04.6 ------- 1313915 GPs (2.91981/s) n_max_cbs: 1017511
> > > > > > TREE04.6 no success message, 2897 successful version messages
> > > > > > Completed in 44511 vs. 450000
> > > > > > TREE04.7 ------- 1341871 GPs (2.98194/s) n_max_cbs: 816265
> > > > > > TREE04.7 no success message, 2897 successful version messages
> > > > > > Completed in 44511 vs. 450000
> > > > > > TREE04.8 ------- 1339412 GPs (2.97647/s) n_max_cbs: 1316404
> > > > > > TREE04.8 no success message, 2897 successful version messages
> > > > > > Completed in 44511 vs. 450000
> > > > > > TREE04.9 ------- 1327240 GPs (2.94942/s) n_max_cbs: 1409531
> > > > > > TREE04.9 no success message, 2897 successful version messages
> > > > > > Completed in 44510 vs. 450000
> > > > > > 32 runs with runtime errors.
> > > > > >  --- Done at Sat Oct  8 11:34:10 AM EDT 2022 (12:23:16) exitcode 2
> > > > > > "
> > > > > > > I am not familiar with the test, so I arbitrarily picked the
> > > > > > > console.log of TREE04.10 as an example. Please see the attachment.
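
A side note on reading the summary above: the per-second grace-period
figures appear to be normalized by the requested duration (125h = 450000 s)
rather than by the 44512 s that the run actually completed. This is an
inference from the numbers alone, not something checked against the
rcutorture scripts. A minimal Python sketch of the arithmetic, using a
hypothetical helper name and the TREE04 line as input:

```python
# Values copied from the TREE04 line of the quoted summary.
REQUESTED_SECONDS = 125 * 3600   # --duration 125h
COMPLETED_SECONDS = 44512        # "Completed in 44512 vs. 450000"
TREE04_GPS = 1365444             # grace periods reported for TREE04


def gp_rate(gps: int, seconds: int) -> float:
    """Grace periods per second over the given interval (hypothetical helper)."""
    return gps / seconds


# Rate over the requested duration; this matches the figure in the summary.
reported = gp_rate(TREE04_GPS, REQUESTED_SECONDS)
# Rate over the time the run actually lasted.
actual = gp_rate(TREE04_GPS, COMPLETED_SECONDS)

print(f"{reported:.6g} GPs/s over the requested duration")
print(f"{actual:.6g} GPs/s over the completed duration")
```

If that reading is right, the runs were sustaining roughly ten times the
grace-period rate that the summary lines show.
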
> > > > >
> > > > > Very good, thank you!
> > > > >
> > > > > Could you please clearly indicate what you tested?  For example, if
> > > > > you have an externally visible git tree, please point me at the tree
> > > > > and the SHA-1.  Or send a patch series clearly indicating what it is
> > > > > based on.
> > > > >
> > > >
> > > > > Yes, that is a good way to eliminate unexpected mistakes before a rigorous test.
> > > >
> > > > Please clone it from https://github.com/pfliu/linux.git  branch:
> > > > rcu#revert_tick_dep
> > >
> > > Thank you very much!
> > >
> > > > > Then I can try a long run on a larger collection of systems.
> > > > >
> > > >
> > > > Thank you very much.
> > > >
> > > > > If that works out, we can see about adjustments to mainline.  ;-)
> > > > >
> > > >
> > > > Eager to see.
> > >
> > > I ran 200 hours of TREE04 and got an RCU CPU stall warning.  I ran 2000
> > > hours on v6.0, which precedes these commits, and everything passed.
> > >
> > > I will run more, primarily on v6.0, but that is what I have thus far.
> > > At the moment, I have some concerns about this change.
> >
> > OK, so I have run a total of 8000 hours on v6.0 without failure.  I have
> > run 4200 hours on rcu#revert_tick_dep with 15 failures.  The ones I
> > looked at were RCU CPU stall warnings with timer failures.
> >
> > This data suggests that the kernel is not yet ready for that commit
> > to be reverted.
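
To put rough numbers on this: if failures are assumed to arrive
independently at a constant rate (a Poisson model, which is an assumption
made here and not something stated above), 15 failures in 4200 hours is
about one per 280 hours. A small Python sketch, with hypothetical names,
of how likely 8000 clean hours would be at that same rate:

```python
import math

# Figures quoted above.
CLEAN_HOURS = 8000        # v6.0, no failures
REVERT_HOURS = 4200       # rcu#revert_tick_dep
REVERT_FAILURES = 15


def failure_rate(failures: int, hours: float) -> float:
    """Observed failures per test hour."""
    return failures / hours


def prob_zero_failures(rate_per_hour: float, hours: float) -> float:
    """Poisson probability of seeing no failures in `hours` at the given rate."""
    return math.exp(-rate_per_hour * hours)


rate = failure_rate(REVERT_FAILURES, REVERT_HOURS)
hours_per_failure = 1 / rate
expected_in_clean_run = rate * CLEAN_HOURS
p_zero = prob_zero_failures(rate, CLEAN_HOURS)

print(f"one failure per {hours_per_failure:.0f} hours")
print(f"expected failures in {CLEAN_HOURS} hours: {expected_in_clean_run:.1f}")
print(f"P(no failures in {CLEAN_HOURS} hours): {p_zero:.1e}")
```

Under that assumption, v6.0 would have been expected to show about 28.6
failures in 8000 hours, and the chance of seeing none is on the order of
1e-13, so the difference between the two trees is unlikely to be noise.
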
>
> > Even if the tests pass, can we really survive without the patch
> > that he reverted?
> https://github.com/pfliu/linux/commit/03179ef33e8e2608184ade99a27f760f9d01e6b7
>
> > If stop machine on a CPU spends a good amount of time in kernel mode while a
> > grace period starts on another CPU, then we're kind of screwed if we don't
> > have the tick enabled, right?
>

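In this case, I think the rcu_momentary_dyntick_idle() call in the
multi_cpu_stop() polling loop can serve this purpose, since it reports a
quiescent state for the spinning CPU without relying on the tick.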

Thanks,

    Pingfan

> > Or did we make any changes to stop machine such that this is no longer an
> > issue?
>
> thanks,
>
>  - Joel
>


