On Mon, Nov 21, 2022 at 11:48:29AM +0800, Pingfan Liu wrote:
> On Sat, Nov 19, 2022 at 7:30 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > On Fri, Nov 18, 2022 at 08:08:35PM +0800, Pingfan Liu wrote:
> > > On Thu, Nov 10, 2022 at 2:55 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Nov 07, 2022 at 08:07:26AM -0800, Paul E. McKenney wrote:
> > > > > On Thu, Nov 03, 2022 at 09:51:43AM -0700, Paul E. McKenney wrote:
> > > > > > On Mon, Oct 31, 2022 at 11:24:37AM +0800, Pingfan Liu wrote:
> > > > > > > On Fri, Oct 28, 2022 at 1:46 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > On Mon, Oct 10, 2022 at 09:55:26AM +0800, Pingfan Liu wrote:
> > > > > > > > > On Mon, Oct 3, 2022 at 12:20 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > But unfortunately, I did not keep the data. I will run it again and
> > > > > > > > > > > submit the data.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I have finished the test on a machine with two sockets and 256 cpus.
> > > > > > > > > The test runs against the kernel with three commits reverted.
> > > > > > > > > 96926686deab ("rcu: Make CPU-hotplug removal operations enable tick")
> > > > > > > > > 53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full
> > > > > > > > > IRQ entry")
> > > > > > > > > a1ff03cd6fb9c5 ("tick: Detect and fix jiffies update stall")
> > > > > > > > >
> > > > > > > > > Summary from console.log
> > > > > > > > > "
> > > > > > > > > --- Sat Oct 8 11:34:02 AM EDT 2022 Test summary:
> > > > > > > > > Results directory:
> > > > > > > > > /home/linux/tools/testing/selftests/rcutorture/res/2022.10.07-23.10.54
> > > > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration
> > > > > > > > > 125h --bootargs rcutorture.onoff_interval=200
> > > > > > > > > rcutorture.onoff_holdoff=30 --configs 32*TREE04
> > > > > > > > > TREE04 ------- 1365444 GPs (3.03432/s) n_max_cbs: 850290
> > > > > > > > > TREE04 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44512 vs. 450000
> > > > > > > > > TREE04.10 ------- 1331565 GPs (2.95903/s) n_max_cbs: 909075
> > > > > > > > > TREE04.10 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44511 vs. 450000
> > > > > > > > > TREE04.11 ------- 1331535 GPs (2.95897/s) n_max_cbs: 1213974
> > > > > > > > > TREE04.11 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44511 vs. 450000
> > > > > > > > > TREE04.12 ------- 1322160 GPs (2.93813/s) n_max_cbs: 2615313
> > > > > > > > > TREE04.12 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44511 vs. 450000
> > > > > > > > > TREE04.13 ------- 1320032 GPs (2.9334/s) n_max_cbs: 914751
> > > > > > > > > TREE04.13 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44511 vs. 450000
> > > > > > > > > TREE04.14 ------- 1339969 GPs (2.97771/s) n_max_cbs: 1560203
> > > > > > > > > TREE04.14 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44511 vs. 450000
> > > > > > > > > TREE04.15 ------- 1318805 GPs (2.93068/s) n_max_cbs: 1757478
> > > > > > > > > TREE04.15 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44510 vs. 450000
> > > > > > > > > TREE04.16 ------- 1340633 GPs (2.97918/s) n_max_cbs: 1377647
> > > > > > > > > TREE04.16 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44510 vs. 450000
> > > > > > > > > TREE04.17 ------- 1322798 GPs (2.93955/s) n_max_cbs: 1266344
> > > > > > > > > TREE04.17 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44511 vs. 450000
> > > > > > > > > TREE04.18 ------- 1346302 GPs (2.99178/s) n_max_cbs: 1030713
> > > > > > > > > TREE04.18 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44511 vs. 450000
> > > > > > > > > TREE04.19 ------- 1322499 GPs (2.93889/s) n_max_cbs: 917118
> > > > > > > > > TREE04.19 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44511 vs. 450000
> > > > > > > > > ...
> > > > > > > > > TREE04.4 ------- 1310283 GPs (2.91174/s) n_max_cbs: 2146905
> > > > > > > > > TREE04.4 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44511 vs. 450000
> > > > > > > > > TREE04.5 ------- 1333238 GPs (2.96275/s) n_max_cbs: 1027172
> > > > > > > > > TREE04.5 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44511 vs. 450000
> > > > > > > > > TREE04.6 ------- 1313915 GPs (2.91981/s) n_max_cbs: 1017511
> > > > > > > > > TREE04.6 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44511 vs. 450000
> > > > > > > > > TREE04.7 ------- 1341871 GPs (2.98194/s) n_max_cbs: 816265
> > > > > > > > > TREE04.7 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44511 vs. 450000
> > > > > > > > > TREE04.8 ------- 1339412 GPs (2.97647/s) n_max_cbs: 1316404
> > > > > > > > > TREE04.8 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44511 vs. 450000
> > > > > > > > > TREE04.9 ------- 1327240 GPs (2.94942/s) n_max_cbs: 1409531
> > > > > > > > > TREE04.9 no success message, 2897 successful version messages
> > > > > > > > > Completed in 44510 vs. 450000
> > > > > > > > > 32 runs with runtime errors.
> > > > > > > > > --- Done at Sat Oct 8 11:34:10 AM EDT 2022 (12:23:16) exitcode 2
> > > > > > > > > "
> > > > > > > > > I have no idea about the test so just arbitrarily pick up the
> > > > > > > > > console.log of TREE04.10 as an example. Please get it from attachment.
> > > > > > > >
> > > > > > > > Very good, thank you!
> > > > > > > >
> > > > > > > > Could you please clearly indicate what you tested? For example, if
> > > > > > > > you have an externally visible git tree, please point me at the tree
> > > > > > > > and the SHA-1. Or send a patch series clearly indicating what it is
> > > > > > > > based on.
> > > > > > > >
> > > > > > >
> > > > > > > Yes, it is a good way to eliminate any unexpected mistakes before a rigid test.
> > > > > > >
> > > > > > > Please clone it from https://github.com/pfliu/linux.git branch:
> > > > > > > rcu#revert_tick_dep
> > > > > >
> > > > > > Thank you very much!
> > > > > >
> > > > > > > > Then I can try a long run on a larger collection of systems.
> > > > > > > >
> > > > > > >
> > > > > > > Thank you very much.
> > > > > > >
> > > > > > > > If that works out, we can see about adjustments to mainline. ;-)
> > > > > > > >
> > > > > > >
> > > > > > > Eager to see.
> > > > > >
> > > > > > I ran 200 hours of TREE04 and got an RCU CPU stall warning. I ran 2000
> > > > > > hours on v6.0, which precedes these commits, and everything passed.
> > > > > >
> > > > > > I will run more, primarily on v6.0, but that is what I have thus far.
> > > > > > At the moment, I have some concerns about this change.
> > > > >
> > > > > OK, so I have run a total of 8000 hours on v6.0 without failure. I have
> > > > > run 4200 hours on rcu#revert_tick_dep with 15 failures. The ones I
> > > > > looked at were RCU CPU stall warnings with timer failures.
> > > > >
> > > > > This data suggests that the kernel is not yet ready for that commit
> > > > > to be reverted.
> > > >
> > > > Even if the tests pass, can we really survive with this patch
> > > > that he reverted?
> > > > https://github.com/pfliu/linux/commit/03179ef33e8e2608184ade99a27f760f9d01e6b7
> > > >
> > > > If stop machine on a CPU spends a good amount of time in kernel mode, while a
> > > > grace period starts on another CPU, then we're kind of screwed if we don't
> > > > have the tick enabled right?
> > >
> > > In this case, I think multi_cpu_stop()->rcu_momentary_dyntick_idle()
> > > can serve this purpose.
> >
> > There are a lot of things that can go wrong in this scenario. Does that
> > added rcu_momentary_dyntick_idle() cover everything that still needs to
> > be covered?
>
> Absorbed by the context "while a grace period starts on another CPU",
> I assumed it is from the RCU perspective. Then I think every non-idle
> cpu can report its quiescent state in that case.
>
> But from a system perspective, I think that it is promising but not
> sure. And that is what the TREE04 torture test tries to verify.

The specific case I was thinking of is when all but the incoming CPU
are spinning with interrupts disabled, and the incoming CPU is getting
hammered with scheduler-tick interrupts.

In that case, the RCU grace-period kthread won't be running anyway.
So there will be nothing to notice the calls to
rcu_momentary_dyntick_idle().

Thanx, Paul
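
For anyone reading the quoted "Test summary" figures, here is a small arithmetic check, offered only as a sketch. It uses the grace-period counts and durations copied from the summary above. The helper names (`rate`, `check`) and the six-significant-digit formatting are assumptions made for illustration, not anything taken from the rcutorture scripts. Under those assumptions, the printed GPs/s values look consistent with division by the requested duration (125h = 450000 s) rather than by the "Completed in" time.

```python
# Figures copied from the quoted "Test summary" for three of the runs:
# (label, grace periods, rate string as printed in the summary).
REPORTED = [
    ("TREE04", 1365444, "3.03432"),
    ("TREE04.10", 1331565, "2.95903"),
    ("TREE04.13", 1320032, "2.9334"),
]

REQUESTED_S = 125 * 3600  # --duration 125h, i.e. the "450000" in the summary
COMPLETED_S = 44512       # "Completed in 44512 vs. 450000" (TREE04 line)


def rate(gps, seconds):
    """Grace periods per second, formatted with six significant digits.

    The formatting choice is an assumption made to match the printed values.
    """
    return format(gps / seconds, ".6g")


def check():
    """Return (label, matches) pairs comparing recomputed and printed rates."""
    return [
        (label, rate(gps, REQUESTED_S) == printed)
        for label, gps, printed in REPORTED
    ]


if __name__ == "__main__":
    for label, ok in check():
        print(label, "matches requested-duration rate:", ok)
    # Rate over the time the run actually lasted, for comparison.
    print("TREE04 rate over completed time:", rate(1365444, COMPLETED_S))
```

If that reading is right, the per-second rates in the summary understate the rate over the time the runs actually lasted by roughly a factor of ten, since each run stopped after about 44511 s of the requested 450000 s.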