On Mon, Oct 21, 2024 at 12:25:41PM -0700, Paul E. McKenney wrote:
> On Mon, Oct 14, 2024 at 11:55:05AM -0700, Paul E. McKenney wrote:

[ . . . ]

> > But no big wins thus far, so this will be a slow process. My current test
> > disables CPU hotplug. I will be disabling other things in the hope of
> > better identifying the code paths that should be placed under suspicion.

The "this will be a slow process" was no joke...

> Disabling CPU hotplug seems to make the problem go away (though
> all I can really say is that I am 99% confident that it reduces the
> problem's incidence by at least a factor of two). This problem affects
> non-preemptible kernels and non-preemptible RCU, though it is possible
> that use of the latter reduces the failure rate (which is of course *not*
> what you want for testing).
>
> A number of experiments failed to significantly/usefully increase the
> failure rate.
>
> The problem does not seem to happen on straight normal call_rcu()
> grace periods (rcutorture.gp_normal=1), synchronize_rcu() grace periods
> (rcutorture.gp_sync=1), or synchronize_rcu_expedited() grace periods.
> Of course, my reproduction rate is still low enough that I might be
> fooled here.
>
> However, the problem does occur reasonably often on polled grace periods
> (rcutorture.gp_poll=1 rcutorture.gp_poll_exp=1 rcutorture.gp_poll_full=1
> rcutorture.gp_poll_exp_full=1). This might be a bug in the polling
> code itself, or it might be because polling grace periods do not incur
> callback and/or wakeup delays (as in the bug might still be in the
> underlying grace-period computation and polling makes it more apparent).
> This also appears to have made the problem happen more frequently,
> but not enough data to be sure yet.
>
> Currently, rcutorture does millisecond-scale waits of duration randomly
> chosen between zero and 15 milliseconds. I have started a run with
> microsecond-scale waits of duration randomly chosen between zero and
> 128 microseconds. Here is hoping that this will cause the problem to
> reproduce more quickly, and I will know more this evening, Pacific Time.

Well, that evening was a long time ago, but here finally is an update.

After some time, varying the wait duration between zero and 16 microseconds
with nanosecond granularity pushed the rate up to between 10 and 20 per
hour. This allowed me to find one entertaining bug, whose fix may be
found here in my -rcu tree:

9dfca26bcbc8 ("rcu: Make expedited grace periods wait for initialization")

The fix ensures that an expedited grace period is fully initialized before
honoring any quiescent-state reports, thus avoiding a failure scenario in
which one of a pair of quiescent-state reports could "leak" into the next
expedited grace period. But only if a concurrent CPU-hotplug operation
shows up at just the wrong time.

There are also a couple of other minor fixes of things like needless
lockless accesses:

6142841a2389 ("rcu: Make rcu_report_exp_cpu_mult() caller acquire lock")
dd8104928722 ("rcu: Move rcu_report_exp_rdp() setting of ->cpu_no_qs.b.exp under lock")

Plus quite a few additional debug checks.

So problem solved, right? Wrong!!! Those changes at best reduced the
bug rate by about 10%. So I am still beating on this thing.

But you will be happy (or more likely not) to learn that the
enqueue_dl_entity() splats that I was chasing when starting on this bug
still occasionally make their presence known. ;-)
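
For anyone not steeped in the polled grace-period API, here is a minimal
sketch (illustrative only, not lifted from rcutorture) of the basic pattern
that the gp_poll options above exercise: snapshot a cookie, do other work,
then check whether a full grace period has elapsed since the snapshot.
The _full and _expedited members of this API family are what the
gp_poll_full and gp_poll_exp options stress.

#include <linux/printk.h>
#include <linux/rcupdate.h>

static void polled_gp_sketch(void)
{
	unsigned long cookie;

	/* Snapshot the current grace-period state. */
	cookie = get_state_synchronize_rcu();

	/* ... unrelated work would go here ... */

	if (poll_state_synchronize_rcu(cookie)) {
		/* A full grace period has elapsed since the snapshot. */
		pr_info("polled grace period already complete\n");
	} else {
		/* Otherwise, fall back to a blocking wait. */
		synchronize_rcu();
	}
}
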
Added diagnostics pushed the bug rate down to about four per hour, which
isn't quite as nice as between 10 and 20 per hour, but is still something
I can work with. Back to beating on it.

More info than anyone needs is available here:

https://docs.google.com/document/d/1-JQ4QYF1qid0TWSLa76O1kusdhER2wgm0dYdwFRUzU8/edit?usp=sharing

							Thanx, Paul