On Mon, Dec 16, 2024 at 03:38:20PM +0100, Tomas Glozar wrote:
> On Sun, Dec 15, 2024 at 7:41 PM, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > And the fix for the TREE03 too-short grace periods is finally in, at
> > least in prototype form:
> >
> > https://lore.kernel.org/all/da5065c4-79ba-431f-9d7e-1ca314394443@paulmck-laptop/
> >
> > Or this commit on -rcu:
> >
> > 22bee20913a1 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection")
> >
> > This passes more than 30 hours of 400 concurrent instances of rcutorture's
> > TREE03 scenario, with modifications that brought the bug-reproduction
> > rate up to 50 per hour.  I therefore have strong reason to believe that
> > this fix is a real fix.
> >
> > With this fix in place, a 20-hour run of 400 concurrent instances
> > of rcutorture's TREE03 scenario resulted in 50 instances of the
> > enqueue_dl_entity() splat pair.  One (untrimmed) instance of this pair
> > of splats is shown below.
> >
> > You guys did reproduce this some time back, so unless you tell me
> > otherwise, I will assume that you have this in hand.  I would of course
> > be quite happy to help, especially with adding carefully chosen debug
> > code (heisenbugs and all that) or testing of alleged fixes.
>
> The same splat was recently reported to LKML [1], and a patchset was
> sent and merged into tip/sched/urgent that fixes a few bugs around
> double enqueue of the deadline server [2].  I'm currently re-running
> TREE03 with those patches, hoping they will also fix this issue.

Thank you very much!

An initial four-hour test of 400 instances of an enhanced TREE03 ran
error-free.  I would have expected about 10 errors, so this gives me
99.9+% confidence that the patches improved things at least a little
bit and 99% confidence that they reduced the error rate by at least a
factor of two.  I am starting an overnight run.
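[Editorial note: the confidence figures above can be checked with a quick
back-of-the-envelope sketch, assuming errors arrive as a Poisson process
(an assumption on my part, not a model stated in the thread):]

```python
import math

def p_zero_errors(expected):
    """Poisson probability of observing zero events when 'expected' are due."""
    return math.exp(-expected)

# Unchanged error rate: the four-hour run would have been expected to
# produce about 10 errors, so an error-free run is very unlikely --
# hence 99.9+% confidence that the patches helped at all.
p_unchanged = p_zero_errors(10)   # well below 0.1%

# Error rate merely halved: about 5 errors expected; a clean run is
# still under 1% likely -- hence 99% confidence in at least a 2x fix.
p_halved = p_zero_errors(5)       # under 1%

print(f"P(0 errors | 10 expected) = {p_unchanged:.2e}")
print(f"P(0 errors |  5 expected) = {p_halved:.2e}")
```

The same arithmetic gives the order-of-magnitude claim for the overnight
run: 20 error-free hours at 1/10th of the old ~50-errors-per-20-hours
rate would still expect ~5 errors, again below the 1% threshold.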
If that completes without error, this will provide 99% confidence that
these patches reduced the error rate by at least an order of magnitude.

> Also, last week I came up with some more extensive tracing, which
> showed dl_server_update and dl_server_start happening right after each
> other, apparently during the same run of enqueue_task_fair (see
> below).  I'm currently looking into that to figure out whether the
> mechanism shown by the trace is fixed by the patchset.

And for whatever it is worth, judging by my console output, I am seeing
something similar.

							Thanx, Paul

> --------------------------
>
> rcu_tort-148     1dN.3. 20531758076us : dl_server_stop <-dequeue_entities
> rcu_tort-148     1dN.2. 20531758076us : dl_server_queue: cpu=1 level=2 enqueue=0
> rcu_tort-148     1dN.3. 20531758078us : <stack trace>
>  => trace_event_raw_event_dl_server_queue
>  => dl_server_stop
>  => dequeue_entities
>  => dequeue_task_fair
>  => __schedule
>  => schedule
>  => schedule_hrtimeout_range_clock
>  => torture_hrtimeout_us
>  => rcu_torture_writer
>  => kthread
>  => ret_from_fork
>  => ret_from_fork_asm
> rcu_tort-148     1dN.3. 20531758095us : dl_server_update <-update_curr
> rcu_tort-148     1dN.3. 20531758097us : dl_server_update <-update_curr
> rcu_tort-148     1dN.2. 20531758101us : dl_server_queue: cpu=1 level=2 enqueue=1
> rcu_tort-148     1dN.3. 20531758103us : <stack trace>
> rcu_tort-148     1dN.2. 20531758104us : dl_server_queue: cpu=1 level=1 enqueue=1
> rcu_tort-148     1dN.3. 20531758106us : <stack trace>
> rcu_tort-148     1dN.2. 20531758106us : dl_server_queue: cpu=1 level=0 enqueue=1
> rcu_tort-148     1dN.3. 20531758108us : <stack trace>
>  => trace_event_raw_event_dl_server_queue
>  => rb_insert_color
>  => enqueue_dl_entity
>  => update_curr_dl_se
>  => update_curr
>  => enqueue_task_fair
>  => enqueue_task
>  => activate_task
>  => attach_task
>  => sched_balance_rq
>  => sched_balance_newidle.constprop.0
>  => pick_next_task_fair
>  => __schedule
>  => schedule
>  => schedule_hrtimeout_range_clock
>  => torture_hrtimeout_us
>  => rcu_torture_writer
>  => kthread
>  => ret_from_fork
>  => ret_from_fork_asm
> rcu_tort-148     1dN.3. 20531758110us : dl_server_start <-enqueue_task_fair
> rcu_tort-148     1dN.2. 20531758110us : dl_server_queue: cpu=1 level=2 enqueue=1
> rcu_tort-148     1dN.3. 20531760934us : <stack trace>
>  => trace_event_raw_event_dl_server_queue
>  => enqueue_dl_entity
>  => dl_server_start
>  => enqueue_task_fair
>  => enqueue_task
>  => activate_task
>  => attach_task
>  => sched_balance_rq
>  => sched_balance_newidle.constprop.0
>  => pick_next_task_fair
>  => __schedule
>  => schedule
>  => schedule_hrtimeout_range_clock
>  => torture_hrtimeout_us
>  => rcu_torture_writer
>  => kthread
>  => ret_from_fork
>  => ret_from_fork_asm
>
> [1] - https://lore.kernel.org/lkml/571b2045-320d-4ac2-95db-1423d0277613@xxxxxxx/
> [2] - https://lore.kernel.org/lkml/20241213032244.877029-1-vineeth@xxxxxxxxxxxxxxx/
>
> > Just let me know!
> >
> > 							Thanx, Paul
>
> Tomas
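[Editorial note: the mechanism Tomas's trace shows -- the dl-server entity
enqueued via update_curr -> update_curr_dl_se and then enqueued again via
dl_server_start within the same enqueue_task_fair, with no intervening
dequeue -- can be illustrated with a toy model.  The class and names below
are hypothetical, not kernel code:]

```python
class DlServerEntity:
    """Toy stand-in for a dl-server scheduling entity."""

    def __init__(self):
        self.on_rq = False  # is the entity currently on the dl runqueue?

    def enqueue(self):
        # A second enqueue without an intervening dequeue is the toy
        # analogue of the enqueue_dl_entity() splat in the trace.
        if self.on_rq:
            raise RuntimeError("double enqueue of dl-server entity")
        self.on_rq = True

    def dequeue(self):
        self.on_rq = False

se = DlServerEntity()
se.dequeue()    # dl_server_stop <- dequeue_entities
se.enqueue()    # enqueue_dl_entity <- update_curr_dl_se <- update_curr
try:
    se.enqueue()  # enqueue_dl_entity <- dl_server_start
    hit_splat = False
except RuntimeError:
    hit_splat = True
print("splat reproduced:", hit_splat)
```

This only models the ordering visible in the trace; whether the merged
tip/sched/urgent patchset closes exactly this path is what the re-run of
TREE03 is meant to determine.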