On Sun, Dec 15, 2024 at 7:41 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> And the fix for the TREE03 too-short grace periods is finally in, at
> least in prototype form:
>
> https://lore.kernel.org/all/da5065c4-79ba-431f-9d7e-1ca314394443@paulmck-laptop/
>
> Or this commit on -rcu:
>
> 22bee20913a1 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection")
>
> This passes more than 30 hours of 400 concurrent instances of rcutorture's
> TREE03 scenario, with modifications that brought the bug reproduction
> rate up to 50 per hour.  I therefore have strong reason to believe that
> this fix is a real fix.
>
> With this fix in place, a 20-hour run of 400 concurrent instances
> of rcutorture's TREE03 scenario resulted in 50 instances of the
> enqueue_dl_entity() splat pair.  One (untrimmed) instance of this pair
> of splats is shown below.
>
> You guys did reproduce this some time back, so unless you tell me
> otherwise, I will assume that you have this in hand.  I would of course
> be quite happy to help, especially with adding carefully chosen debug
> (heisenbug and all that) or testing of alleged fixes.

The same splat was recently reported on LKML [1], and a patchset that
fixes a few bugs around double enqueue of the deadline server has since
been sent and merged into tip/sched/urgent [2]. I'm currently re-running
TREE03 with those patches, hoping they will also fix this issue.

Also, last week I came up with some more extensive tracing, which showed
dl_server_update() and dl_server_start() running right after each other,
each reaching enqueue_dl_entity(), apparently during the same invocation
of enqueue_task_fair() (see the trace below). I'm currently looking into
that to figure out whether the mechanism shown by the trace is fixed by
the patchset.
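For anyone who wants to reproduce the tracing: the dl_server_queue events
in the trace below are custom instrumentation (the
trace_event_raw_event_dl_server_queue frames in the stacks are what the
TRACE_EVENT() macro generates). A minimal sketch of such a tracepoint,
assuming only the three fields visible in the output -- the actual debug
patch may differ -- would be:

#undef TRACE_SYSTEM
#define TRACE_SYSTEM sched

#if !defined(_TRACE_DL_SERVER_DEBUG_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_DL_SERVER_DEBUG_H

#include <linux/tracepoint.h>

/*
 * Fired from the dl_se enqueue/dequeue paths, e.g.
 * trace_dl_server_queue(cpu_of(rq), level, 1) next to the
 * enqueue_dl_entity() call sites; "level" is just an arbitrary
 * marker distinguishing call sites in this sketch.
 */
TRACE_EVENT(dl_server_queue,

	TP_PROTO(int cpu, int level, int enqueue),

	TP_ARGS(cpu, level, enqueue),

	TP_STRUCT__entry(
		__field(int,	cpu)
		__field(int,	level)
		__field(int,	enqueue)
	),

	TP_fast_assign(
		__entry->cpu		= cpu;
		__entry->level		= level;
		__entry->enqueue	= enqueue;
	),

	TP_printk("cpu=%d level=%d enqueue=%d",
		  __entry->cpu, __entry->level, __entry->enqueue)
);

#endif /* _TRACE_DL_SERVER_DEBUG_H */

/* Boilerplate for a header living outside include/trace/events/. */
#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH .
#undef TRACE_INCLUDE_FILE
#define TRACE_INCLUDE_FILE dl_server_debug
#include <trace/define_trace.h>

(Exactly one .c file needs #define CREATE_TRACE_POINTS before including
the header.) The per-event stack traces can then be had with the
stacktrace trigger, e.g.:

  echo stacktrace > /sys/kernel/tracing/events/sched/dl_server_queue/trigger

and the "dl_server_stop <-dequeue_entities" style lines look like the
function tracer with dl_server_* in set_ftrace_filter.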
--------------------------

 rcu_tort-148     1dN.3. 20531758076us : dl_server_stop <-dequeue_entities
 rcu_tort-148     1dN.2. 20531758076us : dl_server_queue: cpu=1 level=2 enqueue=0
 rcu_tort-148     1dN.3. 20531758078us : <stack trace>
 => trace_event_raw_event_dl_server_queue
 => dl_server_stop
 => dequeue_entities
 => dequeue_task_fair
 => __schedule
 => schedule
 => schedule_hrtimeout_range_clock
 => torture_hrtimeout_us
 => rcu_torture_writer
 => kthread
 => ret_from_fork
 => ret_from_fork_asm
 rcu_tort-148     1dN.3. 20531758095us : dl_server_update <-update_curr
 rcu_tort-148     1dN.3. 20531758097us : dl_server_update <-update_curr
 rcu_tort-148     1dN.2. 20531758101us : dl_server_queue: cpu=1 level=2 enqueue=1
 rcu_tort-148     1dN.3. 20531758103us : <stack trace>
 rcu_tort-148     1dN.2. 20531758104us : dl_server_queue: cpu=1 level=1 enqueue=1
 rcu_tort-148     1dN.3. 20531758106us : <stack trace>
 rcu_tort-148     1dN.2. 20531758106us : dl_server_queue: cpu=1 level=0 enqueue=1
 rcu_tort-148     1dN.3. 20531758108us : <stack trace>
 => trace_event_raw_event_dl_server_queue
 => rb_insert_color
 => enqueue_dl_entity
 => update_curr_dl_se
 => update_curr
 => enqueue_task_fair
 => enqueue_task
 => activate_task
 => attach_task
 => sched_balance_rq
 => sched_balance_newidle.constprop.0
 => pick_next_task_fair
 => __schedule
 => schedule
 => schedule_hrtimeout_range_clock
 => torture_hrtimeout_us
 => rcu_torture_writer
 => kthread
 => ret_from_fork
 => ret_from_fork_asm
 rcu_tort-148     1dN.3. 20531758110us : dl_server_start <-enqueue_task_fair
 rcu_tort-148     1dN.2. 20531758110us : dl_server_queue: cpu=1 level=2 enqueue=1
 rcu_tort-148     1dN.3. 20531760934us : <stack trace>
 => trace_event_raw_event_dl_server_queue
 => enqueue_dl_entity
 => dl_server_start
 => enqueue_task_fair
 => enqueue_task
 => activate_task
 => attach_task
 => sched_balance_rq
 => sched_balance_newidle.constprop.0
 => pick_next_task_fair
 => __schedule
 => schedule
 => schedule_hrtimeout_range_clock
 => torture_hrtimeout_us
 => rcu_torture_writer
 => kthread
 => ret_from_fork
 => ret_from_fork_asm

[1] - https://lore.kernel.org/lkml/571b2045-320d-4ac2-95db-1423d0277613@xxxxxxx/
[2] - https://lore.kernel.org/lkml/20241213032244.877029-1-vineeth@xxxxxxxxxxxxxxx/

> Just let me know!
>
> 							Thanx, Paul

Tomas