On Mon, Dec 16, 2024 at 03:38:20PM +0100, Tomas Glozar wrote:
> On Sun, Dec 15, 2024 at 7:41 PM, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > And the fix for the TREE03 too-short grace periods is finally in, at
> > least in prototype form:
> >
> > https://lore.kernel.org/all/da5065c4-79ba-431f-9d7e-1ca314394443@paulmck-laptop/
> >
> > Or this commit on -rcu:
> >
> > 22bee20913a1 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection")
> >
> > This passes more than 30 hours of 400 concurrent instances of rcutorture's
> > TREE03 scenario, with modifications that brought the bug-reproduction
> > rate up to 50 per hour.  I therefore have strong reason to believe that
> > this fix is a real fix.
> >
> > With this fix in place, a 20-hour run of 400 concurrent instances
> > of rcutorture's TREE03 scenario resulted in 50 instances of the
> > enqueue_dl_entity() splat pair.  One (untrimmed) instance of this pair
> > of splats is shown below.
> >
> > You guys did reproduce this some time back, so unless you tell me
> > otherwise, I will assume that you have this in hand.  I would of course
> > be quite happy to help, especially with adding carefully chosen debug
> > code (heisenbugs and all that) or testing of alleged fixes.
>
> The same splat was recently reported to LKML [1], and a patchset was
> sent and merged into tip/sched/urgent that fixes a few bugs around
> double enqueue of the deadline server [2].  I'm currently re-running
> TREE03 with those patches, hoping they will also fix this issue.

Thank you very much!

An initial four-hour test of 400 instances of an enhanced TREE03 ran
error-free.  I would have expected about 10 errors, so this gives me
99.9+% confidence that the patches improved things at least a little
bit and 99% confidence that they reduced the error rate by at least a
factor of two.  I am starting an overnight run.
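[Editorial note: the confidence figures above can be checked with a quick
back-of-the-envelope sketch, assuming errors arrive as a Poisson process
(an assumption on my part, not a model stated in the thread):]

```python
import math

def p_zero_errors(expected):
    """Poisson probability of observing zero events when 'expected' are due."""
    return math.exp(-expected)

# Unchanged error rate: the four-hour run would have been expected to
# produce about 10 errors, so an error-free run is very unlikely --
# hence 99.9+% confidence that the patches helped at all.
p_unchanged = p_zero_errors(10)   # well below 0.1%

# Error rate merely halved: about 5 errors expected; a clean run is
# still under 1% likely -- hence 99% confidence in at least a 2x fix.
p_halved = p_zero_errors(5)       # under 1%

print(f"P(0 errors | 10 expected) = {p_unchanged:.2e}")
print(f"P(0 errors |  5 expected) = {p_halved:.2e}")
```

The same arithmetic gives the order-of-magnitude claim for the overnight
run: 20 error-free hours at 1/10th of the old ~50-errors-per-20-hours
rate would still expect ~5 errors, again below the 1% threshold.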
If that completes without error, this will provide 99% confidence that
these patches reduced the error rate by at least an order of magnitude.

> Also, last week I came up with some more extensive tracing, which
> showed dl_server_update and dl_server_start happening right after each
> other, apparently during the same run of enqueue_task_fair (see
> below).  I'm currently looking into that to figure out whether the
> mechanism shown by the trace is fixed by the patchset.

And for whatever it is worth, judging by my console output, I am seeing
something similar.

							Thanx, Paul

> --------------------------
>
> rcu_tort-148     1dN.3. 20531758076us : dl_server_stop <-dequeue_entities
> rcu_tort-148     1dN.2. 20531758076us : dl_server_queue: cpu=1 level=2 enqueue=0
> rcu_tort-148     1dN.3. 20531758078us : <stack trace>
>  => trace_event_raw_event_dl_server_queue
>  => dl_server_stop
>  => dequeue_entities
>  => dequeue_task_fair
>  => __schedule
>  => schedule
>  => schedule_hrtimeout_range_clock
>  => torture_hrtimeout_us
>  => rcu_torture_writer
>  => kthread
>  => ret_from_fork
>  => ret_from_fork_asm
> rcu_tort-148     1dN.3. 20531758095us : dl_server_update <-update_curr
> rcu_tort-148     1dN.3. 20531758097us : dl_server_update <-update_curr
> rcu_tort-148     1dN.2. 20531758101us : dl_server_queue: cpu=1 level=2 enqueue=1
> rcu_tort-148     1dN.3. 20531758103us : <stack trace>
> rcu_tort-148     1dN.2. 20531758104us : dl_server_queue: cpu=1 level=1 enqueue=1
> rcu_tort-148     1dN.3. 20531758106us : <stack trace>
> rcu_tort-148     1dN.2. 20531758106us : dl_server_queue: cpu=1 level=0 enqueue=1
> rcu_tort-148     1dN.3. 20531758108us : <stack trace>
>  => trace_event_raw_event_dl_server_queue
>  => rb_insert_color
>  => enqueue_dl_entity
>  => update_curr_dl_se
>  => update_curr
>  => enqueue_task_fair
>  => enqueue_task
>  => activate_task
>  => attach_task
>  => sched_balance_rq
>  => sched_balance_newidle.constprop.0
>  => pick_next_task_fair
>  => __schedule
>  => schedule
>  => schedule_hrtimeout_range_clock
>  => torture_hrtimeout_us
>  => rcu_torture_writer
>  => kthread
>  => ret_from_fork
>  => ret_from_fork_asm
> rcu_tort-148     1dN.3. 20531758110us : dl_server_start <-enqueue_task_fair
> rcu_tort-148     1dN.2. 20531758110us : dl_server_queue: cpu=1 level=2 enqueue=1
> rcu_tort-148     1dN.3. 20531760934us : <stack trace>
>  => trace_event_raw_event_dl_server_queue
>  => enqueue_dl_entity
>  => dl_server_start
>  => enqueue_task_fair
>  => enqueue_task
>  => activate_task
>  => attach_task
>  => sched_balance_rq
>  => sched_balance_newidle.constprop.0
>  => pick_next_task_fair
>  => __schedule
>  => schedule
>  => schedule_hrtimeout_range_clock
>  => torture_hrtimeout_us
>  => rcu_torture_writer
>  => kthread
>  => ret_from_fork
>  => ret_from_fork_asm
>
> [1] - https://lore.kernel.org/lkml/571b2045-320d-4ac2-95db-1423d0277613@xxxxxxx/
> [2] - https://lore.kernel.org/lkml/20241213032244.877029-1-vineeth@xxxxxxxxxxxxxxx/
>
> > Just let me know!
> >
> > 							Thanx, Paul
>
> Tomas
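[Editorial note: the mechanism Tomas's trace shows -- the dl-server entity
enqueued via update_curr -> update_curr_dl_se and then enqueued again via
dl_server_start within the same enqueue_task_fair, with no intervening
dequeue -- can be illustrated with a toy model.  The class and names below
are hypothetical, not kernel code:]

```python
class DlServerEntity:
    """Toy stand-in for a dl-server scheduling entity."""

    def __init__(self):
        self.on_rq = False  # is the entity currently on the dl runqueue?

    def enqueue(self):
        # A second enqueue without an intervening dequeue is the toy
        # analogue of the enqueue_dl_entity() splat in the trace.
        if self.on_rq:
            raise RuntimeError("double enqueue of dl-server entity")
        self.on_rq = True

    def dequeue(self):
        self.on_rq = False

se = DlServerEntity()
se.dequeue()    # dl_server_stop <- dequeue_entities
se.enqueue()    # enqueue_dl_entity <- update_curr_dl_se <- update_curr
try:
    se.enqueue()  # enqueue_dl_entity <- dl_server_start
    hit_splat = False
except RuntimeError:
    hit_splat = True
print("splat reproduced:", hit_splat)
```

This only models the ordering visible in the trace; whether the merged
tip/sched/urgent patchset closes exactly this path is what the re-run of
TREE03 is meant to determine.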