Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)

On Mon, Sep 11, 2023 at 01:16:21AM -0700, Paul E. McKenney wrote:
> On Mon, Sep 11, 2023 at 02:27:25AM +0000, Joel Fernandes wrote:
> > On Sun, Sep 10, 2023 at 07:37:13PM -0400, Joel Fernandes wrote:
> > > On Sun, Sep 10, 2023 at 5:16 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > >
> > > > On Sun, Sep 10, 2023 at 08:14:45PM +0000, Joel Fernandes wrote:
> > > [...]
> > > > > > I have been running into another intermittent failure as well: the
> > > > > > boost failure, which happens once in every 10-15 runs or so.
> > > > > >
> > > > > > I was thinking of running the following configuration on a regular,
> > > > > > automated basis to at least provide a better clue on the lucky run
> > > > > > that catches the issue. The downside is that it would change timing
> > > > > > enough to possibly hide bugs. I could also make it submit logs
> > > > > > automatically to the list on such occurrences, but one step at a
> > > > > > time and all that. I do need to add (hopefully less noisy)
> > > > > > tick/timer related trace events.
> > > > > >
> > > > > > # Define the bootargs array
> > > > > > bootargs=(
> > > > > >     "ftrace_dump_on_oops"
> > > > > >     "panic_on_warn=1"
> > > > > >     "sysctl.kernel.panic_on_rcu_stall=1"
> > > > > >     "sysctl.kernel.max_rcu_stall_to_panic=1"
> > > > > >     "trace_buf_size=10K"
> > > > > >     "traceoff_on_warning=1"
> > > > > >     "panic_print=0x1f"      # To dump held locks, mem and other info.
> > > > > > )
> > > > > > # Define the trace events array passed to bootargs.
> > > > > > trace_events=(
> > > > > >     "sched:sched_switch"
> > > > > >     "sched:sched_waking"
> > > > > >     "rcu:rcu_callback"
> > > > > >     "rcu:rcu_fqs"
> > > > > >     "rcu:rcu_quiescent_state_report"
> > > > > >     "rcu:rcu_grace_period"
> > > > > > )
> > > > >
> > > > > So some insight on this boost failure. Just before the boost failures
> > > > > are reported, I see the migration thread interfering with the
> > > > > rcu_preempt thread (aka the GP kthread). See trace below. Of note is
> > > > > that the rcu_preempt thread is runnable while context switching, which
> > > > > means its execution is being interfered with. The rcu_preempt thread
> > > > > is at RT prio 2, as can be seen.
> > > > >
> > > > > So some open-ended questions: what exactly does the migration thread
> > > > > want? Is this something related to CPU hotplug? And if the migration
> > > > > thread had to run, why did the rcu_preempt thread not get pushed to
> > > > > another CPU by the scheduler? We have 16 vCPUs for this test.
> > > >
> > > > Maybe we need a cpus_read_lock() before doing a given boost-test interval
> > > > and a cpus_read_unlock() after finishing one?  But much depends on
> > > > exactly what is starting those migration threads.
> > > 
> > > But in the field, a real RT task can preempt a reader without doing
> > > cpus_read_lock(); couldn't it then run into a similar boost issue?
> 
> The sysctl_sched_rt_runtime should prevent a livelock in most
> configurations.  Here, rcutorture explicitly disables this.

I see. Though in the real world, RT throttling will actually stall the
rcu_preempt thread as well. RT throttling is a bit broken and we're trying to
fix it in scheduler land. Even if there are idle CPUs, RT throttling will
starve not just the offending RT task but all RT tasks, essentially causing a
priority inversion between running RT and CFS tasks.
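
For reference, the knob being discussed is the RT throttling budget exposed
via /proc/sys/kernel/sched_rt_runtime_us. A minimal sketch of inspecting and
disabling it from a shell (the defaults in the comments are what I believe
stock kernels ship with; rcutorture disables the throttling from inside the
kernel rather than through this sysctl):

# By default, RT tasks may consume sched_rt_runtime_us (950000us) out of
# every sched_rt_period_us (1000000us) period; the remaining ~5% is left
# so CFS tasks are not completely starved by a runaway RT task.
cat /proc/sys/kernel/sched_rt_runtime_us
cat /proc/sys/kernel/sched_rt_period_us

# Writing -1 disables RT throttling entirely, which matches the
# environment the rcutorture boost test runs in.
echo -1 > /proc/sys/kernel/sched_rt_runtime_us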

> > > > Then again, TREE03 is pretty aggressive about doing CPU hotplug.
> > > 
> > > Ok. I put a trace_printk() in the stopper thread to see what the
> > > ->fn() is. I'm doing another run to see what falls out.
> > 
> > I think I am getting somewhere. It looks like the migration thread tries to
> > push the task out due to __balance_push_cpu_stop(). Clearly it fails,
> > because the rcu_preempt task is still on CPU 3, and the whole thing is
> > livelocked retrying the same thing.
> > 
> > I am not sure if the migration thread is trying to migrate rcu_preempt
> > because the CPU is going down or because a higher priority RT task is
> > running.
> > 
> > rcu_pree-13        3d..4. 2936723336us : sched_waking: comm=migration/3 pid=34 prio=0 target_cpu=003
> > rcu_pree-13        3d..2. 2936723338us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/3 next_pid=34 next_prio=0
> > migratio-34        3..... 2936723341us : cpu_stopper_thread: cpu_stop: caller=balance_push+0xc9/0x150 fn=__balance_push_cpu_stop+0x0/0x1c0 on cpu 3
> > migratio-34        3d..2. 2936723345us : sched_switch: prev_comm=migration/3 prev_pid=34 prev_prio=0 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
> > <idle>-0         3d..2. 2936723676us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
> > rcu_pree-13        3d..4. 2936723678us : sched_waking: comm=migration/3 pid=34 prio=0 target_cpu=003
> > rcu_pree-13        3d..2. 2936723680us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/3 next_pid=34 next_prio=0
> > migratio-34        3..... 2936723683us : cpu_stopper_thread: cpu_stop: caller=balance_push+0xc9/0x150 fn=__balance_push_cpu_stop+0x0/0x1c0 on cpu 3
> > 
> > I am digging deeper to see why the rcu_preempt thread cannot be pushed out,
> > and then I'll also look at why it is being pushed out in the first place.
> > 
> > At least I have a strong repro now running 5 instances of TREE03 in parallel
> > for several hours.
> 
> Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> the problem still occurs?  If yes, then there is definitely some reason
> other than CPU hotplug that makes this happen.

Sure, I'll do this now. If the problem disappears, I'll try the
cpus_read_lock() suggestion you mentioned. IMHO this is increasingly looking
like a scheduling issue with RT: with too many RT tasks running at the same
time, the rcu_preempt thread spends more time migrating around looking for a
CPU than it does getting any work done.
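
The run I have in mind is roughly the following, reusing the bootargs and
trace_events arrays quoted earlier in the thread. This is just a sketch: the
kvm.sh flags and the "N*CFG" repeat syntax are from memory, so they should be
double-checked against the in-tree rcutorture documentation.

# Join the trace events into the comma-separated list that the
# trace_event= boot parameter expects.
trace_evt="$(IFS=,; echo "${trace_events[*]}")"

# Run several TREE03 instances in parallel with CPU hotplug disabled via
# rcutorture.onoff_interval=0, as suggested.
tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus \
    --configs "5*TREE03" --duration 4h \
    --bootargs "${bootargs[*]} trace_event=${trace_evt} rcutorture.onoff_interval=0"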

This morning's traces show that __balance_push_cpu_stop() actually
successfully moves the rcu_preempt task off the CPU. But a few ms later, it
is back on the same CPU again, only to have the migration thread try to move
it out again. Maybe the cpuhp threads are fighting with it. There are also
all the rcutorture boost kthreads running at the same time during the
failure, but those are lower priority than the rcu_preempt thread in this
test.
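
To see what keeps pulling it back, I'll probably extend the trace_events
array from earlier in the thread with something like the following (the event
names are from memory, so they should be checked against
/sys/kernel/tracing/available_events before relying on them):

trace_events+=(
    "sched:sched_migrate_task"   # where rcu_preempt gets placed and pulled back
    "cpuhp:cpuhp_enter"          # CPU hotplug state machine activity
    "cpuhp:cpuhp_exit"
)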

Also, I am switching to some day-job things while the onoff_interval=0 test
runs. Thanks a lot, Paul, for your help! I'll continue to look at this as I
can.

thanks,

 - Joel



