On Sun, Sep 10, 2023 at 5:16 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Sun, Sep 10, 2023 at 08:14:45PM +0000, Joel Fernandes wrote:
[...]
> > > I have been running into another intermittent one as well, which
> > > is the boost failure, and that happens once in 10-15 runs or so.
> > >
> > > I was thinking of running the following configuration on an automated
> > > regular basis to at least provide a better clue on the lucky run that
> > > catches an issue. But then the issue is that it would change timing
> > > enough to maybe hide bugs. I could also make it submit logs
> > > automatically to the list on such occurrences, but one step at a time
> > > and all that. I do need to add (hopefully less noisy) tick/timer
> > > related trace events.
> > >
> > > # Define the bootargs array.
> > > bootargs=(
> > >     "ftrace_dump_on_oops"
> > >     "panic_on_warn=1"
> > >     "sysctl.kernel.panic_on_rcu_stall=1"
> > >     "sysctl.kernel.max_rcu_stall_to_panic=1"
> > >     "trace_buf_size=10K"
> > >     "traceoff_on_warning=1"
> > >     "panic_print=0x1f"   # To dump held locks, mem and other info.
> > > )
> > > # Define the trace events array passed to bootargs.
> > > trace_events=(
> > >     "sched:sched_switch"
> > >     "sched:sched_waking"
> > >     "rcu:rcu_callback"
> > >     "rcu:rcu_fqs"
> > >     "rcu:rcu_quiescent_state_report"
> > >     "rcu:rcu_grace_period"
> > > )
> >
> > So some insight on this boost failure. Just before the boost failures
> > are reported, I see the migration thread interfering with the
> > rcu_preempt thread (aka the GP kthread). See trace below. Of note is
> > that the rcu_preempt thread is runnable while context switching, which
> > means its execution is being interfered with. The rcu_preempt thread
> > is at RT prio 2, as can be seen.
> >
> > So some open-ended questions: what exactly does the migration thread
> > want? Is this something related to CPU hotplug? And if the migration
> > thread had to run, why did the rcu_preempt thread not get pushed to
> > another CPU by the scheduler? We have 16 vCPUs for this test.
>
> Maybe we need a cpus_read_lock() before doing a given boost-test interval
> and a cpus_read_unlock() after finishing one? But much depends on
> exactly what is starting those migration threads.

But in the field, a real RT task can preempt a reader without doing
cpus_read_lock() and may run into a similar boost issue?

> Then again, TREE03 is pretty aggressive about doing CPU hotplug.

Ok. I put a trace_printk() in the stopper thread to see what the ->fn()
is. I'm doing another run to see what falls out.

thanks,

 - Joel
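
P.S. For context on the "RT prio 2" mentioned above: the GP kthread's
priority comes from the rcutree.kthread_prio boot parameter, and the
TREE03 scenario boots with kthread_prio=2. A simplified sketch of how
that priority gets applied, modeled on rcu_spawn_gp_kthread() in
kernel/rcu/tree.c (surrounding details elided):

    #include <linux/sched.h>
    #include <uapi/linux/sched/types.h>

    /* Put the GP kthread at SCHED_FIFO priority kthread_prio, if set. */
    static void set_gp_kthread_prio(struct task_struct *t, int kthread_prio)
    {
            struct sched_param sp = { .sched_priority = kthread_prio };

            if (kthread_prio)
                    sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
    }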
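
And for reference, a minimal sketch of what the cpus_read_lock() idea
could look like around one boost-test interval. Note that
do_one_boost_interval() is a hypothetical stand-in for the real
rcutorture boost-interval body, not an actual kernel function:

    #include <linux/cpu.h>

    /* Hypothetical stand-in for rcutorture's boost-interval body. */
    extern void do_one_boost_interval(void);

    static void boost_interval_excluding_hotplug(void)
    {
            cpus_read_lock();       /* Hold off CPU hotplug (and its
                                       stopper/migration work)... */
            do_one_boost_interval();
            cpus_read_unlock();     /* ...until the interval completes. */
    }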
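
The stopper-thread instrumentation is along these lines -- a helper of
the shape cpu_stopper_thread() in kernel/stop_machine.c uses when it
runs a queued work item, with the trace_printk() added (illustrative,
not the literal patch):

    #include <linux/kernel.h>
    #include <linux/stop_machine.h>

    /* Log which callback the stopper is about to run; %ps resolves the
     * function pointer to a symbol name, e.g. migration_cpu_stop. */
    static int run_stop_work(struct cpu_stop_work *work, unsigned int cpu)
    {
            trace_printk("stopper/%u: calling %ps(%p)\n",
                         cpu, work->fn, work->arg);
            return work->fn(work->arg);
    }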