Re: SRCU: kworker hung in synchronize_srcu

On Fri, Sep 29, 2023 at 6:05 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>
> Hello,
> Firstly, kudos for the detailed report and analysis. Rare failures are
> hard, and your use of crash/kdump to dig deeper into the issue is
> awesome.
>
> On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@xxxxxxxxx> wrote:
> >
> > Hi,
> >
> > We hit an SRCU hang in stable tree 5.4.203 while running a concurrent VM
> > create/destroy test; the issue appeared after several weeks of testing. We
> > don't have a way to reproduce it, it happens randomly, and this is the
> > second time we have seen it this year.
> >
> > We did some investigation with the crash tool, hoping to find clues about
> > the root cause (and eventually a way to fix or work around it). Here is
> > what we found so far; feedback from SRCU experts would be appreciated.
> >
> > [3144162.608392] INFO: task kworker/136:12:252684 blocked for more
> > than 122 seconds.
> > [3144162.615986]       Tainted: G           O  K   5.4.203-1-tlinux4-0011.1 #1
> > [3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [3144162.631162] kworker/136:12  D    0 252684      2 0x90004000
> > [3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
> > [3144162.631192] Call Trace:
> > [3144162.631202]  __schedule+0x2ee/0x660
> > [3144162.631206]  schedule+0x33/0xa0
> > [3144162.631209]  schedule_timeout+0x1c4/0x340
> > [3144162.631214]  ? update_load_avg+0x82/0x660
> > [3144162.631217]  ? raw_spin_rq_lock_nested+0x1f/0x30
> > [3144162.631218]  wait_for_completion+0x119/0x180
> > [3144162.631220]  ? wake_up_q+0x80/0x80
> > [3144162.631224]  __synchronize_srcu.part.19+0x81/0xb0
> > [3144162.631226]  ? __bpf_trace_rcu_utilization+0x10/0x10
> > [3144162.631227]  synchronize_srcu+0x5f/0xc0
> > [3144162.631236]  irqfd_shutdown+0x3c/0xb0 [kvm]
> > [3144162.631239]  ? __schedule+0x2f6/0x660
> > [3144162.631243]  process_one_work+0x19a/0x3a0
> > [3144162.631244]  worker_thread+0x37/0x3a0
> > [3144162.631247]  kthread+0x117/0x140
> > [3144162.631247]  ? process_one_work+0x3a0/0x3a0
> > [3144162.631248]  ? __kthread_cancel_work+0x40/0x40
> > [3144162.631250]  ret_from_fork+0x1f/0x30
> >
> > Test workload:
> > - Repeated concurrency testing, including VM create, service probe, and VM destroy.
> > - The VMs use virtio devices, so kvm->irq_srcu is used frequently.
> > - x86_64 machine with 191 cores.
> >
> > Crash analysis clues:
> > - The hung task runs on CPU 136, waiting for synchronize_srcu to complete.
> > - The current GP of the SRCU struct is 304 (srcu_gp_seq and srcu_gp_seq_needed).
> > - The SNP (node) tree of the SRCU struct has 2 levels.
> > - The SNP (node) tree shows all SNPs are now at GP 304 (srcu_have_cbs).
> > - The SNP (node) tree shows CPU 136 belongs to group [128,143].
> > - The SDP (per-CPU) of the SRCU struct shows matching lock and unlock counts.
> > - The SDP (per-CPU) of CPU 136 has one entry in the pending list of its segcblist.
> > - The pending entry is exactly the entry queued by the hung kworker task.
> > - The SDP (per-CPU) of CPU 136 is waiting for GP 288; the gp_seq in its segcblist is 296.
> > - The SDP (per-CPU) of CPU 136 has no entry in the DONE list of its segcblist.
> > - The SRCU workqueue and the SDP workqueue of CPU 136 are idle.
> > - More details can be found via the references at the end of this mail.
> >
> > crash> struct srcu_struct.srcu_idx,srcu_gp_seq,srcu_gp_seq_needed,srcu_gp_seq_needed_exp,sda
> > 0xffffc900f100e878
> >         srcu_idx = 76
> >         srcu_gp_seq = 304
> >         srcu_gp_seq_needed = 304
> >         srcu_gp_seq_needed_exp = 304
> >         sda = 0x5fff9ac17800
> >
> > crash> struct srcu_data 0x5fff9ac17800:136
> > [136]: ffffe8bee9817800
> > ...
> >   srcu_cblist = {
> >     head = 0xffffc900e8413dd0,
> >     tails = {0xffffe8bee9817848, 0xffffc900e8413dd0,
> > 0xffffc900e8413dd0, 0xffffc900e8413dd0},
> >     gp_seq = {0, 296, 296, 0},
>
> I am actually a bit puzzled why the DONE segment count is 0. I would
> have expected it to be >= 296 in this situation.
>
> >     len = 1,
> >     len_lazy = 0,
> >     enabled = 1 '\001',
> >     offloaded = 0 '\000'
> >   },
> >   srcu_gp_seq_needed = 288,
> >   srcu_gp_seq_needed_exp = 0,
> >   srcu_cblist_invoking = false,
> > ...
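To make the dump above easier to read, here is a small user-space sketch of
my own (not kernel code) of how I read the four rcu_segcblist segments,
assuming the 5.4-era layout where tails[i] points at the ->next field of the
last callback in segment i (or back at &head when the segment is empty), and
assuming 0xffffe8bee9817848 is &sdp->srcu_cblist.head for this sdp:

#include <stdio.h>

enum { SEG_DONE, SEG_WAIT, SEG_NEXT_READY, SEG_NEXT, SEG_NR };

static const char * const seg_name[SEG_NR] = {
        "DONE", "WAIT", "NEXT_READY", "NEXT",
};

int main(void)
{
        /* Values copied from CPU 136's srcu_data dump above. */
        unsigned long head_field = 0xffffe8bee9817848UL; /* assumed &srcu_cblist.head */
        unsigned long tails[SEG_NR] = {
                0xffffe8bee9817848UL, 0xffffc900e8413dd0UL,
                0xffffc900e8413dd0UL, 0xffffc900e8413dd0UL,
        };
        unsigned long gp_seq[SEG_NR] = { 0, 296, 296, 0 };
        unsigned long prev = head_field;

        for (int i = 0; i < SEG_NR; i++) {
                /* Segment i is empty iff its tail still points where the
                 * previous segment's tail points. */
                printf("%-10s %-13s gp_seq=%lu\n", seg_name[i],
                       tails[i] == prev ? "empty" : "has callbacks",
                       gp_seq[i]);
                prev = tails[i];
        }
        return 0;
}

If I read it right, this prints DONE as empty and the single queued callback
sitting in WAIT, tagged for seq 296, so nothing is ever handed to the
callback-invocation step.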
> >
> > Root cause (we can't reproduce the issue, so this is based on crash
> > analysis and code analysis):
> > - The leaked entry on the segcblist has an incorrect gp_seq (the SDP is waiting for 288).
> > - The leaked entry on the segcblist appears to have been "accelerated".
> > - The incorrect acceleration may be caused by srcu_invoke_callbacks, which
> >   may use an incorrect GP seq when accelerating.
> > - Here are two emulated cases that may cause this "accelerate issue":
> >
> >
> > CASE1: the entry of CPU 136 belonging to GP 288 was accelerated to GP 292
> > - [CPU 136] [GP 280-284] finished, but srcu_invoke_callbacks has not run yet.
> > - [CPU 136] [GP 284-288] starting, a new synchronize_srcu request queues an
> >   entry on the SDP.
> > - [CPU 041] [GP 284-288] starting, a new synchronize_srcu request; this workload
> >   runs faster than CPU 136 and starts the GP (rcu_seq_start).
> > - [CPU 136] [GP 284-288] starting, calls srcu_funnel_gp_start and finds no need
> >   to start a GP.
> > - [CPU 136] [GP 280-284] finished, srcu_invoke_callbacks finally runs and
> >   "accelerates" the seq of the newly added entry to 292 (it should be 288).
>
> But srcu_gp_seq is at 304 right now. How does it matter that the CB is
> marked for 292? It should be ready to execute anyway even at 292. Note
> the meaning of "acceleration": the idea is to start conservatively and
> move the callbacks forward as more accurate information becomes
> available. Considering this, 292 initially should be OK IMHO (that's
> just more conservative than 288).
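To make that concrete: whether the callback is tagged 288, 292, or even 296,
it should count as ready once the completed sequence has reached 304. A tiny
sketch of the wrap-safe comparison (my own, mirroring what I believe the
kernel's ULONG_CMP_LT() macro does):

#include <stdio.h>

/* Wrap-safe "a < b" on sequence counters, as ULONG_CMP_LT() is assumed to do. */
#define ULONG_CMP_LT(a, b)  ((unsigned long)(a) - (unsigned long)(b) > (~0UL / 2))

int main(void)
{
        unsigned long completed = 304;          /* ssp->srcu_gp_seq in the dump */
        unsigned long tags[] = { 288, 292, 296 };

        for (int i = 0; i < 3; i++)
                printf("cb tagged %lu: %s at seq %lu\n", tags[i],
                       ULONG_CMP_LT(completed, tags[i]) ? "still waiting"
                                                        : "ready",
                       completed);
        return 0;
}

All three print "ready", so the 288-vs-292 difference alone should not be
what strands the callback.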
>
> > CASE2: the entry of CPU 136 belonging to GP 288 was accelerated to GP 296
> > - [CPU 136] [GP 280-284] finished, but srcu_invoke_callbacks has not run yet.
> > - [CPU 136] [GP 284-288] starting, a new synchronize_srcu request queues an
> >   entry on the SDP.
> > - [CPU 041] [GP 284-288] starting, a new synchronize_srcu request; this workload
> >   runs faster than CPU 136 and starts the GP.
> > - [CPU 041] [GP 284-288] finished; the workload runs faster and finishes the GP.
> > - [CPU 136] [GP 284-288] starting, calls srcu_funnel_gp_start and finds no need
> >   to start a GP.
> > - [CPU 017] [GP 288-292] starting, a new synchronize_srcu request starts the GP
> >   (rcu_seq_start).
> > - [CPU 136] [GP 280-284] finished, srcu_invoke_callbacks finally runs and
> >   "accelerates" the newly added entry to 296 (it should be 288).
>
> I had the same analysis here.
>
> > /*
> >  * CASE1
> >  * - entry of CPU 136 expecting GP 288 was accelerated to GP 292
> >  * - GP0: 280-284, GP1: 284-288.
> >  */
> >
> > /* [GP0][CPU-136] */
> > process_srcu {
> >  srcu_gp_end
> > }
> >                 /* [GP1][CPU-136] */
> >                 synchronize_srcu {
> >                  __call_srcu {
> >                   rcu_segcblist_enqueue
> >                                          /* [GP1][CPU-041] */
> >                                          synchronize_srcu {
> >                                           __call_srcu {
> >                                            srcu_funnel_gp_start
> >                                             srcu_gp_start
> >                                              rcu_seq_start
> >                                           }
> >                                          }
> >                 /* [GP1][CPU-136] */
> >                   srcu_funnel_gp_start
> >                  }
> >                 }
> > /* [GP0][CPU-136] */
> > srcu_invoke_callbacks {
> >  rcu_segcblist_accelerate
> > }
> >
> >
> > /*
> >  * CASE2
> >  * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296.
> >  * - GP0: 280-284, GP1: 284-288, GP2: 288-292.
> >  */
> >
> > /* [GP0][CPU-136] */
> > process_srcu {
> >  srcu_gp_end
> > }
> >
> >                   /* [GP1][CPU-136] */
> >                   synchronize_srcu {
> >                    __call_srcu {
> >                     rcu_segcblist_enqueue
> >                                            /* [GP1][CPU-041] */
> >                                            synchronize_srcu {
> >                                             __call_srcu {
> >                                              srcu_funnel_gp_start
> >                                               srcu_gp_start
> >                                             }
> >                                            }
> >                                            process_srcu {
> >                                             srcu_gp_end
> >                                              rcu_seq_end
> >                                            }
> >                   /* [GP1][CPU-136] */
> >                     srcu_funnel_gp_start
> >                    }
> >                   }
> > /* [GP0][CPU-136] */
> > srcu_invoke_callbacks {
>
> If srcu_invoke_callbacks() was really called for the rdp, I would have
> expected rcu_segcblist_advance() to advance all those pending
> callbacks to 304.
>
> I posit that srcu_invoke_callbacks() is probably not even being called
> in the first place, otherwise why is the DONE segment count still 0?
> DONE should not be 0, otherwise callbacks will not run. The segcblist
> of the rdp seems out of date, perhaps because srcu_invoke_callbacks()
> is not being called due to some other bug (or a workqueue/timer issue
> causing srcu_invoke_callbacks() not to run?).
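Following that reasoning, here is a toy model (mine, not the kernel's actual
list-splicing code) of the effect rcu_segcblist_advance() should have had on
CPU 136's list if srcu_invoke_callbacks() had really run with the grace
period already at 304: every segment whose gp_seq has completed is folded
into DONE.

#include <stdio.h>

#define ULONG_CMP_LT(a, b)  ((unsigned long)(a) - (unsigned long)(b) > (~0UL / 2))

enum { SEG_DONE, SEG_WAIT, SEG_NEXT_READY, SEG_NEXT, SEG_NR };

int main(void)
{
        unsigned long completed = 304;                 /* ssp->srcu_gp_seq     */
        unsigned long gp_seq[SEG_NR] = { 0, 296, 296, 0 };
        int ncbs[SEG_NR] = { 0, 1, 0, 0 };             /* one cb in WAIT       */

        for (int i = SEG_WAIT; i <= SEG_NEXT_READY; i++) {
                if (ULONG_CMP_LT(completed, gp_seq[i]))
                        break;                         /* this GP not done yet */
                ncbs[SEG_DONE] += ncbs[i];             /* fold into DONE       */
                ncbs[i] = 0;
        }
        printf("after advance at %lu: DONE=%d WAIT=%d\n",
               completed, ncbs[SEG_DONE], ncbs[SEG_WAIT]);
        return 0;
}

This prints DONE=1 WAIT=0, i.e. the hung callback would have become
invocable, which is consistent with the suspicion that the advance/invoke
step simply never ran.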
>
> I wonder if you would consider trying to reproduce it with the
> rcutorture module's SRCU testing, on the chance that it occurs with it
> on your setup.
As an enthusiast, I have an idea: could we use the ftrace mechanism to
capture a trace just before the stall for more detailed analysis?
Recently I used cyclictest's --tracemark option to analyze the system's
ftrace output just before cyclictest's -b latency threshold was
exceeded. [1]

Thanks
Zhouyi
[1] https://man.archlinux.org/man/cyclictest.8.en
>
> thanks,
>
>  - Joel
>
>
> >                                                       /* [GP2][CPU-017] */
> >                                                       synchronize_srcu {
> >                                                        __call_srcu {
> >                                                         srcu_funnel_gp_start
> >                                                          srcu_gp_start
> >                                                           rcu_seq_start
> >                                                        }
> >                                                       }
> > /* [GP0][CPU-136] */
> >  rcu_segcblist_accelerate
> > }
> >
> >
> > Potential fix:
> > We shouldn't accelerate the segcblist in srcu_invoke_callbacks based on a
> > snapshot of the current gp seq of the SRCU struct; how about accelerating
> > based on sdp->srcu_gp_seq_needed instead, maybe like this:
> >
> >
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 21115ffb6c44..ff8579abc924 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct work_struct *work)
> >          spin_lock_irq_rcu_node(sdp);
> >          rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs);
> >          (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> > -                                       rcu_seq_snap(&ssp->srcu_gp_seq));
> > +                                       rcu_seq_snap(&sdp->srcu_gp_seq_needed));
> >          sdp->srcu_cblist_invoking = false;
> >          more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
> >          spin_unlock_irq_rcu_node(sdp);
>
> I think "rcu_seq_snap(&ssp->srcu_gp_seq))" is correct IMHO,
> rcu_seq_snap() says which what's the gp_seq at which a full GP has
> passed from the moment rcu_seq_snap() is called. That's precisely when
> it is safe to call all those non-accelerated callbacks.
>
> Basically, if any of the bits are set in the state portion of the
> gp_seq number, then it returns (COUNTER_PORTION + 2) << RCU_SEQ_CTR_SHIFT.
> Otherwise, it returns (COUNTER_PORTION + 1) << RCU_SEQ_CTR_SHIFT.
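A quick user-space check of that arithmetic with the numbers from CASE1/CASE2
(my own sketch of rcu_seq_snap()'s math as I understand the 5.4 code, where
RCU_SEQ_CTR_SHIFT is 2; the real helper also issues a memory barrier):

#include <stdio.h>

#define RCU_SEQ_CTR_SHIFT  2
#define RCU_SEQ_STATE_MASK ((1UL << RCU_SEQ_CTR_SHIFT) - 1)

/* Sequence at which a full grace period will have elapsed after "seq". */
static unsigned long seq_snap(unsigned long seq)
{
        return (seq + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
}

int main(void)
{
        printf("snap(288) = %lu\n", seq_snap(288)); /* GP idle:    -> 292 */
        printf("snap(289) = %lu\n", seq_snap(289)); /* GP running: -> 296 */
        return 0;
}

So an acceleration tag of 292 or 296, depending on whether a grace period was
in flight at snapshot time, looks expected rather than buggy, which matches
the reasoning above.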
>
>
>
>
>
> >
> >
> > Thanks for reading!
> >
> >
> > Reference detail log of crash dump information:
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/bt.f.252684.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.partial.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.raw.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.lock.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.raw.log



