On Fri, Sep 29, 2023 at 3:04 AM zhuangel570 <zhuangel570@xxxxxxxxx> wrote:
>
> Hi,
>
> We encountered an SRCU hang issue in stable tree 5.4.203 while running a
> concurrent VM create/destroy test; the hang appeared after several weeks.
> We still do not have a way to reproduce the issue, it happens randomly,
> and this is the second time we have hit it this year.
>
> We did some investigation with the crash tool, hoping to get some useful
> clues about the issue and to find its root cause (and then a fix or
> workaround). Here is our investigation so far; feedback from the SRCU
> experts would be much appreciated.
>
> [3144162.608392] INFO: task kworker/136:12:252684 blocked for more than 122 seconds.
> [3144162.615986] Tainted: G O K 5.4.203-1-tlinux4-0011.1 #1
> [3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [3144162.631162] kworker/136:12 D 0 252684 2 0x90004000
> [3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
> [3144162.631192] Call Trace:
> [3144162.631202]  __schedule+0x2ee/0x660
> [3144162.631206]  schedule+0x33/0xa0
> [3144162.631209]  schedule_timeout+0x1c4/0x340
> [3144162.631214]  ? update_load_avg+0x82/0x660
> [3144162.631217]  ? raw_spin_rq_lock_nested+0x1f/0x30
> [3144162.631218]  wait_for_completion+0x119/0x180
> [3144162.631220]  ? wake_up_q+0x80/0x80
> [3144162.631224]  __synchronize_srcu.part.19+0x81/0xb0
> [3144162.631226]  ? __bpf_trace_rcu_utilization+0x10/0x10
> [3144162.631227]  synchronize_srcu+0x5f/0xc0
> [3144162.631236]  irqfd_shutdown+0x3c/0xb0 [kvm]
> [3144162.631239]  ? __schedule+0x2f6/0x660
> [3144162.631243]  process_one_work+0x19a/0x3a0
> [3144162.631244]  worker_thread+0x37/0x3a0
> [3144162.631247]  kthread+0x117/0x140
> [3144162.631247]  ? process_one_work+0x3a0/0x3a0
> [3144162.631248]  ? __kthread_cancel_work+0x40/0x40
> [3144162.631250]  ret_from_fork+0x1f/0x30
>
> Test workload:
> - Repeated concurrency testing: create VM, probe services, destroy VM.
> - The VMs use virtio devices, so kvm->irq_srcu is used frequently.
> - x86_64 machine with 191 cores.
>
> Crash analysis clues:
> - The hung task ran on CPU 136, waiting for synchronize_srcu to complete.
> - The current GP of the srcu_struct is 304 (srcu_gp_seq and srcu_gp_seq_needed).
> - The SNP (node) tree of the srcu_struct has 2 levels.
> - The SNP (node) tree shows all nodes are now at GP 304 (srcu_have_cbs).
> - The SNP (node) tree shows CPU 136 belongs to group [128,143].
> - The SDP (per-CPU data) of CPU 136 shows matching lock and unlock counts.
> - The SDP segcblist of CPU 136 has one entry on the pending list.
> - That pending entry is exactly the callback queued by the hung kworker task.
> - The SDP of CPU 136 is waiting for GP 288, while the gp_seq of its segcblist is 296.
> - The SDP segcblist of CPU 136 has no entry on the DONE list.
> - The workqueues of the srcu_struct and of the SDP of CPU 136 are idle.
> - More details can be found via the references at the end of this mail.
>
> crash> struct srcu_struct.srcu_idx,srcu_gp_seq,srcu_gp_seq_needed,srcu_gp_seq_needed_exp,sda 0xffffc900f100e878
>   srcu_idx = 76
>   srcu_gp_seq = 304
>   srcu_gp_seq_needed = 304
>   srcu_gp_seq_needed_exp = 304
>   sda = 0x5fff9ac17800
>
> crash> struct srcu_data 0x5fff9ac17800:136
> [136]: ffffe8bee9817800
> ...
>   srcu_cblist = {
>     head = 0xffffc900e8413dd0,
>     tails = {0xffffe8bee9817848, 0xffffc900e8413dd0, 0xffffc900e8413dd0, 0xffffc900e8413dd0},
>     gp_seq = {0, 296, 296, 0},
>     len = 1,
>     len_lazy = 0,
>     enabled = 1 '\001',
>     offloaded = 0 '\000'
>   },
>   srcu_gp_seq_needed = 288,
>   srcu_gp_seq_needed_exp = 0,
>   srcu_cblist_invoking = false,
> ...
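
To help decode the srcu_data dump above: the rcu_segcblist segments are indexed
as in include/linux/rcu_segcblist.h. Below is a compile-only sketch of the
v5.4-era layout (the same fields crash prints above), with the dumped values
annotated; the stub rcu_head and u8 definitions are only there to make the
sketch self-contained and are not kernel code.

/*
 * Compile-only sketch of struct rcu_segcblist as it looks in v5.4,
 * annotated with the values from the CPU 136 dump above.
 */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};
typedef unsigned char u8;

#define RCU_DONE_TAIL        0  /* callbacks whose GP has already ended   */
#define RCU_WAIT_TAIL        1  /* callbacks waiting for the current GP   */
#define RCU_NEXT_READY_TAIL  2  /* callbacks waiting for the next GP      */
#define RCU_NEXT_TAIL        3  /* callbacks not yet associated with a GP */
#define RCU_CBLIST_NSEGS     4

struct rcu_segcblist {
	struct rcu_head *head;                     /* 0xffffc900e8413dd0: the single pending cb. */
	struct rcu_head **tails[RCU_CBLIST_NSEGS]; /* tails[RCU_DONE_TAIL] == &head here, so the
	                                            * DONE segment is empty; tails[1..3] all point
	                                            * at the callback's ->next, so that callback
	                                            * is the only entry and sits in WAIT.         */
	unsigned long gp_seq[RCU_CBLIST_NSEGS];    /* gp_seq[1] == gp_seq[2] == 296: the callback
	                                            * is tagged to wait for GP 296, although this
	                                            * sdp only ever requested GP 288.             */
	long len;
	long len_lazy;
	u8 enabled;
	u8 offloaded;
};
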
> Root cause (we can't reproduce the issue, so this is based on crash analysis
> and code analysis):
> - The leaked entry on the segcblist has an incorrect gp_seq (the SDP is waiting for 288).
> - The leaked entry on the segcblist seems to have been "accelerated".
> - The incorrect acceleration may be caused by srcu_invoke_callbacks, which may
>   use an incorrect GP seq when accelerating.
> - Here are two reconstructed scenarios that could cause this "accelerate issue"
>   (see the numeric sketch after the call traces below):
>
>
> CASE1: the entry of CPU 136 belonging to GP 288 was accelerated to GP 292
> - [CPU 136] [GP 280-284] finished, but srcu_invoke_callbacks has not run yet.
> - [CPU 136] [GP 284-288] starting: new synchronize_srcu request, entry queued
>   to the SDP.
> - [CPU 041] [GP 284-288] starting: new synchronize_srcu request; this CPU runs
>   faster than CPU 136, starts the GP, calls rcu_seq_start.
> - [CPU 136] [GP 284-288] starting: calls srcu_funnel_gp_start, finds no need
>   to start a GP.
> - [CPU 136] [GP 280-284] finished: srcu_invoke_callbacks finally runs and
>   "accelerates" the newly added entry to 292 (it should be 288).
>
>
> CASE2: the entry of CPU 136 belonging to GP 288 was accelerated to GP 296
> - [CPU 136] [GP 280-284] finished, but srcu_invoke_callbacks has not run yet.
> - [CPU 136] [GP 284-288] starting: new synchronize_srcu request, entry queued
>   to the SDP.
> - [CPU 041] [GP 284-288] starting: new synchronize_srcu request; this CPU runs
>   faster than CPU 136 and starts the GP.
> - [CPU 041] [GP 284-288] finished: this CPU, still running faster, finishes the GP.
> - [CPU 136] [GP 284-288] starting: calls srcu_funnel_gp_start, finds no need
>   to start a GP.
> - [CPU 017] [GP 288-292] starting: new synchronize_srcu request, starts the GP,
>   calls rcu_seq_start.
> - [CPU 136] [GP 280-284] finished: srcu_invoke_callbacks finally runs and
>   "accelerates" the newly added entry to 296 (it should be 288).
>
>
> /*
>  * CASE1
>  * - entry of CPU 136 expecting GP 288 was accelerated to GP 292.
>  * - GP0: 280-284, GP1: 284-288.
>  */
>
> /* [GP0][CPU-136] */
> process_srcu {
>   srcu_gp_end
> }
> /* [GP1][CPU-136] */
> synchronize_srcu {
>   __call_srcu {
>     rcu_segcblist_enqueue
>       /* [GP1][CPU-041] */
>       synchronize_srcu {
>         __call_srcu {
>           srcu_funnel_gp_start
>             srcu_gp_start
>               rcu_seq_start
>         }
>       }
>     /* [GP1][CPU-136] */
>     srcu_funnel_gp_start
>   }
> }
> /* [GP0][CPU-136] */
> srcu_invoke_callbacks {
>   rcu_segcblist_accelerate
> }
>
>
> /*
>  * CASE2
>  * - entry of CPU 136 expecting GP 288 was accelerated to GP 296.
>  * - GP0: 280-284, GP1: 284-288, GP2: 288-292.
>  */
>
> /* [GP0][CPU-136] */
> process_srcu {
>   srcu_gp_end
> }
>
> /* [GP1][CPU-136] */
> synchronize_srcu {
>   __call_srcu {
>     rcu_segcblist_enqueue
>       /* [GP1][CPU-041] */
>       synchronize_srcu {
>         __call_srcu {
>           srcu_funnel_gp_start
>             srcu_gp_start
>         }
>       }
>       process_srcu {
>         srcu_gp_end
>           rcu_seq_end
>       }
>     /* [GP1][CPU-136] */
>     srcu_funnel_gp_start
>   }
> }
> /* [GP0][CPU-136] */
> srcu_invoke_callbacks {
>   /* [GP2][CPU-017] */
>   synchronize_srcu {
>     __call_srcu {
>       srcu_funnel_gp_start
>         srcu_gp_start
>           rcu_seq_start
>     }
>   }
>   /* [GP0][CPU-136] */
>   rcu_segcblist_accelerate
> }
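
To put numbers on the two scenarios, here is a small user-space model of the
grace-period sequence arithmetic from kernel/rcu/rcu.h (READ_ONCE and the
memory barriers are dropped, so only the arithmetic is modeled, nothing else).
It shows that rcu_seq_snap() taken against the global srcu_gp_seq yields 292 in
CASE1 and 296 in CASE2, instead of the 288 that CPU 136's sdp is actually
waiting for; CASE2's 296 is the value seen in the crash dump above.

#include <stdio.h>

/* From kernel/rcu/rcu.h: the low bits of a gp_seq carry the GP phase. */
#define RCU_SEQ_CTR_SHIFT  2
#define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1)

/* Phase of the current GP: 0 means idle, non-zero means a GP is in flight. */
static unsigned long rcu_seq_state(unsigned long s)
{
	return s & RCU_SEQ_STATE_MASK;
}

/* gp_seq value at which a full GP starting after this call will have ended. */
static unsigned long rcu_seq_snap(unsigned long *sp)
{
	return (*sp + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
}

int main(void)
{
	unsigned long at_enqueue = 284; /* GP 280-284 done, idle: __call_srcu computes 288  */
	unsigned long case1 = 285;      /* GP 284-288 started by CPU 041 (rcu_seq_start)    */
	unsigned long case2 = 289;      /* GP 284-288 done, GP 288-292 started by CPU 017   */

	printf("enqueue: snap(%lu) = %lu\n", at_enqueue, rcu_seq_snap(&at_enqueue)); /* 288 */
	printf("CASE1:   snap(%lu) = %lu\n", case1, rcu_seq_snap(&case1));           /* 292 */
	printf("CASE2:   snap(%lu) = %lu\n", case2, rcu_seq_snap(&case2));           /* 296 */
	printf("state(%lu) = %lu (GP in flight)\n", case1, rcu_seq_state(case1));    /* 1   */
	return 0;
}
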
>
>
> Potential fix:
> We shouldn't accelerate the segcblist in srcu_invoke_callbacks based on a
> snapshot of the current GP seq of the srcu_struct. How about accelerating the
> sdp based on sdp->srcu_gp_seq_needed instead, maybe like this:
>
>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 21115ffb6c44..ff8579abc924 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct work_struct *work)
>         spin_lock_irq_rcu_node(sdp);
>         rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs);
>         (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> -                                      rcu_seq_snap(&ssp->srcu_gp_seq));
> +                                      rcu_seq_snap(&sdp->srcu_gp_seq_needed));

Can you please explain why rcu_seq_snap() is needed here?

Maybe you also need to advance cbs, to ensure that the following call to
rcu_segcblist_ready_cbs() picks up these callbacks, in case
srcu_invoke_callbacks() is called after multiple GPs have elapsed.

    rcu_segcblist_advance(&sdp->srcu_cblist,
                          rcu_seq_current(&ssp->srcu_gp_seq));
    ...
    more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);

Thanks
Neeraj

>         sdp->srcu_cblist_invoking = false;
>         more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
>         spin_unlock_irq_rcu_node(sdp);
>
>
> Thanks for reading!
>
>
> Reference: detailed logs from the crash dump:
> https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/bt.f.252684.log
> https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.partial.log
> https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.raw.log
> https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.lock.log
> https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.raw.log
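
For completeness, below is an untested sketch of what the tail of
srcu_invoke_callbacks() might look like with the two suggestions from this
thread combined: zhuangel570's accelerate against sdp->srcu_gp_seq_needed and
Neeraj's additional rcu_segcblist_advance(). The ordering (advance before
accelerate) follows the pattern used in __call_srcu(); this is only an
illustration of the discussion, not a reviewed or tested patch.

	/* Sketch only: tail of srcu_invoke_callbacks(), combining both ideas above. */
	spin_lock_irq_rcu_node(sdp);
	rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs);
	/*
	 * Move callbacks whose grace period has already elapsed to the DONE
	 * segment, so rcu_segcblist_ready_cbs() below sees them even if
	 * several GPs went by before this work item ran (Neeraj's point).
	 */
	rcu_segcblist_advance(&sdp->srcu_cblist,
			      rcu_seq_current(&ssp->srcu_gp_seq));
	/*
	 * Tag remaining callbacks with the GP this sdp actually requested,
	 * rather than a fresh snapshot of the global srcu_gp_seq, which may
	 * already have raced ahead (zhuangel570's proposal).
	 */
	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
				       rcu_seq_snap(&sdp->srcu_gp_seq_needed));
	sdp->srcu_cblist_invoking = false;
	more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
	spin_unlock_irq_rcu_node(sdp);
	if (more)
		srcu_schedule_cbs_sdp(sdp, 0);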