On Thu, Oct 27, 2022 at 11:03:53AM +0800, Pingfan Liu wrote: > On Wed, Oct 26, 2022 at 06:55:52AM -0700, Paul E. McKenney wrote: > > On Wed, Oct 26, 2022 at 04:21:26PM +0800, Pingfan Liu wrote: > > > On Wed, Oct 26, 2022 at 12:36 PM Zhang, Qiang1 <qiang1.zhang@xxxxxxxxx> wrote: > > > > > > > > > > > > > >commit 994f706872e6 ("srcu: Make Tree SRCU able to operate without > > > > >snp_node array") assumes that cpu 0 is always online, but that is not > > > > >the truth when using maxcpus=1 in the command line for the kdump kernel. > > > > > > > > > >On a PowerPC, the following kdump kernel hanging is observed: > > > > >... > > > > >[ 1.740036] systemd[1]: Hostname set to <xyz.com> > > > > >[ 243.686240] INFO: task systemd:1 blocked for more than 122 seconds. > > > > >[ 243.686264] Not tainted 6.1.0-rc1 #1 > > > > >[ 243.686272] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > > >[ 243.686281] task:systemd state:D stack:0 pid:1 ppid:0 flags:0x00042000 > > > > >[ 243.686296] Call Trace: > > > > >[ 243.686301] [c000000016657640] [c000000016657670] 0xc000000016657670 (unreliable) > > > > >[ 243.686317] [c000000016657830] [c00000001001dec0] __switch_to+0x130/0x220 > > > > >[ 243.686333] [c000000016657890] [c000000010f607b8] __schedule+0x1f8/0x580 > > > > >[ 243.686347] [c000000016657940] [c000000010f60bb4] schedule+0x74/0x140 > > > > >[ 243.686361] [c0000000166579b0] [c000000010f699b8] schedule_timeout+0x168/0x1c0 > > > > >[ 243.686374] [c000000016657a80] [c000000010f61de8] __wait_for_common+0x148/0x360 > > > > >[ 243.686387] [c000000016657b20] [c000000010176bb0] __flush_work.isra.0+0x1c0/0x3d0 > > > > >[ 243.686401] [c000000016657bb0] [c0000000105f2768] fsnotify_wait_marks_destroyed+0x28/0x40 > > > > >[ 243.686415] [c000000016657bd0] [c0000000105f21b8] fsnotify_destroy_group+0x68/0x160 > > > > >[ 243.686428] [c000000016657c40] [c0000000105f6500] inotify_release+0x30/0xa0 > > > > >[ 243.686440] [c000000016657cb0] [c0000000105751a8] __fput+0xc8/0x350 > > > > >[ 243.686452] [c000000016657d00] [c00000001017d524] task_work_run+0xe4/0x170 > > > > >[ 243.686464] [c000000016657d50] [c000000010020e94] do_notify_resume+0x134/0x140 > > > > >[ 243.686478] [c000000016657d80] [c00000001002eb18] interrupt_exit_user_prepare_main+0x198/0x270 > > > > >[ 243.686493] [c000000016657de0] [c00000001002ec60] syscall_exit_prepare+0x70/0x180 > > > > >[ 243.686505] [c000000016657e10] [c00000001000bf7c] system_call_vectored_common+0xfc/0x280 > > > > >[ 243.686520] --- interrupt: 3000 at 0x7fffa47d5ba4 > > > > >[ 243.686528] NIP: 00007fffa47d5ba4 LR: 0000000000000000 CTR: 0000000000000000 > > > > >[ 243.686538] REGS: c000000016657e80 TRAP: 3000 Not tainted (6.1.0-rc1) > > > > >[ 243.686548] MSR: 800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE> CR: 42044440 XER: 00000000 > > > > >[ 243.686572] IRQMASK: 0 > > > > >[ 243.686572] GPR00: 0000000000000006 00007ffffa606710 00007fffa48e7200 0000000000000000 > > > > >[ 243.686572] GPR04: 0000000000000002 000000000000000a 0000000000000000 0000000000000001 > > > > >[ 243.686572] GPR08: 000001000c172dd0 0000000000000000 0000000000000000 0000000000000000 > > > > >[ 243.686572] GPR12: 0000000000000000 00007fffa4ff4bc0 0000000000000000 0000000000000000 > > > > >[ 243.686572] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > > > > >[ 243.686572] GPR20: 0000000132dfdc50 000000000000000e 0000000000189375 0000000000000000 > > > > >[ 243.686572] GPR24: 00007ffffa606ae0 0000000000000005 000001000c185490 000001000c172570 > > > > >[ 243.686572] GPR28: 000001000c172990 000001000c184850 000001000c172e00 00007fffa4fedd98 > > > > >[ 243.686683] NIP [00007fffa47d5ba4] 0x7fffa47d5ba4 > > > > >[ 243.686691] LR [0000000000000000] 0x0 > > > > >[ 243.686698] --- interrupt: 3000 > > > > >[ 243.686708] INFO: task kworker/u16:1:24 blocked for more than 122 seconds. > > > > >[ 243.686717] Not tainted 6.1.0-rc1 #1 > > > > >[ 243.686724] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > > >[ 243.686733] task:kworker/u16:1 state:D stack:0 pid:24 ppid:2 flags:0x00000800 > > > > >[ 243.686747] Workqueue: events_unbound fsnotify_mark_destroy_workfn > > > > >[ 243.686758] Call Trace: > > > > >[ 243.686762] [c0000000166736e0] [c00000004fd91000] 0xc00000004fd91000 (unreliable) > > > > >[ 243.686775] [c0000000166738d0] [c00000001001dec0] __switch_to+0x130/0x220 > > > > >[ 243.686788] [c000000016673930] [c000000010f607b8] __schedule+0x1f8/0x580 > > > > >[ 243.686801] [c0000000166739e0] [c000000010f60bb4] schedule+0x74/0x140 > > > > >[ 243.686814] [c000000016673a50] [c000000010f699b8] schedule_timeout+0x168/0x1c0 > > > > >[ 243.686827] [c000000016673b20] [c000000010f61de8] __wait_for_common+0x148/0x360 > > > > >[ 243.686840] [c000000016673bc0] [c000000010210840] __synchronize_srcu.part.0+0xa0/0xe0 > > > > >[ 243.686855] [c000000016673c30] [c0000000105f2c64] fsnotify_mark_destroy_workfn+0xc4/0x1a0 > > > > >[ 243.686868] [c000000016673ca0] [c000000010174ea8] process_one_work+0x2a8/0x570 > > > > >[ 243.686882] [c000000016673d40] [c000000010175208] worker_thread+0x98/0x5e0 > > > > >[ 243.686895] [c000000016673dc0] [c0000000101828d4] kthread+0x124/0x130 > > > > >[ 243.686908] [c000000016673e10] [c00000001000cd40] ret_from_kernel_thread+0x5c/0x64 > > > > >[ 366.566274] INFO: task systemd:1 blocked for more than 245 seconds. > > > > >[ 366.566298] Not tainted 6.1.0-rc1 #1 > > > > >[ 366.566305] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > > >[ 366.566314] task:systemd state:D stack:0 pid:1 ppid:0 flags:0x00042000 > > > > >[ 366.566329] Call Trace: > > > > >... > > > > > > > > > >In that case, note that maxcpus=1 instead of nr_cpus=1 is used in the > > > > >kernel command line on the PowerPC platform. Consequently, the crash cpu > > > > >is the only onlined cpu in the kdump kernel, but with its logical id not > > > > >necessary 0. While SRCU queues a sdp->work on cpu 0, on which no worker > > > > >thread is created, so sdp->work will be never executed and > > > > >__synchronize_srcu() can not be completed. > > > > > > > > > >Tackle this issue by queueing sdp->work on the first onlined cpu. > > > > > > > > > >Signed-off-by: Pingfan Liu <kernelfans@xxxxxxxxx> > > > > >Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxx> > > > > >Cc: Lai Jiangshan <jiangshanlai@xxxxxxxxx> > > > > >Cc: Josh Triplett <josh@xxxxxxxxxxxxxxxx> > > > > >Cc: Steven Rostedt <rostedt@xxxxxxxxxxx> > > > > >Cc: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> > > > > >To: rcu@xxxxxxxxxxxxxxx > > > > >--- > > > > > kernel/rcu/srcutree.c | 10 ++++++---- > > > > > 1 file changed, 6 insertions(+), 4 deletions(-) > > > > > > > > > >diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > > >index 1c304fec89c0..f2df561e0995 100644 > > > > >--- a/kernel/rcu/srcutree.c > > > > >+++ b/kernel/rcu/srcutree.c > > > > >@@ -663,7 +663,7 @@ static void srcu_gp_start(struct srcu_struct *ssp) > > > > > int state; > > > > > > > > > > if (smp_load_acquire(&ssp->srcu_size_state) < SRCU_SIZE_WAIT_BARRIER) > > > > >- sdp = per_cpu_ptr(ssp->sda, 0); > > > > >+ sdp = per_cpu_ptr(ssp->sda, cpumask_first(cpu_online_mask)); > > > > > else > > > > > sdp = this_cpu_ptr(ssp->sda); > > > > > lockdep_assert_held(&ACCESS_PRIVATE(ssp, lock)); > > > > >@@ -774,7 +774,8 @@ static void srcu_gp_end(struct srcu_struct *ssp) > > > > > /* Initiate callback invocation as needed. */ > > > > > ss_state = smp_load_acquire(&ssp->srcu_size_state); > > > > > if (ss_state < SRCU_SIZE_WAIT_BARRIER) { > > > > >- srcu_schedule_cbs_sdp(per_cpu_ptr(ssp->sda, 0), cbdelay); > > > > >+ srcu_schedule_cbs_sdp(per_cpu_ptr(ssp->sda, > > > > >+ cpumask_first(cpu_online_mask)), cbdelay); > > > > > } else { > > > > > idx = rcu_seq_ctr(gpseq) % ARRAY_SIZE(snp->srcu_have_cbs); > > > > > srcu_for_each_node_breadth_first(ssp, snp) { > > > > >@@ -1093,7 +1094,7 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, > > > > > idx = srcu_read_lock(ssp); > > > > > ss_state = smp_load_acquire(&ssp->srcu_size_state); > > > > > if (ss_state < SRCU_SIZE_WAIT_CALL) > > > > >- sdp = per_cpu_ptr(ssp->sda, 0); > > > > >+ sdp = per_cpu_ptr(ssp->sda, cpumask_first(cpu_online_mask)); > > > > > > > > Hi Pingfan > > > > > > > > If CPU0 support hotplug. > > > > For example , in this time cpumask_first(cpu_online_mask) return 0 > > > > > > > > > else > > > > > sdp = raw_cpu_ptr(ssp->sda); > > > > > spin_lock_irqsave_sdp_contention(sdp, &flags); > > > > >@@ -1429,7 +1430,8 @@ void srcu_barrier(struct srcu_struct *ssp) > > > > > > > > > > idx = srcu_read_lock(ssp); > > > > > if (smp_load_acquire(&ssp->srcu_size_state) < SRCU_SIZE_WAIT_BARRIER) > > > > >- srcu_barrier_one_cpu(ssp, per_cpu_ptr(ssp->sda, 0)); > > > > >+ srcu_barrier_one_cpu(ssp, per_cpu_ptr(ssp->sda, > > > > >+ cpumask_first(cpu_online_mask))); > > > > > > > > Due to CPU hotplug, CPU0 have been offline, assume srcu_size_state is still SRCU_SIZE_WAIT_CALL. > > > > in this time, cpumask_first(cpu_online_mask) not return 0, this will cause the srcu barrier to return directly. > > > > > > Thanks for catching this bug. > > > > Indeed, good catch! > > > > > It seems that get_boot_cpu_id() can tackle the whole issue. I will > > > wait for more comments and send out V2. > > > > Does your patch require the CPU indicated by get_boot_cpu_id() to always > > stay online? If so, is that CPU guaranteed to always stay online? > > No, it is just a replacement for cpu 0. > > A qualified candidate should meet two requirements: > -1. worker_pool is initialized, so it has worker thread to dispatch > work. This is satisfied by once the cpu was onlined. > > -2. this cpu id is persistent as Zqiang pointed out. > > For hotplug, it is the same as cpu 0, which is taken care of by the > workqueue infrastructure. > > > I am still asking my question about PowerPC booting with a non-zero CPU. > > After all, if all remaining architectures always boot with CPU 0, then > > perhaps some simplification is in order. > > As my understanding, it is not easy. Let us unfold the discussion in > another thread post by you. As you wish... First, let me see if I understand the options. 1. Take your initial patch, but fixing the issue pointed out by Zhang Qiang. One issue is that cpu_online_mask is not initialized until boot_cpu_init() is invoked. Given that SRCU is used by tracing, we must correctly handle early boot usage. 2. Update your initial patch to fix the issue pointed out by Zhang Qiang and also to use get_boot_cpu_id() as you suggested earlier. Except that the __boot_cpu_id variable read by get_boot_cpu_id() is also initialized by boot_cpu_init(), and is zero up to that point. So this has the same problem as #1 above. 3. As in #2 above, but modify srcu_init() to check for SRCU callbacks queued on CPU 0 when boot_cpu_init() returns non-zero. In this case, move those SRCU callbacks to the srcu_data structure indicated by boot_cpu_init(). 4. Leave the SRCU code alone, but boot your crash kernel with srcutree.big_cpu_lim=0. This would switch all srcu_struct structures to big mode at initialization time. Once this switch completed, CPUs would queue SRCU callbacks on their own srcu_data structure's ->srcu_cblist. I am not optimistic about this one, but could you please try it? 5. Start over with the current SRCU code. Make srcu_delay_timer() and srcu_queue_delayed_work_on() check to see if sdp->cpu is online, and if not, use queue_work() instead of queue_work_on(). There might be other adjustments required, but I am not immediately seeing them. At the moment, #5 looks by far the best to me. But first, are there any options that I am missing? Or, for that matter, am I confused about any of the options that I called out? Thanx, Paul