On Mon, Feb 3, 2020 at 6:49 PM John Garry <john.garry@xxxxxxxxxx> wrote:
>
> On 03/02/2020 10:30, John Garry wrote:
> > On 01/02/2020 11:31, Thomas Gleixner wrote:
> >> Marc Zyngier <maz@xxxxxxxxxx> writes:
> >>> On Sat, 1 Feb 2020 09:31:17 +0800
> >>> Ming Lei <tom.leiming@xxxxxxxxx> wrote:
> >>>> On Sat, Feb 1, 2020 at 2:02 AM John Garry <john.garry@xxxxxxxxxx>
> >>>> wrote:
> >>>>
> >>>> gic_set_affinity shouldn't have switched out, so looks like one gic
> >>>> issue.
> >>>
> >>> Given that gic_set_affinity doesn't sleep, this looks pretty unlikely.
> >>> And __irq_set_affinity() holds a spinlock with irq disabled, so I can't
> >>> really explain how you'd get there. I've just booted a lockdep enabled
> >>> v5.5 on my D05, moved SPIs around (because that's the only way to reach
> >>> this code), and nothing caught fire.
> >>>
> >>> Either the stack trace isn't reliable (when I read things like
> >>> "80d:00h:35m:42s" in the trace, I'm a bit suspicious), or CPU hotplug is
> >>> doing something really funky here.
> >>
> >> The hotplug code cannot end up in schedule either and it holds desc lock
> >> as normal affinity setting.
> >> The other backtrace is more complete,
> >>
> >> [ 728.741808] rcu: INFO: rcu_preempt detected stalls on
> >> CPUs/tasks:80d:00h:35m:42s]
> >> [ 728.747895] rcu: 48-...0: (0 ticks this GP)
> >> idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=2626
> >> [ 728.757197] (detected by 63, t=5255 jiffies, g=40989, q=1890)
> >> [ 728.763018] Task dump for CPU 48:
> >> [ 728.766321] irqbalance R running task 0 1272 1
> >> 0x00000002
> >> [ 728.773358] Call trace:
> >> [ 728.775801] __switch_to+0xbc/0x218
> >> [ 728.779283] gic_set_affinity+0x16c/0x1d8
> >> [ 728.783282] irq_do_set_affinity+0x30/0xd0
> >> [ 728.787365] irq_set_affinity_locked+0xc8/0xf0
> >> [ 728.791796] __irq_set_affinity+0x4c/0x80
> >> [ 728.795794] write_irq_affinity.isra.7+0x104/0x120
> >> [ 728.800572] irq_affinity_proc_write+0x1c/0x28
> >> [ 728.805008] proc_reg_write+0x78/0xb8
> >> [ 728.808660] __vfs_write+0x18/0x38
> >> [ 728.812050] vfs_write+0xb4/0x1e0
> >> [ 728.815352] ksys_write+0x68/0xf8
> >> [ 728.818655] __arm64_sys_write+0x18/0x20
> >> [ 728.822567] el0_svc_common.constprop.2+0x64/0x160
> >> [ 728.827345] el0_svc_handler+0x20/0x80
> >> [ 728.831082] el0_sync_handler+0xe4/0x188
> >> [ 728.834991] el0_sync+0x140/0x180
> >>
> >> But the __switch_to() there definitely does not make any sense at all.
> >
> > I have put a full kernel log here for another run (hang snippet at end),
> > including scripts:
> >
> > https://pastebin.com/N4Gi5it6
> >
> > This does not look good:
> >
> > Jobs: 6 ([ 519.858094] nvme nvme1: controller is down; will reset:
> > CSTS=0xffffffff, PCI_STATUS=0xffff
> >
> > And some NVMe error also coincides with the hang. Another run has this:
> >
> > [ 247.015206] pcieport 0000:00:08.0: can't change power state from
> > D3cold to D0 (config space inaccessible)
> >
> > I did test v5.4 previously and did not see this, but that would have
> > included the v4 patchset in the $subject. I'll test v5.4 without that
> > patchset now.
>
> So v5.4 does have this issue also:

v5.5?
>
> [ 705.669512] psci: CPU24 killed (polled 0 ms)
> [ 705.966753] CPU25: shutdown
> [ 705.969548] psci: CPU25 killed (polled 0 ms)
> [ 706.250771] CPU26: shutdown=2347MiB/s,w=0KiB/s][r=601k,w=0 IOPS][eta
> 00m:13s]
> [ 706.253565] psci: CPU26 killed (polled 0 ms)
> [ 706.514728] CPU27: shutdown
> [ 706.517523] psci: CPU27 killed (polled 0 ms)
> [ 706.826708] CPU28: shutdown
> [ 706.829502] psci: CPU28 killed (polled 0 ms)
> [ 707.130916] CPU29: shutdown=2134MiB/s,w=0KiB/s][r=546k,w=0 IOPS][eta
> 00m:12s]
> [ 707.133714] psci: CPU29 killed (polled 0 ms)
> [ 707.439066] CPU30: shutdown
> [ 707.441870] psci: CPU30 killed (polled 0 ms)
> [ 707.706727] CPU31: shutdown
> [ 707.709526] psci: CPU31 killed (polled 0 ms)
> [ 708.521853] pcieport 0000:00:08.0: can't change power state from
> D3cold to D0 (config space inaccessible)
> [ 728.741808] rcu: INFO: rcu_preempt detected stalls on
> CPUs/tasks:80d:00h:35m:42s]
> [ 728.747895] rcu: 48-...0: (0 ticks this GP)
> idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=2626
> [ 728.757197] (detected by 63, t=5255 jiffies, g=40989, q=1890)
> [ 728.763018] Task dump for CPU 48:
> [ 728.766321] irqbalance R running task 0 1272 1
> 0x00000002
> [ 728.773358] Call trace:
> [ 728.775801] __switch_to+0xbc/0x218
> [ 728.779283] gic_set_affinity+0x16c/0x1d8
> [ 728.783282] irq_do_set_affinity+0x30/0xd0
> [ 728.787365] irq_set_affinity_locked+0xc8/0xf0
> [ 728.791796] __irq_set_affinity+0x4c/0x80
> [ 728.795794] write_irq_affinity.isra.7+0x104/0x120
> [ 728.800572] irq_affinity_proc_write+0x1c/0x28
> [ 728.805008] proc_reg_write+0x78/0xb8
> [ 728.808660] __vfs_write+0x18/0x38
> [ 728.812050] vfs_write+0xb4/0x1e0
> [ 728.815352] ksys_write+0x68/0xf8
> [ 728.818655] __arm64_sys_write+0x18/0x20
> [ 728.822567] el0_svc_common.constprop.2+0x64/0x160
> [ 728.827345] el0_svc_handler+0x20/0x80
> [ 728.831082] el0_sync_handler+0xe4/0x188
> [ 728.834991] el0_sync+0x140/0x180
> [ 738.993844] nvme nvme1: controller is down; will reset:
> CSTS=0xffffffff, PCI_STATUS=0xffff
> [ 791.761810] rcu: INFO: rcu_preempt detected stalls on
> CPUs/tasks:63d:14h:24m:13s]
> [ 791.767897] rcu: 48-...0: (0 ticks this GP)
> idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=10495
> [ 791.777281] (detected by 54, t=21010 jiffies, g=40989, q=2396)
> [ 791.783189] Task dump for CPU 48:
> [ 791.786491] irqbalance R running task 0 1272 1
> 0x00000002
> [ 791.793528] Call trace:
> [ 791.795964] __switch_to+0xbc/0x218
> [ 791.799441] gic_set_affinity+0x16c/0x1d8
> [ 791.803439] irq_do_set_affinity+0x30/0xd0
> [ 791.807522] irq_set_affinity_locked+0xc8/0xf0
> [ 791.811953] __irq_set_affinity+0x4c/0x80
> [ 791.815949] write_irq_affinity.isra.7+0x104/0x120
> [ 791.820727] irq_affinity_proc_write+0x1c/0x28
> [ 791.825158] proc_reg_write+0x78/0xb8
> [ 791.828808] __vfs_write+0x18/0x38
> [ 791.832197] vfs_write+0xb4/0x1e0
> [ 791.835500] ksys_write+0x68/0xf8
> [ 791.838802] __arm64_sys_write+0x18/0x20
> [ 791.842712] el0_svc_common.constprop.2+0x64/0x160
> [ 791.847490] el0_svc_handler+0x20/0x80
> [ 791.851226] el0_sync_handler+0xe4/0x188
> [ 791.855135] el0_sync+0x140/0x180
> Jobs: 6 (f=6): [R(6)][0.0%][r=0KiB/s

Can you trigger it after disabling irqbalance?

Thanks,
Ming Lei
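[Editor's note: the trace above shows irqbalance stalled inside a write to /proc/irq/*/smp_affinity, which is why the request is to retest with irqbalance out of the picture. A minimal sketch of what pinning an IRQ by hand looks like is below; the IRQ number 100 is a placeholder (pick a real one from /proc/interrupts), and /proc/irq/N/smp_affinity takes a hexadecimal CPU bitmask, so CPU 48 is bit 48. The sketch only prints the commands rather than running them, since the writes need root on a live system.]

```shell
#!/bin/sh
# Sketch only: compute the smp_affinity mask for pinning one IRQ to one CPU
# with irqbalance stopped. IRQ 100 is a hypothetical placeholder.
cpu=48                              # the CPU named in the stall report above
mask=$(printf '%x' $((1 << cpu)))   # smp_affinity wants a hex CPU bitmask

# Print the commands one would run as root on the target machine:
echo "systemctl stop irqbalance"                 # stop the daemon first
echo "echo $mask > /proc/irq/100/smp_affinity"   # then move the IRQ by hand
```

If the hang still reproduces with irqbalance stopped and the affinity written manually, that would rule irqbalance itself out and point back at the affinity path or CPU hotplug.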