On 03/02/2020 15:43, Marc Zyngier wrote:
On 2020-02-03 12:56, John Garry wrote:
[...]
Can you trigger it after disabling irqbalance?
No. I tested by killing the irqbalance process, and it then ran for 25
minutes without issue.
OK, that's interesting.
Can you find out whether irqbalance tries to move an interrupt to an
offlined CPU?
Just putting a trace into gic_set_affinity() should be enough.
Hi Marc,
I should have mentioned this already, but this board is the same D06
which I reported had the CPU0 hotplug issue from broken FW, if you remember:
https://lore.kernel.org/linux-arm-kernel/fd70f499-83b4-2fdd-d043-ea9ab8f2c636@xxxxxxxxxx/
That's why I'm not including CPU0 in my hotplug testing. Other CPUs were
fine. And it doesn't look like that issue.
Apart from that, I tried as you suggested, with this change:
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1143,6 +1143,9 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
int enabled;
u64 val;
+ pr_err("%s irq%d mask_val=%*pbl cpu_online_mask=%*pbl force=%d cpumask_any_and=%d\n", __func__, d->irq, cpumask_pr_args(mask_val), cpumask_pr_args(cpu_online_mask), force, cpumask_any_and(cpu_online_mask, mask_val));
if (force)
cpu = cpumask_first(mask_val);
else
@@ -1176,6 +1179,9 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
irq_data_update_effective_affinity(d, cpumask_of(cpu));
+ pr_err("%s1 irq%d mask_val=%*pbl cpu_online_mask=%*pbl force=%d cpumask_any_and=%d cpu=%d\n", __func__, d->irq, cpumask_pr_args(mask_val), cpumask_pr_args(cpu_online_mask), force, cpumask_any_and(cpu_online_mask, mask_val), cpu);
+
return IRQ_SET_MASK_OK_DONE;
}
And see this:
polled 0 ms)
[ 947.551340] GICv3: gic_set_affinity irq5 mask_val=24-47
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44
[ 947.560986] GICv3: gic_set_affinity1 irq5 mask_val=24-47
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44 cpu=44
[ 947.571321] GICv3: gic_set_affinity irq8 mask_val=24-47
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44
[ 947.580963] GICv3: gic_set_affinity1 irq8 mask_val=24-47
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44 cpu=44
[ 947.591581] IRQ 819: no longer affine to CPU43
[ 947.596149] CPU43: shutdown
[ 947.598945] psci: CPU43 killed (polled 0 ms)
[ 947.607029] GICv3: gic_set_affinity irq5 mask_val=29-47
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44
[ 968.614971] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 968.621062] rcu: 66-...0: (0 ticks this GP)
idle=d9a/1/0x4000000000000000 softirq=3654/3654 fqs=2625
[ 968.630365] (detected by 69, t=5256 jiffies, g=51577, q=1884)
[ 968.636187] Task dump for CPU 66:
[ 968.639490] irqbalance R running task 0 1577 1
0x00000002
[ 968.646527] Call trace:
[ 968.648970] __switch_to+0xbc/0x218
[ 968.652450] irq_do_set_affinity+0x30/0xd0
[ 968.656534] irq_set_affinity_locked+0xc8/0xf0
[ 968.660965] __irq_set_affinity+0x4c/0x80
[ 968.664963] write_irq_affinity.isra.7+0x104/0x120
[ 968.669741] irq_affinity_proc_write+0x1c/0x28
[ 968.674175] proc_reg_write+0x78/0xb8
[ 968.677827] __vfs_write+0x18/0x38
[ 968.681217] vfs_write+0xb4/0x1e0
[ 968.684519] ksys_write+0x68/0xf8
[ 968.687822] __arm64_sys_write+0x18/0x20
[ 968.691733] el0_svc_common.constprop.2+0x64/0x160
[ 968.696511] el0_svc_handler+0x20/0x80
[ 968.700247] el0_sync_handler+0xe4/0x188
[ 968.704157] el0_sync+0x140/0x180
and this on a 2nd run:
[ 215.468476] CPU42: shutdown
[ 215.471272] psci: CPU42 killed (polled 0 ms)
[ 215.723714] GICv3: gic_set_affinity irq5 mask_val=24-47
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44
[ 215.733360] GICv3: gic_set_affinity1 irq5 mask_val=24-47
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44 cpu=44
[ 215.743696] GICv3: gic_set_affinity irq8 mask_val=24-47
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44
[ 215.753338] GICv3: gic_set_affinity1 irq8 mask_val=24-47
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44 cpu=44
[ 215.763835] IRQ 426: no longer affine to CPU43
[ 215.768412] IRQ 819: no longer affine to CPU43
[ 215.773023] CPU43: shutdown
[ 215.775834] psci: CPU43 killed (polled 0 ms)
[ 216.604779] GICv3: gic_set_affinity irq10 mask_val=53
cpu_online_mask=0,44-95 force=0 cpumask_any_and=53
[ 217.223461] pcieport 0000:00:08.0: can't change power state from
D3cold to D0 (config space inaccessible)
[ 237.615383] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 237.621469] rcu: 58-...0: (1 GPs behind)
idle=b5e/1/0x4000000000000000 softirq=1908/1908 fqs=2626
[ 237.630525] (detected by 44, t=5254 jiffies, g=12137, q=191)
[ 237.636260] Task dump for CPU 58:
[ 237.639563] irqbalance R running task 0 1567 1
0x00000002
[ 237.646599] Call trace:
[ 237.649037] __switch_to+0xbc/0x218
[ 237.652513] 0xffff80001529bd68
[ 239.283412] nvme nvme1: controller is down; will reset:
CSTS=0xffffffff, PCI_STATUS=0xffff
[ 300.635382] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 300.641466] rcu: 58-...0: (1 GPs behind)
idle=b5e/1/0x4000000000000000 softirq=1908/1908 fqs=10503
[ 300.650589] (detected by 44, t=21010 jiffies, g=12137, q=698)
[ 300.656410] Task dump for CPU 58:
[ 300.659712] irqbalance R running task 0 1567 1
0x00000002
[ 300.666747] Call trace:
Info about irq5 and irq10 after booting:
john@ubuntu:~$ ls /proc/irq/5
affinity_hint effective_affinity_list smp_affinity spurious
effective_affinity node smp_affinity_list uart-pl011
john@ubuntu:~$ more /proc/irq/5/smp_affinity_list
24-47
john@ubuntu:~$ ls /proc/irq/10
affinity_hint ehci_hcd:usb1 ohci_hcd:usb3 smp_affinity_list
effective_affinity ehci_hcd:usb2 ohci_hcd:usb4 spurious
effective_affinity_list node smp_affinity
john@ubuntu:~$ more /proc/irq/10/smp_affinity_list
71
Thanks,
John