Excerpts from Sachin Sant's message of June 27, 2021 9:23 pm: > >> On 27-Jun-2021, at 3:36 PM, Nicholas Piggin <npiggin@xxxxxxxxx> wrote: >>> >>> So there's definitely IRQMASK=0 and no MSR[EE]=0 in this frame, which is >>> what the warning was. >>> >>> I'd say either something hasn't set PACA_IRQ_HARD_DIS properly, so EE >>> doesn't get enabled when irqs are restored, or maybe the change to >>> arch_local_irq_restore(). Less likely that the stack got messed up. >>> >>> Can you try run with CONFIG_PPC_IRQ_SOFT_MASK_DEBUG=y ? >> >> Nevermind, I think I've found the problem. Some code runs in the >> implicit soft-mask region without expecting to be masked. Working >> on a fix… > > :-) . I was able to recreate this after few attempts. It seem the warning isn’t > always triggered during boot. I had to run a kernel compile operation after > boot to trigger this warning again. > > In case its helpful here is the additional trace with PPC_IRQ_SOFT_MASK_DEBUG. Thanks. I ended up being able to reproduce as well, quite frequently with some extra debug checks that specifically catch more cases. I've got a few patches under test right now, very stable so far. I'll post them out if they survive a nother hour or two stress testing. The problem is some code (e.g., ret_from_fork) now gets implicitly soft-masked where that was not expecting to be. A masked interrupt might hit, and then when it moves out of the implicit soft-mask region it does not re-enable interrupts. Some types of pending interrupts will clear MSR[EE], and that ends up causing this bug on the next interrupt that happens. Not a wonderful escape :\ thanks for finding it. The fixes aren't too bad, fortunately. Thanks, Nick > > [ 92.106731] ------------[ cut here ]------------ > [ 92.106738] WARNING: CPU: 45 PID: 12757 at arch/powerpc/kernel/irq.c:255 arch_local_irq_restore+0x1d0/0x200 > [ 92.106753] Modules linked in: dm_mod bonding nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables libcrc32c nfnetlink sunrpc pseries_rng xts vmx_crypto uio_pdrv_genirq uio sch_fq_codel ip_tables ext4 mbcache jbd2 sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp fuse > [ 92.106828] CPU: 45 PID: 12757 Comm: sh Kdump: loaded Tainted: G W 5.13.0-rc7-next-20210625 #1 > [ 92.106841] NIP: c0000000000164d0 LR: c000000000cedaa8 CTR: 0000000000000000 > [ 92.106849] REGS: c00000008dfeb7e0 TRAP: 0700 Tainted: G W (5.13.0-rc7-next-20210625) > [ 92.106859] MSR: 8000000002823033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE> CR: 28004222 XER: 00000000 > [ 92.106892] CFAR: c00000000001632c IRQMASK: 0 > GPR00: c000000000ceda98 c00000008dfeba80 c000000002921e00 0000000000000000 > GPR04: 0000000000000000 0000000000000000 0000000000000000 00000000000000ff > GPR08: 0000000000000001 0000000000000000 0000000000000001 0000000000000017 > GPR12: 0000000024004822 c000000007fb9200 000000012efd81d4 000000012ee50000 > GPR16: 0000000000000001 00000100268a0e00 000001002687ec10 0000000114200c40 > GPR20: 00003fffa93f8000 0000000000000000 00003fffa93f9300 000000012efb1988 > GPR24: 000000012ee7fe7c 000000012efccba0 000000012ee50000 c00000008d5d7600 > GPR28: c0000000314c0bc0 c000000040d9f100 c0000008beb5861c 4b72201a3063fe13 > [ 92.107024] NIP [c0000000000164d0] arch_local_irq_restore+0x1d0/0x200 > [ 92.107035] LR [c000000000cedaa8] _raw_spin_unlock_irqrestore+0x88/0xb0 > [ 92.107047] Call Trace: > [ 92.107052] [c00000008dfeba80] [c00000008dfebb50] 0xc00000008dfebb50 (unreliable) > [ 92.107065] [c00000008dfebab0] [238c5bf052df0858] 0x238c5bf052df0858 > [ 92.107076] [c00000008dfebae0] [c0000000008178e8] get_random_u64+0x88/0x100 > [ 92.107090] [c00000008dfebb20] [c000000000020134] arch_randomize_brk+0xb4/0xd8 > [ 92.107105] [c00000008dfebb50] [c0000000005430b0] load_elf_binary+0xe70/0x1220 > [ 92.107119] [c00000008dfebc40] [c00000000047ded0] bprm_execve+0x410/0x800 > [ 92.107132] [c00000008dfebd10] [c00000000047e8ec] do_execveat_common.isra.44+0x21c/0x240 > [ 92.107145] [c00000008dfebd80] [c00000000047e964] sys_execve+0x54/0x70 > [ 92.107157] [c00000008dfebdb0] [c000000000032334] system_call_exception+0x164/0x2e0 > [ 92.107169] [c00000008dfebe10] [c00000000000c464] system_call_common+0xf4/0x258 > [ 92.107185] --- interrupt: c00 at 0x3fff9bb6b8a8 > [ 92.107193] NIP: 00003fff9bb6b8a8 LR: 00003fff9bb6c240 CTR: 0000000000000000 > [ 92.107202] REGS: c00000008dfebe80 TRAP: 0c00 Tainted: G W (5.13.0-rc7-next-20210625) > [ 92.107213] MSR: 800000000000f033 <SF,EE,PR,FP,ME,IR,DR,RI,LE> CR: 28004224 XER: 00000000 > [ 92.107243] IRQMASK: 0 > GPR00: 000000000000000b 00003fffc36a1440 00003fff9bc87300 00000100268a67d0 > GPR04: 0000010026887e50 0000010026882c50 fefefefefefefeff 7f7f7f7f7f7f7f7f > GPR08: 00000100268a67d0 0000000000000000 0000000000000000 0000000000000000 > GPR12: 0000000000000000 00003fff9bce3780 0000000114200db4 0000000000000000 > GPR16: 0000000000000001 00000100268a0e00 000001002687ec10 0000000114200c40 > GPR20: 00000001141dd820 0000000000000000 00000001141dd740 0000000114204358 > GPR24: 0000000114203948 0000010026876454 0000000000000001 0000010026882c50 > GPR28: 0000010026887e50 0000010026882c50 00000100268a67d0 00003fffc36a1440 > [ 92.107369] NIP [00003fff9bb6b8a8] 0x3fff9bb6b8a8 > [ 92.107378] LR [00003fff9bb6c240] 0x3fff9bb6c240 > [ 92.107386] --- interrupt: c00 > [ 92.107393] Instruction dump: > [ 92.107400] 7d2000a6 71298000 40820048 39200000 992d0152 39400000 992d0153 614a8002 > [ 92.107427] 7d410164 4bfffe6c 60000000 60000000 <0fe00000> 4bfffe5c 60000000 60000000 > [ 92.107451] ---[ end trace 5f1d49fb99f3613d ]— > > Complete dmesg log attached. > > Thanks > -Sachin > >