Andy Shevchenko <andriy.shevchenko@xxxxxxxxxxxxxxx> writes:
> On Mon, Jan 31, 2022 at 12:30:14PM +0530, Abdul Haleem wrote:
>> Greeting's
>>
>> Today's linux-next kernel failed to boot 5.17.0-rc1-next-20220128 with kernel Oops on PowerVM LPAR
>>
>> dmesg:
>> Started hybrid virtual network scan and config.
>> Started VDO volume services.
>> Started Dynamic System Tuning Daemon.
>> Attempted to run process '/opt/rsct/bin/trspoolmgr' with NULL argv
>> Created slice system-systemd\x2dcoredump.slice.
>> Started Process Core Dump (PID 3726/UID 0).
>> Started Process Core Dump (PID 4032/UID 0).
>> Started Process Core Dump (PID 4200/UID 0).
>> Started RMC-Resource Monitioring and Control.
>> Started Process Core Dump (PID 4319/UID 0).
>> rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
>> rpaphp: Slot [U78D2.001.WZS01DT-P1-C10] registered
>> Started Process Core Dump (PID 4687/UID 0).
>> Started Process Core Dump (PID 4806/UID 0).
>> Started Process Core Dump (PID 4973/UID 0).
>> Async-gnnft timeout - hdl=7.
>> BUG: Unable to handle kernel data access on read at 0x5deadbeef000012a
>> Faulting instruction address: 0xc000000000221f4c
>> Oops: Kernel access of bad area, sig: 11 [#1]
>> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
>> Modules linked in: rpadlpar_io rpaphp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag bonding rfkill sunrpc dm_round_robin dm_multipath dm_mod ocrdma ib_uverbs ib_core pseries_rng xts vmx_crypto uio_pdrv_genirq gf128mul uio sch_fq_codel ext4 mbcache jbd2 sd_mod sg qla2xxx ibmvscsi ibmveth scsi_transport_srp nvme_fc nvme_fabrics nvme_core be2net t10_pi scsi_transport_fc
>> CPU: 8 PID: 5782 Comm: mksquashfs Not tainted 5.17.0-rc1-next-20220128-autotest #1
>> NIP: c000000000221f4c LR: c000000000221ee4 CTR: 0000000000000006
>> REGS: c0000001008bb920 TRAP: 0380 Not tainted (5.17.0-rc1-next-20220128-autotest)
>> MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004224 XER: 20040000
>> CFAR: c000000000220df8 IRQMASK: 1
>> GPR00: c000000000221ee4 c0000001008bbbc0 c0000000028d6100 00000000ffff9cc0
>> GPR04: 0000000000000000 00000000000001c0 00000000000001c0 0000000000000000
>> GPR08: 0000000000000000 5deadbeef0000122 c0000001008bbbe8 0000000000000001
>> GPR12: c000000000221d50 c000000007fe6680 00007fffbcf00000 0000000000000100
>> GPR16: 0000000000000100 00007fffbce64420 0000000000000001 0000000000000002
>> GPR20: c000000002912108 5deadbeef0000122 c00000079c008668 0000000000000000
>> GPR24: 0000000000000000 c0000001008bbbe8 00000000ffff9540 c000000002913a00
>> GPR28: 0000000000000000 c0000000a4560950 c00000079c008600 c0000000020e8600
>> NIP [c000000000221f4c] run_timer_softirq+0x1fc/0x7c0
>> LR [c000000000221ee4] run_timer_softirq+0x194/0x7c0
>> Call Trace:
>> [c0000001008bbbc0] [c000000000221ee4] run_timer_softirq+0x194/0x7c0 (unreliable)
>> [c0000001008bbc90] [c000000000ca7e5c] __do_softirq+0x15c/0x3d0
>> [c0000001008bbd80] [c00000000014f538] irq_exit+0x168/0x1b0
>> [c0000001008bbdb0] [c000000000027184] timer_interrupt+0x1a4/0x3e0
>> [c0000001008bbe10] [c000000000009a08] decrementer_common_virt+0x208/0x210
>> --- interrupt: 900 at 0x7fffbccb3850
>> NIP: 00007fffbccb3850 LR: 00007fffbccb51f8 CTR: 0000000000000000
>> REGS: c0000001008bbe80 TRAP: 0900 Not tainted (5.17.0-rc1-next-20220128-autotest)
>> MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 42004488 XER: 20040000
>> CFAR: 0000000000000000 IRQMASK: 0
>> GPR00: 0000000000007fff 00007ffe8afde4d0 00007fffbcce7f00 00007ffe74020c50
>> GPR04: 0000000000006bfa 0000000000003cf1 00007ffe740223a0 0000000000000005
>> GPR08: 00007ffe74028fb4 00000000000009a3 0000000000000004 0000000000000039
>> GPR12: 00007ffe740323b0 00007ffe8afe68e0 00007fffbcf00000 00007ffe8a7d0000
>> GPR16: 00007fffbce64410 00007fffbce64420 0000000000000000 00007fffbce60318
>> GPR20: 00007ffe740323b0 00007ffe740423c0 0000000000003fff 0000000000000005
>> GPR24: 0000000000000004 00007fffbccc8058 0000000000000000 000000000000000c
>> GPR28: 0000000000000102 0000000000004415 00007ffe7402df8b 00007ffe7402e08d
>> NIP [00007fffbccb3850] 0x7fffbccb3850
>> LR [00007fffbccb51f8] 0x7fffbccb51f8
>> --- interrupt: 900
>> Instruction dump:
>> 60000000 e9390000 2fa90000 419effc8 ebb90000 fbbe0008 60000000 e93d0000
>> e95d0008 2fa90000 f92a0000 419e0008 <f9490008> 813d0020 fb1d0008 ea9d0018
>> ---[ end trace 0000000000000000 ]---
>>
>> The fault instruction points to
>>
>> # gdb -batch /boot/vmlinuz-5.17.0-rc1-next-20220128-autotest -ex 'list *(0xc000000000221f4c)'
>> 0xc000000000221f4c is in run_timer_softirq (./include/linux/list.h:850).
>> 845         struct hlist_node *next = n->next;
>> 846         struct hlist_node **pprev = n->pprev;
>> 847
>> 848         WRITE_ONCE(*pprev, next);
>> 849         if (next)
>> 850                 WRITE_ONCE(next->pprev, pprev);
>> 851 }
>> 852
>> 853 /**
>> 854  * hlist_del - Delete the specified hlist_node from its list
>
> It's quite likely not a culprit, but the result of some (race?) condition.
> Cc'ing to Thomas, maybe he has an idea.

The disassembly says we're storing r10 to r9 + 8. If you look at the first
register dump, r9 is 5deadbeef0000122 which is:

  #define LIST_POISON2  ((void *) 0x122 + POISON_POINTER_DELTA)

So we seem to be deleting a list entry that's already been deleted?

cheers