On Thu, 21 Sep 2017 07:35:20 +0000
"Gonglei (Arei)" <arei.gonglei@xxxxxxxxxx> wrote:

> Hi Paolo, Alex,
>
> We hit a hard lockup bug in our tests while started VM with passthrough NIC.
> Unfortunately, we didn't get the available vmcore to dig into. And it is quite difficult
> To reproduce, actually, we only hit this problem once.
> Does anyone hit such problem ? Or any idea ?
> (For the complete log, please see the attachment).

This is not an upstream or a RHEL kernel, so I don't have the sources
to do much analysis. Based on the kernel version number, I'm guessing
this is some derivative of a RHEL-7.2 kernel. Can it be reproduced on
upstream (this is an upstream list)? The ghes functions in the dump
might indicate that a hardware error was triggered, and the firmware
logs might provide more indication of the problem. Thanks,

Alex

> [93137.701139] ------------[ cut here ]------------
> [93137.701146] WARNING: at kernel/workqueue.c:2131 process_one_work+0x2e8/0x470()
> [93137.701147] Modules linked in: ext4 jbd2 ixgbe(O) dev_connlimit(O) igb_uio(OE) uio bum(O) ip_set nfnetlink prio(O) nat(O) vport_vxlan(O) openvswitch(O) nf_defrag_ipv6 gre signo_catch(O) hotpatch(OE) kboxdriver(O) ipmi_devintf ipmi_si ipmi_msghandler kbox(O) mlx5_ib i40e(OE) ib_core ib_addr vxlan coretemp crc32_pclmul ip6_udp_tunnel crc32c_intel udp_tunnel ptp kvm_intel(O) ghash_clmulni_intel aesni_intel dca kvm(O) pps_core lrw gf128mul glue_helper sg ablk_helper cryptd pcspkr mlx5_core acpi_cpufreq shpchp acpi_power_meter acpi_pad remote_trigger(O) nf_conntrack_ipv4 nf_defrag_ipv4 vhost_net(O) tun(O) vhost(O) macvtap macvlan vfio_pci irqbypass vfio_iommu_type1 vfio xt_sctp nf_conntrack_proto_sctp nf_nat_proto_sctp nf_nat nf_conntrack sctp libcrc32c ip_tables ext3 mbcache jbd sd_mod crc_t10dif crct10dif_generic
> [93137.701171] crct10dif_pclmul crct10dif_common mpt3sas raid_class ahci scsi_transport_sas libahci libata dm_mod [last unloaded: ixgbe]
> [93137.701178] CPU: 1 PID: 119983 Comm: kworker/0:7 Tainted: G W OE ---- ------- 3.10.0-327.55.58.94_29.x86_64 #1
> [93137.701179] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.19 06/22/2017
> [93137.701183] 0000000000000000 000000005abbce18 ffff880225cb7dd0 ffffffff8164791f
> [93137.701185] ffff880225cb7e08 ffffffff8107b220 ffff88282066c4a8 ffff880868ed3c90
> [93137.701186] ffff881ffec16080 ffff881ffec1c000 0000000000000000 ffff880225cb7e18
> [93137.701188] ffffffff8107b36a ffff880225cb7e60 ffffffff8109de58 ffff881ffec16098
> [93137.701189] ffff880868ed3c00 ffff881ffec16098 ffff880868ed3cc0 ffff880649b09980
> [93137.701190] ffff880868ed3c90 ffff881ffec16080 ffff880225cb7ec0 ffffffff8109eabb
> [93137.701191] ffff880225cb7fd8 0000000000016840 ffff880649b09980 ffff880649b09980
> [93137.701193] ffff880649b09980 ffff8809065bfd38 ffff880868ed3c90 ffffffff8109e9a0
> [93137.701194] 0000000000000000 0000000000000000 ffff880225cb7f48 ffffffff810a61ff
> [93137.701195] 0000000000000000 ffff880225cb7f68 ffff880868ed3c90 ffff881f00000000
> [93137.701196] ffff882900000000 ffff880225cb7ef8 ffff880225cb7ef8 ffff881f00000000
> [93137.701198] ffff881f00000000 ffff880225cb7f18 ffff880225cb7f18 000000005abbce18
> [93137.701199] ffffffff810a6130 0000000000000000 0000000000000000 ffff8809065bfd38
> [93137.701200] ffffffff81657b98 0000000000000000 0000000000000000 0000000000000000
> [93137.701202] 0000000000000000 ffff8809065bfd38 ffffffff810a6130 0000000000000000
> [93137.701203] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [93137.701204] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [93137.701205] ffffffffffffffff 0000000000000000 0000000000000010 0000000000000202
> [93137.701207] ffff880225cb7f58 0000000000000018
> [93137.701208] Call Trace:
> [93137.701212] [<ffffffff8164791f>] dump_stack+0x19/0x1b
> [93137.701215] [<ffffffff8107b220>] warn_slowpath_common+0x70/0xb0
> [93137.701216] [<ffffffff8107b36a>] warn_slowpath_null+0x1a/0x20
> [93137.701218] [<ffffffff8109de58>] process_one_work+0x2e8/0x470
> [93137.701219] [<ffffffff8109eabb>] worker_thread+0x11b/0x400
> [93137.701221] [<ffffffff8109e9a0>] ? rescuer_thread+0x400/0x400
> [93137.701223] [<ffffffff810a61ff>] kthread+0xcf/0xe0
> [93137.701224] [<ffffffff810a6130>] ? kthread_create_on_node+0x140/0x140
> [93137.701227] [<ffffffff81657b98>] ret_from_fork+0x58/0x90
> [93137.701228] [<ffffffff810a6130>] ? kthread_create_on_node+0x140/0x140
> [93137.701229] ---[ end trace 22f458263368a87f ]---
> [93139.432279] ixgbe 0000:5e:00.1 eth7: VF Reset msg received from vf 9
> [93139.442787] ixgbe 0000:5e:00.1 eth7: VF 9 requested invalid api version 3
> [93139.444842] vfio-pci 0000:5e:12.3: irq 235 for MSI/MSI-X
> [93141.699745] vfio-pci 0000:5e:11.5: irq 237 for MSI/MSI-X
> [93141.699998] vfio-pci 0000:5e:11.5: irq 238 for MSI/MSI-X
> [93141.701442] vfio-pci 0000:5e:11.5: irq 237 for MSI/MSI-X
> [93141.701584] vfio-pci 0000:5e:11.5: irq 238 for MSI/MSI-X
> [93141.701720] vfio-pci 0000:5e:11.5: irq 242 for MSI/MSI-X
> [93142.279728] kvm [186763]: vcpu0 ignored wrmsr: 0x1c9 data 3
> [93142.279736] kvm [186763]: vcpu0 ignored wrmsr: 0x1a6 data 11
> [93142.279738] kvm [186763]: vcpu0 ignored wrmsr: 0x1a7 data 11
> [93142.279740] kvm [186763]: vcpu0 ignored wrmsr: 0x3f6 data 11
> [93142.279743] kvm [186763]: vcpu0 ignored wrmsr: 0x3f7 data 11
> [93145.421969] ixgbe 0000:5e:00.1 eth7: VF Reset msg received from vf 20
> [93145.432473] ixgbe 0000:5e:00.1 eth7: VF 20 requested invalid api version 3
> [93145.434743] vfio-pci 0000:5e:15.1: irq 243 for MSI/MSI-X
> [93156.975002] vfio-pci 0000:5e:11.7: irq 233 for MSI/MSI-X
> [93156.975280] vfio-pci 0000:5e:11.7: irq 245 for MSI/MSI-X
> [93156.977146] vfio-pci 0000:5e:11.7: irq 233 for MSI/MSI-X
> [93156.977386] vfio-pci 0000:5e:11.7: irq 245 for MSI/MSI-X
> [93156.977654] vfio-pci 0000:5e:11.7: irq 256 for MSI/MSI-X
> [93160.455435] vfio-pci 0000:5e:12.1: irq 252 for MSI/MSI-X
> [93160.455710] vfio-pci 0000:5e:12.1: irq 262 for MSI/MSI-X
> [93160.457794] vfio-pci 0000:5e:12.1: irq 252 for MSI/MSI-X
> [93160.458043] vfio-pci 0000:5e:12.1: irq 262 for MSI/MSI-X
> [93160.458177] vfio-pci 0000:5e:12.1: irq 268 for MSI/MSI-X
> [93171.874151] vfio-pci 0000:5e:11.1: irq 230 for MSI/MSI-X
> [93171.874258] vfio-pci 0000:5e:11.1: irq 271 for MSI/MSI-X
> [93171.875976] vfio-pci 0000:5e:11.1: irq 230 for MSI/MSI-X
> [93171.876078] vfio-pci 0000:5e:11.1: irq 271 for MSI/MSI-X
> [93171.876154] vfio-pci 0000:5e:11.1: irq 272 for MSI/MSI-X
> [93172.575386] vfio-pci 0000:5e:12.3: irq 235 for MSI/MSI-X
> [93172.575518] vfio-pci 0000:5e:12.3: irq 280 for MSI/MSI-X
> [93172.577258] vfio-pci 0000:5e:12.3: irq 235 for MSI/MSI-X
> [93172.577379] vfio-pci 0000:5e:12.3: irq 280 for MSI/MSI-X
> [93172.577450] vfio-pci 0000:5e:12.3: irq 281 for MSI/MSI-X
> [93185.691658] vfio-pci 0000:5e:15.1: irq 243 for MSI/MSI-X
> [93185.691783] vfio-pci 0000:5e:15.1: irq 282 for MSI/MSI-X
> [93185.693707] vfio-pci 0000:5e:15.1: irq 243 for MSI/MSI-X
> [93185.693805] vfio-pci 0000:5e:15.1: irq 282 for MSI/MSI-X
> [93185.693875] vfio-pci 0000:5e:15.1: irq 283 for MSI/MSI-X
> [93599.715862] [sched_delayed] sched: RT throttling activated
> [93692.404740] !!kbox:catch rlock on cpu 16!starting process!!
> [93692.405008] [kbox] num_online_cpus: 63, cur_cpu: 16
> [93692.405009] cpu 16 will send nmi, the cpumask: 0xfffffffffffefffe
> [93692.417396] smp_nmi_call_function is finished. wait: 1, time: -1, timeout: 996500, cpumask: 0x0
> [93692.419042] kbox rlock cpu 16 :sending stop and flush finished
> [93698.011186] CPU: 16 PID: 154635 Comm: CPU 3/KVM Tainted: G W OE ---- ------- 3.10.0-327.55.58.94_29.x86_64 #1
> [93698.011188] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.19 06/22/2017
> [93698.011190] task: ffff88280451cc80 ti: ffff88292b160000 task.ti: ffff88292b160000
> [93698.011216] RIP: 0010:[<ffffffffa0589ea6>] [<ffffffffa0589ea6>] vmx_vcpu_run+0x666/0x750 [kvm_intel]
> [93698.011217] RSP: 0018:ffff88292b163cf0 EFLAGS: 00000082
> [93698.011218] RAX: 0000000080000202 RBX: ffff8825bffdb400 RCX: ffff8825bffdb400
> [93698.011218] RDX: 0000000000000200 RSI: ffff8825bffdb400 RDI: ffff8825bffdb400
> [93698.011218] RBP: ffff88292b163d50 R08: ffff7fffffffffff R09: 0000000000000000
> [93698.011219] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> [93698.011220] R13: 00007f7202374808 R14: 00007f71c2375000 R15: 000000000000007f
> [93698.011220] FS: 00007ff6f48d7700(0000) GS:ffff883ffca00000(0000) knlGS:ffff88013fcc0000
> [93698.011221] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [93698.011221] CR2: 00007f3903526000 CR3: 0000002808a78000 CR4: 00000000003427e0
> [93698.011222] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [93698.011222] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [93698.011222] Stack:
> [93698.011225] 0000000000000000 ffff8825bffdb400 0000000000000000 ffffffff816585c0
> [93698.011226] 00029516f44e1ac2 ffff8825bffdb400 00000000dde5f3e3 ffff8825bffdb400
> [93698.011226] 0000000000000000 0000000000000003 ffffffffa05855f0 0000000000000010
> [93698.011227] ffff88292b163dd0 ffffffffa03ea569 000000ef00000000 ffff8825bfa62000
> [93698.011227] ffff8825c2b2f2d8 ffff88292b163fd8 ffff88280451cc80 ffff88292cae0048
> [93698.011228] 0000000000000001 00000000dde5f3e3 ffffffffa04103b5 ffff8825bffdb400
> [93698.011229] ffff88292b163fd8 ffff88280451cc80 ffff88292cae0048 0000000000000001
> [93698.011229] ffff88292b163e18 ffffffffa03f2285 ffffffee7ffbfaff 00000000dde5f3e3
> [93698.011230] ffff8825bffdb400 ffff8828115bef40 0000000000000000 ffff88279d5be680
> [93698.011231] ffff88280451cc80 ffff88292b163eb0 ffffffffa03d9b51 0000000000000000
> [93698.011231] 0000000100000e40 0000000000000001 ffff882800f19b90 ffff88280451cc80
> [93698.011232] ffff882800f19b98 ffff882800f19ba0 0000000000000000 0000000000000000
> [93698.011233] 0000000000000001 ffff8827ff05bc10 00000000dde5f3e3 ffff88279d5be680
> [93698.011233] ffff883fcceff590 0000000000000000 0000000000000000 0000000000000001
> [93698.011234] ffff88292b163f28 ffffffff811fdf65 00000000dde5f3e3 ffff8827ff05bc00
> [93698.011234] 0000000000000008 0000000000000002 ffff88292b163f10 ffffffffa03e7dc5
> [93698.011235] 0000000000000000 ffff88292b163f58 00000000dde5f3e3 ffff88279d5be680
> [93698.011236] 000000000000001e 000000000000ae80 0000000000000000 ffff88292b163f78
> [93698.011236] ffffffff811fe1e1 0000000000000000 0000000102d4d010 00000000dde5f3e3
> [93698.011237] 0000000000000000 0000000002d4d010 000000000000ae80 000000000082b4e0
> [93698.011238] 0000000000000000 00007ff6f48d69f0 ffffffff81657c49 0000000000000246
> [93698.011238] 00007ff7012c7000 00000000000000ff 00000000008456f0 0000000000000010
> [93698.011239] ffffffffffffffff 0000000000000000 000000000000ae80 000000000000001e
> [93698.011239] 0000000000000010 00007ff6fbdcb7b7 0000000000000033 0000000000000246
> [93698.011240] Call Trace:
> [93698.011246] [<ffffffff816585c0>] ? irq_entries_start+0x300/0x400
> [93698.011249] [<ffffffffa05855f0>] ? vmx_inject_irq+0xf0/0xf0 [kvm_intel]
> [93698.011289] [<ffffffffa03ea569>] vcpu_enter_guest+0x639/0x1160 [kvm]
> [93698.011306] [<ffffffffa04103b5>] ? kvm_apic_local_deliver+0x65/0x70 [kvm]
> [93698.011314] [<ffffffffa03f2285>] kvm_arch_vcpu_ioctl_run+0xd5/0x440 [kvm]
> [93698.011319] [<ffffffffa03d9b51>] kvm_vcpu_ioctl+0x2b1/0x640 [kvm]
> [93698.011322] [<ffffffff811fdf65>] do_vfs_ioctl+0x2e5/0x4c0
> [93698.011328] [<ffffffffa03e7dc5>] ? kvm_on_user_return+0x75/0xb0 [kvm]
> [93698.011329] [<ffffffff811fe1e1>] SyS_ioctl+0xa1/0xc0
> [93698.011331] [<ffffffff81657c49>] system_call_fastpath+0x16/0x1b
> [93698.011338] Code: 8b 80 3c 02 00 00 a8 10 0f 84 2f fa ff ff e9 12 ff ff ff 66 90 85 c0 0f 89 c1 fc ff ff 48 8b 5d a8 48 89 df e8 1c 2c e5 ff cd 02 <48> 89 df e8 32 2c e5 ff e9 a6 fc ff ff 49 89 f8 49 c1 e8 0d 41
> [93698.011804] kbox:catch rlock!end process!
> [93698.014242] collected_len = 893933, LOG_BUF_LEN_LOCAL = 1048576
> [93700.363926] !!kbox:catch rlock on cpu 25!starting process!!
> [93700.363926] rlock:another rlock on another cpu 25
> [93700.363928] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 25
> [93700.363933] CPU: 25 PID: 155622 Comm: CPU 3/KVM Tainted: G W OE ---- ------- 3.10.0-327.55.58.94_29.x86_64 #1
> [93700.363934] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.19 06/22/2017
> [93700.363936] ffffffff8187c808 000000001c178ae5 ffff883ffcc45ae8 ffffffff8164791f
> [93700.363937] ffff883ffcc45b68 ffffffff8164105f 0000000000000010 ffff883ffcc45b78
> [93700.363938] ffff883ffcc45b18 000000001c178ae5 0000000000000000 0000000000000019
> [93700.363938] 0000000000000003 0000000000000000 ffffffff81cc8bfc ffffffff81cff00d
> [93700.363939] 0000000000000019 0000000000000000 ffff883ffcc45c40 0000000000000000
> [93700.363940] ffff883ffcc45b80 ffffffff8111e871 ffff883fd07944a8 ffff883ffcc45bf8
> [93700.363940] ffffffff81162091 000000001c178ae5 ffff883fd07944a8 fffffff9f2462404
> [93700.363941] 00000000f2462404 0000000000000001 ffff883ffcc4bbe0 ffff883ffcc45bf0
> [93700.363941] 000000001c178ae5 0000000000000001 ffff883ffcc4b9e0 ffff883fd07944a8
> [93700.363942] 0000000000000021 ffff883ffcc4bbe0 ffff883ffcc45c08 ffffffff81162b64
> [93700.363943] ffff883ffcc45e40 ffffffff810327f8 ffff883ffcc4c330 00000064fcc45c36
> [93700.363943] ffff883ffcc45ef8 00000000fcc45c68 0000000200000000 ffffc90000000000
> [93700.363944] ffffc90029494800 ffff883ffcc45d00 ffffffffa08689e9 0000000000000000
> [93700.363944] ffff883ffcc45c88 ffffffff8130ab22 ffffc900294932f8 000000060dba31b0
> [93700.363945] 0000000005080021 ffffffff8130bb13 0000000000000000 0000000000000000
> [93700.363946] 0000003f00000000 3235ffff8130dbf4 000000000001ffff 0000000000000000
> [93700.363946] 0000000000000000 ffffc900294932d8 0000000000001528 ffff883fd44d9f80
> [93700.363947] 00000000000557ec ffff883ffcc45d90 ffffffff813029d1 ffffc900181f0fff
> [93700.363947] ffffc900181f1000 ffffc900181f0fff ffffc900181f1000 ffffffff81962c98
> [93700.363948] ffffc900181f0fff ffff881ffe8d8008 00003ffffffff000 ffffc900181f1000
> [93700.363949] ffffc900181f0fff ffffc900181f1000 0000000000000000 00000000557ec02c
> [93700.363949] ffff883fcac74534 0000000004000000 ffffc900181f0000 ffff883ffcc45d90
> [93700.363950] ffffffff811a97c1 ffff883ffcc45de8 ffffffff813a4954 0000000000000014
> [93700.363950] 0000000000000014 0000000000000000 00000001813a27ff ffff883fca557758
> [93700.363951] Call Trace:
> [93700.363960] <NMI> [<ffffffff8164791f>] dump_stack+0x19/0x1b
> [93700.363964] [<ffffffff8164105f>] panic+0xd8/0x214
> [93700.363968] [<ffffffff8111e871>] watchdog_overflow_callback+0xd1/0xe0
> [93700.363972] [<ffffffff81162091>] __perf_event_overflow+0xa1/0x250
> [93700.363974] [<ffffffff81162b64>] perf_event_overflow+0x14/0x20
> [93700.363977] [<ffffffff810327f8>] intel_pmu_handle_irq+0x1e8/0x470
> [93700.363992] [<ffffffff8130ab22>] ? put_dec+0x72/0x90
> [93700.363993] [<ffffffff8130bb13>] ? number.isra.2+0x323/0x360
> [93700.363995] [<ffffffff813029d1>] ? ioremap_page_range+0x241/0x320
> [93700.363999] [<ffffffff811a97c1>] ? unmap_kernel_range_noflush+0x11/0x20
> [93700.364003] [<ffffffff813a4954>] ? ghes_copy_tofrom_phys+0x124/0x210
> [93700.364005] [<ffffffff813a4ae0>] ? ghes_read_estatus+0xa0/0x190
> [93700.364009] [<ffffffff81650f9b>] perf_event_nmi_handler+0x2b/0x50
> [93700.364010] [<ffffffff81650619>] nmi_handle.isra.0+0x69/0xb0
> [93700.364012] [<ffffffff81650831>] do_nmi+0x1d1/0x410
> [93700.364013] [<ffffffff8164fa53>] end_repeat_nmi+0x1e/0x2e
> [93700.364027] [<ffffffffa0589ea6>] ? vmx_vcpu_run+0x666/0x750 [kvm_intel]
> [93700.364029] [<ffffffffa0589ea6>] ? vmx_vcpu_run+0x666/0x750 [kvm_intel]
> [93700.364031] [<ffffffffa0589ea6>] ? vmx_vcpu_run+0x666/0x750 [kvm_intel]
> [93700.364033] <<EOE>> [<ffffffff81658830>] ? uv_bau_message_intr1+0x80/0x80
> [93700.364035] [<ffffffffa05855f0>] ? vmx_inject_irq+0xf0/0xf0 [kvm_intel]
> [93700.364074] [<ffffffffa03ea569>] vcpu_enter_guest+0x639/0x1160 [kvm]
> [93700.364087] [<ffffffffa04103b5>] ? kvm_apic_local_deliver+0x65/0x70 [kvm]
> [93700.364095] [<ffffffffa03f2285>] kvm_arch_vcpu_ioctl_run+0xd5/0x440 [kvm]
> [93700.364100] [<ffffffffa03d9b51>] kvm_vcpu_ioctl+0x2b1/0x640 [kvm]
> [93700.364103] [<ffffffff810e7b72>] ? do_futex+0x122/0x5b0
> [93700.364105] [<ffffffff811fdf65>] do_vfs_ioctl+0x2e5/0x4c0
> [93700.364107] [<ffffffff81235833>] ? anon_inode_getfile+0xd3/0x170
> [93700.364113] [<ffffffffa03e7dc5>] ? kvm_on_user_return+0x75/0xb0 [kvm]
> [93700.364113] [<ffffffff811fe1e1>] SyS_ioctl+0xa1/0xc0
> [93700.364115] [<ffffffff81657c49>] system_call_fastpath+0x16/0x1b
> [93701.422730] Shutting down cpus with NMI
> [93701.608323] rlock even has been record!
> [93701.608324] kbox: notify die begin
> [93701.608324] kbox: no notify die func register. no need to notify
> [93720.962506] kbox rlock callback end!
>
>
> Regards,
> -Gonglei