Re: Hit a hard lockup BUG while started VM with passthrough NIC

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 21 Sep 2017 07:35:20 +0000
"Gonglei (Arei)" <arei.gonglei@xxxxxxxxxx> wrote:

> Hi Paolo, Alex,
> 
> We hit a hard lockup bug in our tests while started VM with passthrough NIC.
> Unfortunately, we didn't get the available vmcore to dig into. And it is quite difficult
> To reproduce, actually, we only hit this problem once.
> Does anyone hit such problem ? Or any idea ?
> (For the complete log, please see the attachment).

This is not an upstream or a RHEL kernel, so I don't have the sources
to do much analysis.  Based on the kernel version number, I'm guessing
this is some derivative of a RHEL-7.2 kernel.  Can it be reproduced on
upstream (this is an upstream list)?  The ghes functions in the dump
might indicate might indicate a hardware error was triggered and the
firmware logs might provide more indication of the problem.  Thanks,

Alex

> [93137.701139] ------------[ cut here ]------------
> [93137.701146] WARNING: at kernel/workqueue.c:2131 process_one_work+0x2e8/0x470()
> [93137.701147] Modules linked in: ext4 jbd2 ixgbe(O) dev_connlimit(O) igb_uio(OE) uio bum(O) ip_set nfnetlink prio(O) nat(O) vport_vxlan(O) openvswitch(O) nf_defrag_ipv6 gre signo_catch(O) hotpatch(OE) kboxdriver(O) ipmi_devintf ipmi_si ipmi_msghandler kbox(O) mlx5_ib i40e(OE) ib_core ib_addr vxlan coretemp crc32_pclmul ip6_udp_tunnel crc32c_intel udp_tunnel ptp kvm_intel(O) ghash_clmulni_intel aesni_intel dca kvm(O) pps_core lrw gf128mul glue_helper sg ablk_helper cryptd pcspkr mlx5_core acpi_cpufreq shpchp acpi_power_meter acpi_pad remote_trigger(O) nf_conntrack_ipv4 nf_defrag_ipv4 vhost_net(O) tun(O) vhost(O) macvtap macvlan vfio_pci irqbypass vfio_iommu_type1 vfio xt_sctp nf_conntrack_proto_sctp nf_nat_proto_sctp nf_nat nf_conntrack sctp libcrc32c ip_tables ext3 mbcache jbd sd_mod crc_t10dif crct10dif_generic
> [93137.701171]  crct10dif_pclmul crct10dif_common mpt3sas raid_class ahci scsi_transport_sas libahci libata dm_mod [last unloaded: ixgbe]
> [93137.701178] CPU: 1 PID: 119983 Comm: kworker/0:7 Tainted: G        W  OE  ---- -------   3.10.0-327.55.58.94_29.x86_64 #1
> [93137.701179] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.19 06/22/2017
> [93137.701183]  0000000000000000 000000005abbce18 ffff880225cb7dd0 ffffffff8164791f
> [93137.701185]  ffff880225cb7e08 ffffffff8107b220 ffff88282066c4a8 ffff880868ed3c90
> [93137.701186]  ffff881ffec16080 ffff881ffec1c000 0000000000000000 ffff880225cb7e18
> [93137.701188]  ffffffff8107b36a ffff880225cb7e60 ffffffff8109de58 ffff881ffec16098
> [93137.701189]  ffff880868ed3c00 ffff881ffec16098 ffff880868ed3cc0 ffff880649b09980
> [93137.701190]  ffff880868ed3c90 ffff881ffec16080 ffff880225cb7ec0 ffffffff8109eabb
> [93137.701191]  ffff880225cb7fd8 0000000000016840 ffff880649b09980 ffff880649b09980
> [93137.701193]  ffff880649b09980 ffff8809065bfd38 ffff880868ed3c90 ffffffff8109e9a0
> [93137.701194]  0000000000000000 0000000000000000 ffff880225cb7f48 ffffffff810a61ff
> [93137.701195]  0000000000000000 ffff880225cb7f68 ffff880868ed3c90 ffff881f00000000
> [93137.701196]  ffff882900000000 ffff880225cb7ef8 ffff880225cb7ef8 ffff881f00000000
> [93137.701198]  ffff881f00000000 ffff880225cb7f18 ffff880225cb7f18 000000005abbce18
> [93137.701199]  ffffffff810a6130 0000000000000000 0000000000000000 ffff8809065bfd38
> [93137.701200]  ffffffff81657b98 0000000000000000 0000000000000000 0000000000000000
> [93137.701202]  0000000000000000 ffff8809065bfd38 ffffffff810a6130 0000000000000000
> [93137.701203]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [93137.701204]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [93137.701205]  ffffffffffffffff 0000000000000000 0000000000000010 0000000000000202
> [93137.701207]  ffff880225cb7f58 0000000000000018
> [93137.701208] Call Trace:
> [93137.701212]  [<ffffffff8164791f>] dump_stack+0x19/0x1b
> [93137.701215]  [<ffffffff8107b220>] warn_slowpath_common+0x70/0xb0
> [93137.701216]  [<ffffffff8107b36a>] warn_slowpath_null+0x1a/0x20
> [93137.701218]  [<ffffffff8109de58>] process_one_work+0x2e8/0x470
> [93137.701219]  [<ffffffff8109eabb>] worker_thread+0x11b/0x400
> [93137.701221]  [<ffffffff8109e9a0>] ? rescuer_thread+0x400/0x400
> [93137.701223]  [<ffffffff810a61ff>] kthread+0xcf/0xe0
> [93137.701224]  [<ffffffff810a6130>] ? kthread_create_on_node+0x140/0x140
> [93137.701227]  [<ffffffff81657b98>] ret_from_fork+0x58/0x90
> [93137.701228]  [<ffffffff810a6130>] ? kthread_create_on_node+0x140/0x140
> [93137.701229] ---[ end trace 22f458263368a87f ]---
> [93139.432279] ixgbe 0000:5e:00.1 eth7: VF Reset msg received from vf 9
> [93139.442787] ixgbe 0000:5e:00.1 eth7: VF 9 requested invalid api version 3
> [93139.444842] vfio-pci 0000:5e:12.3: irq 235 for MSI/MSI-X
> [93141.699745] vfio-pci 0000:5e:11.5: irq 237 for MSI/MSI-X
> [93141.699998] vfio-pci 0000:5e:11.5: irq 238 for MSI/MSI-X
> [93141.701442] vfio-pci 0000:5e:11.5: irq 237 for MSI/MSI-X
> [93141.701584] vfio-pci 0000:5e:11.5: irq 238 for MSI/MSI-X
> [93141.701720] vfio-pci 0000:5e:11.5: irq 242 for MSI/MSI-X
> [93142.279728] kvm [186763]: vcpu0 ignored wrmsr: 0x1c9 data 3
> [93142.279736] kvm [186763]: vcpu0 ignored wrmsr: 0x1a6 data 11
> [93142.279738] kvm [186763]: vcpu0 ignored wrmsr: 0x1a7 data 11
> [93142.279740] kvm [186763]: vcpu0 ignored wrmsr: 0x3f6 data 11
> [93142.279743] kvm [186763]: vcpu0 ignored wrmsr: 0x3f7 data 11
> [93145.421969] ixgbe 0000:5e:00.1 eth7: VF Reset msg received from vf 20
> [93145.432473] ixgbe 0000:5e:00.1 eth7: VF 20 requested invalid api version 3
> [93145.434743] vfio-pci 0000:5e:15.1: irq 243 for MSI/MSI-X
> [93156.975002] vfio-pci 0000:5e:11.7: irq 233 for MSI/MSI-X
> [93156.975280] vfio-pci 0000:5e:11.7: irq 245 for MSI/MSI-X
> [93156.977146] vfio-pci 0000:5e:11.7: irq 233 for MSI/MSI-X
> [93156.977386] vfio-pci 0000:5e:11.7: irq 245 for MSI/MSI-X
> [93156.977654] vfio-pci 0000:5e:11.7: irq 256 for MSI/MSI-X
> [93160.455435] vfio-pci 0000:5e:12.1: irq 252 for MSI/MSI-X
> [93160.455710] vfio-pci 0000:5e:12.1: irq 262 for MSI/MSI-X
> [93160.457794] vfio-pci 0000:5e:12.1: irq 252 for MSI/MSI-X
> [93160.458043] vfio-pci 0000:5e:12.1: irq 262 for MSI/MSI-X
> [93160.458177] vfio-pci 0000:5e:12.1: irq 268 for MSI/MSI-X
> [93171.874151] vfio-pci 0000:5e:11.1: irq 230 for MSI/MSI-X
> [93171.874258] vfio-pci 0000:5e:11.1: irq 271 for MSI/MSI-X
> [93171.875976] vfio-pci 0000:5e:11.1: irq 230 for MSI/MSI-X
> [93171.876078] vfio-pci 0000:5e:11.1: irq 271 for MSI/MSI-X
> [93171.876154] vfio-pci 0000:5e:11.1: irq 272 for MSI/MSI-X
> [93172.575386] vfio-pci 0000:5e:12.3: irq 235 for MSI/MSI-X
> [93172.575518] vfio-pci 0000:5e:12.3: irq 280 for MSI/MSI-X
> [93172.577258] vfio-pci 0000:5e:12.3: irq 235 for MSI/MSI-X
> [93172.577379] vfio-pci 0000:5e:12.3: irq 280 for MSI/MSI-X
> [93172.577450] vfio-pci 0000:5e:12.3: irq 281 for MSI/MSI-X
> [93185.691658] vfio-pci 0000:5e:15.1: irq 243 for MSI/MSI-X
> [93185.691783] vfio-pci 0000:5e:15.1: irq 282 for MSI/MSI-X
> [93185.693707] vfio-pci 0000:5e:15.1: irq 243 for MSI/MSI-X
> [93185.693805] vfio-pci 0000:5e:15.1: irq 282 for MSI/MSI-X
> [93185.693875] vfio-pci 0000:5e:15.1: irq 283 for MSI/MSI-X
> [93599.715862] [sched_delayed] sched: RT throttling activated
> [93692.404740] !!kbox:catch rlock on cpu 16!starting process!!
> [93692.405008] [kbox] num_online_cpus: 63, cur_cpu: 16
> [93692.405009] cpu 16 will send nmi, the cpumask: 0xfffffffffffefffe
> [93692.417396] smp_nmi_call_function is finished. wait: 1, time: -1, timeout: 996500, cpumask: 0x0
> [93692.419042] kbox rlock cpu 16 :sending stop and flush finished
> [93698.011186] CPU: 16 PID: 154635 Comm: CPU 3/KVM Tainted: G        W  OE  ---- -------   3.10.0-327.55.58.94_29.x86_64 #1
> [93698.011188] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.19 06/22/2017
> [93698.011190] task: ffff88280451cc80 ti: ffff88292b160000 task.ti: ffff88292b160000
> [93698.011216] RIP: 0010:[<ffffffffa0589ea6>]  [<ffffffffa0589ea6>] vmx_vcpu_run+0x666/0x750 [kvm_intel]
> [93698.011217] RSP: 0018:ffff88292b163cf0  EFLAGS: 00000082
> [93698.011218] RAX: 0000000080000202 RBX: ffff8825bffdb400 RCX: ffff8825bffdb400
> [93698.011218] RDX: 0000000000000200 RSI: ffff8825bffdb400 RDI: ffff8825bffdb400
> [93698.011218] RBP: ffff88292b163d50 R08: ffff7fffffffffff R09: 0000000000000000
> [93698.011219] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> [93698.011220] R13: 00007f7202374808 R14: 00007f71c2375000 R15: 000000000000007f
> [93698.011220] FS:  00007ff6f48d7700(0000) GS:ffff883ffca00000(0000) knlGS:ffff88013fcc0000
> [93698.011221] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [93698.011221] CR2: 00007f3903526000 CR3: 0000002808a78000 CR4: 00000000003427e0
> [93698.011222] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [93698.011222] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [93698.011222] Stack:
> [93698.011225]  0000000000000000 ffff8825bffdb400 0000000000000000 ffffffff816585c0
> [93698.011226]  00029516f44e1ac2 ffff8825bffdb400 00000000dde5f3e3 ffff8825bffdb400
> [93698.011226]  0000000000000000 0000000000000003 ffffffffa05855f0 0000000000000010
> [93698.011227]  ffff88292b163dd0 ffffffffa03ea569 000000ef00000000 ffff8825bfa62000
> [93698.011227]  ffff8825c2b2f2d8 ffff88292b163fd8 ffff88280451cc80 ffff88292cae0048
> [93698.011228]  0000000000000001 00000000dde5f3e3 ffffffffa04103b5 ffff8825bffdb400
> [93698.011229]  ffff88292b163fd8 ffff88280451cc80 ffff88292cae0048 0000000000000001
> [93698.011229]  ffff88292b163e18 ffffffffa03f2285 ffffffee7ffbfaff 00000000dde5f3e3
> [93698.011230]  ffff8825bffdb400 ffff8828115bef40 0000000000000000 ffff88279d5be680
> [93698.011231]  ffff88280451cc80 ffff88292b163eb0 ffffffffa03d9b51 0000000000000000
> [93698.011231]  0000000100000e40 0000000000000001 ffff882800f19b90 ffff88280451cc80
> [93698.011232]  ffff882800f19b98 ffff882800f19ba0 0000000000000000 0000000000000000
> [93698.011233]  0000000000000001 ffff8827ff05bc10 00000000dde5f3e3 ffff88279d5be680
> [93698.011233]  ffff883fcceff590 0000000000000000 0000000000000000 0000000000000001
> [93698.011234]  ffff88292b163f28 ffffffff811fdf65 00000000dde5f3e3 ffff8827ff05bc00
> [93698.011234]  0000000000000008 0000000000000002 ffff88292b163f10 ffffffffa03e7dc5
> [93698.011235]  0000000000000000 ffff88292b163f58 00000000dde5f3e3 ffff88279d5be680
> [93698.011236]  000000000000001e 000000000000ae80 0000000000000000 ffff88292b163f78
> [93698.011236]  ffffffff811fe1e1 0000000000000000 0000000102d4d010 00000000dde5f3e3
> [93698.011237]  0000000000000000 0000000002d4d010 000000000000ae80 000000000082b4e0
> [93698.011238]  0000000000000000 00007ff6f48d69f0 ffffffff81657c49 0000000000000246
> [93698.011238]  00007ff7012c7000 00000000000000ff 00000000008456f0 0000000000000010
> [93698.011239]  ffffffffffffffff 0000000000000000 000000000000ae80 000000000000001e
> [93698.011239]  0000000000000010 00007ff6fbdcb7b7 0000000000000033 0000000000000246
> [93698.011240] Call Trace:
> [93698.011246]  [<ffffffff816585c0>] ? irq_entries_start+0x300/0x400
> [93698.011249]  [<ffffffffa05855f0>] ? vmx_inject_irq+0xf0/0xf0 [kvm_intel]
> [93698.011289]  [<ffffffffa03ea569>] vcpu_enter_guest+0x639/0x1160 [kvm]
> [93698.011306]  [<ffffffffa04103b5>] ? kvm_apic_local_deliver+0x65/0x70 [kvm]
> [93698.011314]  [<ffffffffa03f2285>] kvm_arch_vcpu_ioctl_run+0xd5/0x440 [kvm]
> [93698.011319]  [<ffffffffa03d9b51>] kvm_vcpu_ioctl+0x2b1/0x640 [kvm]
> [93698.011322]  [<ffffffff811fdf65>] do_vfs_ioctl+0x2e5/0x4c0
> [93698.011328]  [<ffffffffa03e7dc5>] ? kvm_on_user_return+0x75/0xb0 [kvm]
> [93698.011329]  [<ffffffff811fe1e1>] SyS_ioctl+0xa1/0xc0
> [93698.011331]  [<ffffffff81657c49>] system_call_fastpath+0x16/0x1b
> [93698.011338] Code: 8b 80 3c 02 00 00 a8 10 0f 84 2f fa ff ff e9 12 ff ff ff 66 90 85 c0 0f 89 c1 fc ff ff 48 8b 5d a8 48 89 df e8 1c 2c e5 ff cd 02 <48> 89 df e8 32 2c e5 ff e9 a6 fc ff ff 49 89 f8 49 c1 e8 0d 41 
> [93698.011804] kbox:catch rlock!end process!
> [93698.014242] collected_len = 893933, LOG_BUF_LEN_LOCAL = 1048576
> [93700.363926] !!kbox:catch rlock on cpu 25!starting process!!
> [93700.363926] rlock:another rlock on another cpu 25
> [93700.363928] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 25
> [93700.363933] CPU: 25 PID: 155622 Comm: CPU 3/KVM Tainted: G        W  OE  ---- -------   3.10.0-327.55.58.94_29.x86_64 #1
> [93700.363934] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.19 06/22/2017
> [93700.363936]  ffffffff8187c808 000000001c178ae5 ffff883ffcc45ae8 ffffffff8164791f
> [93700.363937]  ffff883ffcc45b68 ffffffff8164105f 0000000000000010 ffff883ffcc45b78
> [93700.363938]  ffff883ffcc45b18 000000001c178ae5 0000000000000000 0000000000000019
> [93700.363938]  0000000000000003 0000000000000000 ffffffff81cc8bfc ffffffff81cff00d
> [93700.363939]  0000000000000019 0000000000000000 ffff883ffcc45c40 0000000000000000
> [93700.363940]  ffff883ffcc45b80 ffffffff8111e871 ffff883fd07944a8 ffff883ffcc45bf8
> [93700.363940]  ffffffff81162091 000000001c178ae5 ffff883fd07944a8 fffffff9f2462404
> [93700.363941]  00000000f2462404 0000000000000001 ffff883ffcc4bbe0 ffff883ffcc45bf0
> [93700.363941]  000000001c178ae5 0000000000000001 ffff883ffcc4b9e0 ffff883fd07944a8
> [93700.363942]  0000000000000021 ffff883ffcc4bbe0 ffff883ffcc45c08 ffffffff81162b64
> [93700.363943]  ffff883ffcc45e40 ffffffff810327f8 ffff883ffcc4c330 00000064fcc45c36
> [93700.363943]  ffff883ffcc45ef8 00000000fcc45c68 0000000200000000 ffffc90000000000
> [93700.363944]  ffffc90029494800 ffff883ffcc45d00 ffffffffa08689e9 0000000000000000
> [93700.363944]  ffff883ffcc45c88 ffffffff8130ab22 ffffc900294932f8 000000060dba31b0
> [93700.363945]  0000000005080021 ffffffff8130bb13 0000000000000000 0000000000000000
> [93700.363946]  0000003f00000000 3235ffff8130dbf4 000000000001ffff 0000000000000000
> [93700.363946]  0000000000000000 ffffc900294932d8 0000000000001528 ffff883fd44d9f80
> [93700.363947]  00000000000557ec ffff883ffcc45d90 ffffffff813029d1 ffffc900181f0fff
> [93700.363947]  ffffc900181f1000 ffffc900181f0fff ffffc900181f1000 ffffffff81962c98
> [93700.363948]  ffffc900181f0fff ffff881ffe8d8008 00003ffffffff000 ffffc900181f1000
> [93700.363949]  ffffc900181f0fff ffffc900181f1000 0000000000000000 00000000557ec02c
> [93700.363949]  ffff883fcac74534 0000000004000000 ffffc900181f0000 ffff883ffcc45d90
> [93700.363950]  ffffffff811a97c1 ffff883ffcc45de8 ffffffff813a4954 0000000000000014
> [93700.363950]  0000000000000014 0000000000000000 00000001813a27ff ffff883fca557758
> [93700.363951] Call Trace:
> [93700.363960]  <NMI>  [<ffffffff8164791f>] dump_stack+0x19/0x1b
> [93700.363964]  [<ffffffff8164105f>] panic+0xd8/0x214
> [93700.363968]  [<ffffffff8111e871>] watchdog_overflow_callback+0xd1/0xe0
> [93700.363972]  [<ffffffff81162091>] __perf_event_overflow+0xa1/0x250
> [93700.363974]  [<ffffffff81162b64>] perf_event_overflow+0x14/0x20
> [93700.363977]  [<ffffffff810327f8>] intel_pmu_handle_irq+0x1e8/0x470
> [93700.363992]  [<ffffffff8130ab22>] ? put_dec+0x72/0x90
> [93700.363993]  [<ffffffff8130bb13>] ? number.isra.2+0x323/0x360
> [93700.363995]  [<ffffffff813029d1>] ? ioremap_page_range+0x241/0x320
> [93700.363999]  [<ffffffff811a97c1>] ? unmap_kernel_range_noflush+0x11/0x20
> [93700.364003]  [<ffffffff813a4954>] ? ghes_copy_tofrom_phys+0x124/0x210
> [93700.364005]  [<ffffffff813a4ae0>] ? ghes_read_estatus+0xa0/0x190
> [93700.364009]  [<ffffffff81650f9b>] perf_event_nmi_handler+0x2b/0x50
> [93700.364010]  [<ffffffff81650619>] nmi_handle.isra.0+0x69/0xb0
> [93700.364012]  [<ffffffff81650831>] do_nmi+0x1d1/0x410
> [93700.364013]  [<ffffffff8164fa53>] end_repeat_nmi+0x1e/0x2e
> [93700.364027]  [<ffffffffa0589ea6>] ? vmx_vcpu_run+0x666/0x750 [kvm_intel]
> [93700.364029]  [<ffffffffa0589ea6>] ? vmx_vcpu_run+0x666/0x750 [kvm_intel]
> [93700.364031]  [<ffffffffa0589ea6>] ? vmx_vcpu_run+0x666/0x750 [kvm_intel]
> [93700.364033]  <<EOE>>  [<ffffffff81658830>] ? uv_bau_message_intr1+0x80/0x80
> [93700.364035]  [<ffffffffa05855f0>] ? vmx_inject_irq+0xf0/0xf0 [kvm_intel]
> [93700.364074]  [<ffffffffa03ea569>] vcpu_enter_guest+0x639/0x1160 [kvm]
> [93700.364087]  [<ffffffffa04103b5>] ? kvm_apic_local_deliver+0x65/0x70 [kvm]
> [93700.364095]  [<ffffffffa03f2285>] kvm_arch_vcpu_ioctl_run+0xd5/0x440 [kvm]
> [93700.364100]  [<ffffffffa03d9b51>] kvm_vcpu_ioctl+0x2b1/0x640 [kvm]
> [93700.364103]  [<ffffffff810e7b72>] ? do_futex+0x122/0x5b0
> [93700.364105]  [<ffffffff811fdf65>] do_vfs_ioctl+0x2e5/0x4c0
> [93700.364107]  [<ffffffff81235833>] ? anon_inode_getfile+0xd3/0x170
> [93700.364113]  [<ffffffffa03e7dc5>] ? kvm_on_user_return+0x75/0xb0 [kvm]
> [93700.364113]  [<ffffffff811fe1e1>] SyS_ioctl+0xa1/0xc0
> [93700.364115]  [<ffffffff81657c49>] system_call_fastpath+0x16/0x1b
> [93701.422730] Shutting down cpus with NMI
> [93701.608323] rlock even has been record!
> [93701.608324] kbox: notify die begin
> [93701.608324] kbox: no notify die func register. no need to notify
> [93720.962506] kbox rlock callback end!
> 
> 
> Regards,
> -Gonglei
> 
> 




[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux