hitting a BUG in kvm kernel module during live migration stress test

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

I'm running an x86_64 CentOS 7 kernel (3.10.0-327.28.3.7, slightly modified).  We have not modified the kvm module.
This machine is acting as an OpenStack compute node, using libvirt/qemu.   While doing a live-migration stress test
(where I just repeatedly live migrate instances between two compute nodes) I can reliably trigger the BUG below.

I realize this is CentOS and not mainline, but I was wondering if anyone had seen something like this and fixed it
in mainline.

Based on the Oops, the value of %rdi is 0x1af08000, which is obviously not a valid kernel pointer for x86_64.
I *think* it's choking while dereferencing "_spte_" in the second line of the for_each_rmap_spte() macro, which
is called by __rmap_clear_dirty().  However, I don't have the context to know why this would be corrupted.
Anyone have any ideas?

The initial assembly in __rmap_clear_dirty() looks like this.  The line at 0xffffffffa134b0ea is the one
corresponding to __rmap_clear_dirty+0x4a.  (This is from a different run, so the addresses are slightly different
from the Oops below.)


   0xffffffffa134b0a0:  nopl   0x0(%rax,%rax,1)
   0xffffffffa134b0a5:  push   %rbp
   0xffffffffa134b0a6:  mov    %rsp,%rbp
   0xffffffffa134b0a9:  push   %rbx
   0xffffffffa134b0aa:  sub    $0x18,%rsp
   0xffffffffa134b0ae:  mov    %gs:0x28,%rax
   0xffffffffa134b0b7:  mov    %rax,-0x10(%rbp)
   0xffffffffa134b0bb:  xor    %eax,%eax
   0xffffffffa134b0bd:  mov    (%rsi),%rdi		# rdi is *rmapp
   0xffffffffa134b0c0:  test   %rdi,%rdi		# "if (!rmap)" in rmap_get_first()
   0xffffffffa134b0c3:  je     0xffffffffa134b158
   0xffffffffa134b0c9:  test   $0x1,%dil		# "if (!(rmap & 1))" in rmap_get_first()
   0xffffffffa134b0cd:  je     0xffffffffa134b178
   0xffffffffa134b0d3:  and    $0xfffffffffffffffe,%rdi # "iter->desc = (struct pte_list_desc *)(rmap & ~1ul);" in rmap_get_first()
   0xffffffffa134b0d7:  movl   $0x0,-0x18(%rbp)
   0xffffffffa134b0de:  mov    %rdi,-0x20(%rbp)
   0xffffffffa134b0e2:  mov    (%rdi),%rdi		# "return iter->desc->sptes[iter->pos];" in rmap_get_first()
   0xffffffffa134b0e5:  test   %rdi,%rdi		# bail out if _spte_ is NULL?
   0xffffffffa134b0e8:  je     0xffffffffa134b158
   0xffffffffa134b0ea:  mov    (%rdi),%rdx		# calculating *_spte_ in the 2nd line of for_each_rmap_spte(), I think.  This chokes.
   0xffffffffa134b0ed:  test   $0x1,%dl
   0xffffffffa134b0f0:  je     0xffffffffa134b14f


Here's the BUG/Oops:

2016-10-06T23:57:39.264 controller-1 kernel: alert [260915.368163] BUG: unable to handle kernel paging request at 000000001af08000
2016-10-06T23:57:39.264 controller-1 kernel: alert [260915.376067] IP: [<ffffffffa12cb0ea>] __rmap_clear_dirty+0x4a/0xf0 [kvm]
2016-10-06T23:57:39.264 controller-1 kernel: warning [260915.383590] PGD 0 
2016-10-06T23:57:39.264 controller-1 kernel: warning [260915.385947] Oops: 0000 [#1] PREEMPT SMP 
2016-10-06T23:57:39.264 controller-1 kernel: warning [260915.390461] Modules linked in: target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio iptable_raw xt_CHECKSUM xt_connmark iptable_mangle nbd ebtable_filter ebtables igb_uio(OE) uio drbd(OE) libcrc32c nf_log_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_conntrack iptable_filter xt_nat xt_comment xt_multiport iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack virtio_net veth nfsv3 nfs fscache 8021q garp stp mrp llc cls_u32 sch_sfq sch_htb dm_mod iTCO_wdt iTCO_vendor_support ipmi_devintf intel_powerclamp coretemp kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper lrw gf128mul ablk_helper cryptd lpc_ich mfd_core mei_me mei i2c_i801 ipmi_si ipmi_msghandler acpi_power_meter wrs_avp(OE) nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ex
t4 jbd2 sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel ixgbe(OE) ahci libahci libata igb i2c_algo_bit i2c_core dca i40e(OE) vxlan ip6_udp_tunnel udp_tunnel ptp pps_core
2016-10-06T23:57:39.264 controller-1 kernel: warning [260915.503652] CPU: 0 PID: 134880 Comm: qemu-kvm Tainted: G           OE  ------------   3.10.0-327.28.3.7.tis.x86_64 #1
2016-10-06T23:57:39.264 controller-1 kernel: warning [260915.515605] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
2016-10-06T23:57:39.264 controller-1 kernel: warning [260915.527559] task: ffff880334019420 ti: ffff8803540cc000 task.ti: ffff8803540cc000
2016-10-06T23:57:39.264 controller-1 kernel: warning [260915.536018] RIP: 0010:[<ffffffffa12cb0ea>]  [<ffffffffa12cb0ea>] __rmap_clear_dirty+0x4a/0xf0 [kvm]
2016-10-06T23:57:39.264 controller-1 kernel: warning [260915.546264] RSP: 0018:ffff8803540cfbb8  EFLAGS: 00010206
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.552286] RAX: 0000000000000000 RBX: ffff8803572ac000 RCX: 0000000000000000
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.560357] RDX: 0000000000000ef6 RSI: ffffc9002bab1ea0 RDI: 000000001af08000
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568432] RBP: ffff8803540cfbd8 R08: 0000000000000000 R09: 0000000000000001
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568433] R10: 000000000000bbe8 R11: 0000000000000001 R12: 0000000000000000
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568433] R13: ffffffffa12cb0a0 R14: ffff8803540cffd8 R15: 0000000000000000
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568435] FS:  00007fb05e1ff700(0000) GS:ffff88103f200000(0000) knlGS:0000000000000000
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568436] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568436] CR2: 000000001af08000 CR3: 0000000359e03000 CR4: 00000000003427f0
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568437] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568438] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568439] Stack:
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568446]  0000000000000000 ffff8803540cfbd8 000000004c224769 ffff8803572ac000
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568452]  ffff8803540cfc60 ffffffffa12ca9c8 000000008108f82a ffff8803540cfd60
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568458]  0000000000000100 000000000001ffff 0000000100000001 00000000000038d4
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568459] Call Trace:
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568481]  [<ffffffffa12ca9c8>] slot_handle_level_range+0x98/0x110 [kvm]
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568495]  [<ffffffffa12cbc04>] kvm_mmu_slot_leaf_clear_dirty+0x54/0x80 [kvm]
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568502]  [<ffffffffa14901b7>] vmx_slot_enable_log_dirty+0x17/0x30 [kvm_intel]
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568515]  [<ffffffffa12c7778>] kvm_arch_commit_memory_region+0x58/0xe0 [kvm]
2016-10-06T23:57:39.282 controller-1 kernel: warning [260915.568525]  [<ffffffffa12af9eb>] __kvm_set_memory_region+0x60b/0x980 [kvm]
2016-10-06T23:57:39.283 controller-1 kernel: warning [260915.568536]  [<ffffffffa12afd8b>] kvm_set_memory_region+0x2b/0x40 [kvm]
2016-10-06T23:57:39.283 controller-1 kernel: warning [260915.568547]  [<ffffffffa12b02e4>] kvm_vm_ioctl+0x544/0x760 [kvm]
2016-10-06T23:57:39.283 controller-1 kernel: warning [260915.568553]  [<ffffffff811b6295>] do_vfs_ioctl+0x2e5/0x4c0
2016-10-06T23:57:39.283 controller-1 kernel: warning [260915.568558]  [<ffffffff8104b040>] ? __do_page_fault+0x1a0/0x3a0
2016-10-06T23:57:39.283 controller-1 kernel: warning [260915.568562]  [<ffffffff811b6511>] SyS_ioctl+0xa1/0xc0
2016-10-06T23:57:39.283 controller-1 kernel: warning [260915.568566]  [<ffffffff81678f89>] system_call_fastpath+0x16/0x1b
2016-10-06T23:57:39.283 controller-1 kernel: warning [260915.568588] Code: 3e 48 85 ff 0f 84 8f 00 00 00 40 f6 c7 01 0f 84 a5 00 00 00 48 83 e7 fe c7 45 e8 00 00 00 00 48 89 7d e0 48 8b 3f 48 85 ff 74 6e <48> 8b 17 f6 c2 01 74 5d 48 8b 05 37 ae 03 00 48 89 d1 31 db 48 
2016-10-06T23:57:39.283 controller-1 kernel: alert [260915.568600] RIP  [<ffffffffa12cb0ea>] __rmap_clear_dirty+0x4a/0xf0 [kvm]
2016-10-06T23:57:39.283 controller-1 kernel: warning [260915.568601]  RSP <ffff8803540cfbb8>
2016-10-06T23:57:39.283 controller-1 kernel: warning [260915.568601] CR2: 000000001af08000
2016-10-06T23:57:39.284 controller-1 kernel: warning [260915.569212] ---[ end trace 7c95db025611e651 ]---
2016-10-06T23:57:39.284 controller-1 kernel: info [260915.627598] note: qemu-kvm[134880] exited with preempt_count 1


Thanks,
Chris

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux