[RFC/PATCH 2] kvm: x86: handle KVM_SET_VCPU_EVENTS/KVM_VCPUEVENT_VALID_SMM properly

"Herongguang (Stephen)" <herongguang.he@xxxxxxxxxx> · Thu, 22 Sep 2016 15:51:25 +0800

After making memory consistent between source and destination (https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg03069.html), there can
still reproduce instruction emulation failure in destination side if migration when VM’s in grub stage:

[2016-09-15 06:29:24] monitor_qapi_event_emit:478 {"timestamp": {"seconds": 1473892164, "microseconds": 99652}, "event": "RESUME"}
KVM internal error. Suberror: 1
emulation failure
EAX=000000b5 EBX=00008fc6 ECX=00005678 EDX=00000000
ESI=000f254b EDI=00000000 EBP=0000f958 ESP=000ee958
EIP=00008000 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 ffffffff 00809300
CS =a000 000a0000 ffffffff 00809300
SS =0000 00000000 ffffffff 00809300
DS =0000 00000000 ffffffff 00809300
FS =0000 00000000 ffffffff 00809300
GS =0000 00000000 ffffffff 00809300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT=     000f71a0 00000037
IDT=     00000000 00000000
CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff <ff> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[2016-09-15 06:29:24] virtio_set_status:712 virtio-blk device status is 7 that means DRIVER OK
[2016-09-15 06:29:24] monitor_qapi_event_emit:478 {"timestamp": {"seconds": 1473892164, "microseconds": 106738}, "event": "STOP"}


Note CS+EIP=0xA8000, which is within SMRAM, however SMM=0 is confusing! So I found that in kvm_put_vcpu_events, events.flags(KVM_VCPUEVENT_VALID_SMM) is
overwritten by setting events.flags to 0, which is obvious mistaken.

So it comes to patch 3. After patch 3, however it results kvm-kmod crash:
[69328.761479] kvm [51353]: vcpu7, guest rIP: 0xfd31c unhandled wrmsr: 0xd1 data 0
[69337.406083] BUG: unable to handle kernel NULL pointer dereference at           (null)
[69337.414193] IP: [<ffffffffa056a59d>] gfn_to_rmap+0xcd/0xe0 [kvm]
[69337.420357] PGD 0
[69337.422514] Oops: 0000 [#1] SMP
[69337.425783] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nfsd auth_rpcgss nfs_acl lockd grace sunrpc xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat xt_conntrack ipt_REJECT nf_reject_ipv4 bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c ipmi_devintf ipmi_si ipmi_msghandler intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw kvm_intel gf128mul iTCO_wdt kvm glue_helper iTCO_vendor_support ablk_helper cryptd pcspkr sg mei_me mei ioatdma shpchp lpc_ich wmi mfd_core i2c_i801 vhost_net tun vhost macvtap macvlan ip_tables ext4 jbd2 mbcache sr_mod sd_mod cdrom isci libsas igb ahci libahci scsi_transport_sas crc32c_intel libata ptp pps_core serio_raw i2c_algo_bit megaraid_sas i2c_co!
re dca d
m_mod vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio
[69337.524020] CPU: 3 PID: 51375 Comm: CPU 0/KVM Not tainted 4.7.0 #1
[69337.530315] Hardware name: Huawei Technologies Co., Ltd. Tecal RH2288 V2-8S/BC11SRSB1, BIOS RMIBV365 09/06/2013
[69337.540644] task: ffff88104bcc8000 ti: ffff881055650000 task.ti: ffff881055650000
[69337.548382] RIP: 0010:[<ffffffffa056a59d>]  [<ffffffffa056a59d>] gfn_to_rmap+0xcd/0xe0 [kvm]
[69337.557105] RSP: 0018:ffff881055653a88  EFLAGS: 00010202
[69337.562541] RAX: ffffc9000b3cf428 RBX: ffff88085624a538 RCX: 000000000000000c
[69337.569788] RDX: ffff88083a6e2bd0 RSI: 00000000000000a7 RDI: 00000000000000a0
[69337.577036] RBP: ffff881055653a88 R08: 000000000000000c R09: ffffc9000b3cf008
[69337.584283] R10: ffffc9000b3cf000 R11: 0000000000000037 R12: ffff8810569d0000
[69337.591530] R13: 0000000000000000 R14: 00000000000000a7 R15: ffff8808d624a538
[69337.598777] FS:  00007faeb900d700(0000) GS:ffff88085e0c0000(0000) knlGS:0000000000000000
[69337.607116] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[69337.612978] CR2: 0000000000000000 CR3: 000000104c203000 CR4: 00000000000426e0
[69337.620225] Stack:
[69337.622369]  ffff881055653af0 ffffffffa056b28d ffffffff00000000 0000000000000001
[69337.630059]  0000000000000001 00007fadb50a7000 0100000000000000 ffff88083a6e2bd0
[69337.637751]  ffff8810569d0000 0000000000000001 000ffffffffff000 ffff880000000000
[69337.645442] Call Trace:
[69337.648029]  [<ffffffffa056b28d>] mmu_set_spte+0x19d/0x280 [kvm]
[69337.654166]  [<ffffffffa056f56b>] __direct_map.part.122+0x19b/0x210 [kvm]
[69337.661078]  [<ffffffffa056fa29>] tdp_page_fault+0x239/0x260 [kvm]
[69337.667385]  [<ffffffffa0568bf0>] kvm_mmu_page_fault+0x60/0xf0 [kvm]
[69337.673856]  [<ffffffffa03f8656>] handle_ept_violation+0x96/0x190 [kvm_intel]
[69337.681106]  [<ffffffffa03ff740>] vmx_handle_exit+0x1f0/0xc30 [kvm_intel]
[69337.688019]  [<ffffffffa055d9ce>] vcpu_enter_guest+0x90e/0x1190 [kvm]
[69337.694585]  [<ffffffffa056444d>] kvm_arch_vcpu_ioctl_run+0xcd/0x3f0 [kvm]
[69337.701581]  [<ffffffffa0549fb5>] kvm_vcpu_ioctl+0x295/0x610 [kvm]
[69337.707880]  [<ffffffff8110b6cc>] ? do_futex+0x11c/0x530
[69337.713314]  [<ffffffff8122e436>] do_vfs_ioctl+0xa6/0x5c0
[69337.718833]  [<ffffffff81003276>] ? do_audit_syscall_entry+0x66/0x70
[69337.725303]  [<ffffffff810037cf>] ? syscall_trace_enter_phase1+0x11f/0x140
[69337.732291]  [<ffffffff8122e9c9>] SyS_ioctl+0x79/0x90
[69337.737470]  [<ffffffff81003b12>] do_syscall_64+0x62/0x110
[69337.743085]  [<ffffffff816c88a1>] entry_SYSCALL64_slow_path+0x25/0x25
[69337.749641] Code: 8b 38 0f b6 52 28 5d 83 e2 0f 83 ea 01 8d 0c d2 48 63 d2 48 8b 44 d0 18 48 d3 ee 48 d3 ef 48 29 fe 48 8d 04 f0 c3 8d 48 01 eb 80 <48> 8b 3c 25 00 00 00 00 31 c0 eb cb 0f 1f 80 00 00 00 00 66 66
[69337.769707] RIP  [<ffffffffa056a59d>] gfn_to_rmap+0xcd/0xe0 [kvm]
[69337.775938]  RSP <ffff881055653a88>
[69337.779552] CR2: 0000000000000000
[69337.783467] ---[ end trace 5350f10b8de91e83 ]---
[69337.843957] Kernel panic - not syncing: Fatal exception
[69337.849430] Kernel Offset: disabled

I found that when kernel crashes, in tdp_page_fault ->__direct_map ->mmu_set_spte ->rmap_add->gfn_to_rmap-> __gfn_to_memslot &__gfn_to_rmap,
in __gfn_to_memslot returned NULL, and __gfn_to_rmap uses this NULL pointer (slot) without checking, which is bad.


After some investigation, I found when kernel crashes, gfn is 0xA7, in SMRAM region again. And at this time, kvm_memslots_for_spte_role is false,
however actually (vcpu->arch.hflags & HF_SMM_MASK) is true!

So I think there is some lacking in kvm-kmod’s kvm_vcpu_ioctl_x86_set_vcpu_events that handles KVM_VCPUEVENT_VALID_SMM. I tried following patch,
it seems works fine.

Do you think this patch is appropriate or not enough? Thanks.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 19f9f9e..f39e839 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3013,8 +3013,10 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
                vcpu->arch.apic->sipi_vector = events->sipi_vector;

        if (events->flags & KVM_VCPUEVENT_VALID_SMM) {
-               if (events->smi.smm)
+               if (events->smi.smm) {
                        vcpu->arch.hflags |= HF_SMM_MASK;
+                       kvm_mmu_reset_context(vcpu);
+               }
                else
                        vcpu->arch.hflags &= ~HF_SMM_MASK;
                vcpu->arch.smi_pending = events->smi.pending;


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html