On Fri, Dec 09, 2022, David Matlack wrote:
> On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> > Deprecate BUG() in pte_list_remove() in the shadow MMU to avoid
> > crashing the physical machine. There are several reasons and
> > motivations for doing so:
> >
> > An MMU bug is difficult to discover due to the various race conditions
> > and corner cases involved, which makes it extremely hard to debug. The
> > situation gets much worse when the bug shuts down the host: a host
> > crash may destroy everything, including any potential clues for
> > debugging.
> >
> > From a cloud computing service perspective, BUG() or BUG_ON() is
> > probably no longer appropriate, as host reliability is the top
> > priority. Crashing the physical machine is almost never a good option:
> > it takes down innocent VMs and causes a service outage of much larger
> > scope. Even worse, if an attacker can reliably trigger this code by
> > diverting control flow or corrupting memory, it becomes a VM-of-death
> > attack. This is a huge attack vector for cloud providers, as the death
> > of a single host machine is not the end of the story: without manual
> > intervention, a failed cloud job may be dispatched to other hosts and
> > keep crashing them until all of them are dead.
>
> My only concern with using KVM_BUG() is whether the machine can keep
> running correctly after this warning has been hit. In other words, are
> we sure the damage is contained to just this VM?
>
> If, for example, the KVM_BUG() was triggered by a use-after-free, then
> there might be corrupted memory floating around in the machine.

David,

Your concern is quite reasonable. But given that both the rmap and the
spte are pointers/data structures managed by individual VMs, i.e., none
of them are global, a use-after-free is unlikely to cross VM boundaries.
Even if it did, shutting down the corrupted VMs is feasible here, since
pte_list_remove() is exactly where the checking happens (see the sketch
before the logs below).

> What are some instances where we've seen these BUG_ON()s get triggered?
> For those instances, would it actually be safe to just kill the current
> VM and keep the rest of the machine running?
>
> >
> > For the above reason, we propose the replacement of BUG() in
> > pte_list_remove() with KVM_BUG() to crash just the VM itself.
>
> How did you test this series?

I used a simple test case to test the series:

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0f6455072055..d4b993b26b96 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -701,7 +701,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		if (fault->nx_huge_page_workaround_enabled)
 			disallowed_hugepage_adjust(fault, *it.sptep, it.level);
 
-		base_gfn = fault->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
+		base_gfn = fault->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1) - 1;
 		if (it.level == fault->goal_level)
 			break;

On the testing machine, I launched an L1 VM and an L2 VM inside it. The
L2 VM triggers the injected bug in the shadow MMU, and the L0 kernel
logged the errors quoted at the end of this mail. L1 and L2 hang with
high CPU usage for a while, and after a couple of seconds the L1 VM dies
properly. The machine itself stays alive, and subsequent VM operations
(launch/kill) all work.
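For concreteness, the shape of the change in pte_list_remove() is
roughly the following (a simplified sketch, not the literal patch: the
struct kvm plumbing and the exact messages are illustrative, and the
pte_list_desc walk is trimmed):

static void pte_list_remove(struct kvm *kvm, u64 *spte,
			    struct kvm_rmap_head *rmap_head)
{
	/*
	 * Empty rmap head: this spte was never tracked, i.e. the rmap
	 * is corrupted.  Flag just this VM instead of killing the host.
	 */
	if (KVM_BUG(!rmap_head->val, kvm, "no rmap for %p\n", spte))
		return;

	if (!(rmap_head->val & 1)) {
		/* Single-spte encoding: the stored pointer must match. */
		if (KVM_BUG((u64 *)rmap_head->val != spte, kvm,
			    "rmap/spte mismatch for %p\n", spte))
			return;
		rmap_head->val = 0;
		return;
	}

	/*
	 * The many->many case (the pte_list_desc walk) is handled the
	 * same way: each BUG() becomes a KVM_BUG() plus an early return.
	 */
}

The L0 dmesg from the test run: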
[ 1678.043378] ------------[ cut here ]------------
[ 1678.043381] gfn mismatch under direct page 1041bf (expected 10437e, got 1043be)
[ 1678.043386] WARNING: CPU: 4 PID: 23430 at arch/x86/kvm/mmu/mmu.c:737 kvm_mmu_page_set_translation+0x131/0x140
[ 1678.043395] Modules linked in: kvm_intel vfat fat i2c_mux_pca954x i2c_mux spidev cdc_acm xhci_pci xhci_hcd sha3_generic gq(O)
[ 1678.043404] CPU: 4 PID: 23430 Comm: VCPU-7 Tainted: G S O 6.1.0-smp-DEV #5
[ 1678.043406] Hardware name: Google LLC Indus/Indus_QC_02, BIOS 30.12.6 02/14/2022
[ 1678.043407] RIP: 0010:kvm_mmu_page_set_translation+0x131/0x140
[ 1678.043411] Code: 0f 44 e0 4c 8b 6b 28 48 89 df 44 89 f6 e8 b7 fb ff ff 48 c7 c7 1b 5a 2f 82 4c 89 e6 4c 89 ea 48 89 c1 4d 89 f8 e8 9f 39 0c 00 <0f> 0b eb ac 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
[ 1678.043413] RSP: 0018:ffff88811ba87918 EFLAGS: 00010246
[ 1678.043415] RAX: 1bdd851636664d00 RBX: ffff888118602e60 RCX: 0000000000000027
[ 1678.043416] RDX: 0000000000000002 RSI: c0000000ffff7fff RDI: ffff8897e0320488
[ 1678.043417] RBP: ffff88811ba87940 R08: 0000000000000000 R09: ffffffff82b2e6f0
[ 1678.043418] R10: 00000000ffff7fff R11: 0000000000000000 R12: ffffffff822e89da
[ 1678.043419] R13: 00000000001041bf R14: 00000000000001bf R15: 00000000001043be
[ 1678.043421] FS:  00007fee198ec700(0000) GS:ffff8897e0300000(0000) knlGS:0000000000000000
[ 1678.043422] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1678.043424] CR2: 0000000000000000 CR3: 0000001857c34005 CR4: 00000000003726e0
[ 1678.043425] Call Trace:
[ 1678.043426]  <TASK>
[ 1678.043428]  __rmap_add+0x8a/0x270
[ 1678.043432]  mmu_set_spte+0x250/0x340
[ 1678.043435]  ept_fetch+0x8ad/0xc00
[ 1678.043437]  ept_page_fault+0x265/0x2f0
[ 1678.043440]  kvm_mmu_page_fault+0xfa/0x2d0
[ 1678.043443]  handle_ept_violation+0x135/0x2e0 [kvm_intel]
[ 1678.043455]  ? handle_desc+0x20/0x20 [kvm_intel]
[ 1678.043462]  __vmx_handle_exit+0x1c3/0x480 [kvm_intel]
[ 1678.043468]  vmx_handle_exit+0x12/0x40 [kvm_intel]
[ 1678.043474]  vcpu_enter_guest+0xbb3/0xf80
[ 1678.043477]  ? complete_fast_pio_in+0xcc/0x160
[ 1678.043480]  kvm_arch_vcpu_ioctl_run+0x3b0/0x770
[ 1678.043481]  kvm_vcpu_ioctl+0x52d/0x610
[ 1678.043486]  ? kvm_on_user_return+0x46/0xd0
[ 1678.043489]  __se_sys_ioctl+0x77/0xc0
[ 1678.043492]  __x64_sys_ioctl+0x1d/0x20
[ 1678.043493]  do_syscall_64+0x3d/0x80
[ 1678.043497]  ? sysvec_apic_timer_interrupt+0x49/0x90
[ 1678.043499]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1678.043501] RIP: 0033:0x7fee3ebf0347
[ 1678.043503] Code: 5d c3 cc 48 8b 05 f9 2f 07 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 cc cc cc cc cc cc cc cc cc cc b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 2f 07 00 f7 d8 64 89 01 48
[ 1678.043505] RSP: 002b:00007fee198e8998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1678.043507] RAX: ffffffffffffffda RBX: 0000555308e7e4d0 RCX: 00007fee3ebf0347
[ 1678.043507] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 00000000000000b0
[ 1678.043508] RBP: 00007fee198e89c0 R08: 000055530943d920 R09: 00000000000003fa
[ 1678.043509] R10: 0000555307349b00 R11: 0000000000000246 R12: 00000000000000b0
[ 1678.043510] R13: 00005574c1a1de88 R14: 00007fee198e8a27 R15: 0000000000000000
[ 1678.043511]  </TASK>
[ 1678.043512] ---[ end trace 0000000000000000 ]---
[ 5313.657064] ------------[ cut here ]------------
[ 5313.657067] no rmap for 0000000071a2f138 (many->many)
[ 5313.657071] WARNING: CPU: 43 PID: 23398 at arch/x86/kvm/mmu/mmu.c:983 pte_list_remove+0x17a/0x190
[ 5313.657080] Modules linked in: kvm_intel vfat fat i2c_mux_pca954x i2c_mux spidev cdc_acm xhci_pci xhci_hcd sha3_generic gq(O)
[ 5313.657088] CPU: 43 PID: 23398 Comm: kvm-nx-lpage-re Tainted: G S W O 6.1.0-smp-DEV #5
[ 5313.657090] Hardware name: Google LLC Indus/Indus_QC_02, BIOS 30.12.6 02/14/2022
[ 5313.657092] RIP: 0010:pte_list_remove+0x17a/0x190
[ 5313.657095] Code: cf e4 01 01 48 c7 c7 4d 3c 32 82 e8 70 5e 0c 00 0f 0b e9 0a ff ff ff c6 05 d4 cf e4 01 01 48 c7 c7 9e de 33 82 e8 56 5e 0c 00 <0f> 0b 84 db 75 c8 e9 ec fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
[ 5313.657097] RSP: 0018:ffff88986d5d3c30 EFLAGS: 00010246
[ 5313.657099] RAX: 1ebf71ba511d3100 RBX: 0000000000000000 RCX: 0000000000000027
[ 5313.657101] RDX: 0000000000000002 RSI: c0000000ffff7fff RDI: ffff88afdf3e0488
[ 5313.657102] RBP: ffff88986d5d3c40 R08: 0000000000000000 R09: ffffffff82b2e6f0
[ 5313.657104] R10: 00000000ffff7fff R11: 40000000ffff8a28 R12: 0000000000000000
[ 5313.657105] R13: ffff888118602000 R14: ffffc90020e1e000 R15: ffff88815df33030
[ 5313.657106] FS:  0000000000000000(0000) GS:ffff88afdf3c0000(0000) knlGS:0000000000000000
[ 5313.657107] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5313.657109] CR2: 000017c92b50f1b8 CR3: 000000006f40a001 CR4: 00000000003726e0
[ 5313.657110] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5313.657111] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5313.657112] Call Trace:
[ 5313.657113]  <TASK>
[ 5313.657114]  drop_spte+0x175/0x180
[ 5313.657117]  mmu_page_zap_pte+0xfd/0x130
[ 5313.657119]  __kvm_mmu_prepare_zap_page+0x290/0x6e0
[ 5313.657122]  ? newidle_balance+0x228/0x3b0
[ 5313.657126]  kvm_nx_huge_page_recovery_worker+0x266/0x360
[ 5313.657129]  kvm_vm_worker_thread+0x93/0x150
[ 5313.657134]  ? kvm_mmu_post_init_vm+0x40/0x40
[ 5313.657136]  ? kvm_vm_create_worker_thread+0x120/0x120
[ 5313.657139]  kthread+0x10d/0x120
[ 5313.657141]  ? kthread_blkcg+0x30/0x30
[ 5313.657142]  ret_from_fork+0x1f/0x30
[ 5313.657156]  </TASK>
[ 5313.657156] ---[ end trace 0000000000000000 ]---
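For reference, the reason the host survives is that KVM_BUG() only
WARNs and then marks the one affected VM as bugged, kicking its vCPUs
out to userspace. Paraphrasing include/linux/kvm_host.h from memory
(check the tree for the exact definitions):

#define KVM_BUG(cond, kvm, fmt...)				\
({								\
	int __ret = (cond);					\
								\
	if (WARN_ONCE(__ret && !(kvm)->vm_bugged, fmt))		\
		kvm_vm_bugged(kvm);				\
	unlikely(__ret);					\
})

static inline void kvm_vm_bugged(struct kvm *kvm)
{
	kvm->vm_bugged = true;
	kvm_vm_dead(kvm);	/* KVM_REQ_VM_DEAD: kick every vCPU */
}

Nothing host-global is touched, so only the flagged VM is torn down.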