On Mon, Dec 12, 2022, Mingwei Zhang wrote: > On Fri, Dec 09, 2022, David Matlack wrote: > > On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote: > > > Deprecate BUG() in pte_list_remove() in shadow mmu to avoid crashing a > > > physical machine. There are several reasons and motivations to do so: > > > > > > MMU bug is difficult to discover due to various racing conditions and > > > corner cases and thus it extremely hard to debug. The situation gets much > > > worse when it triggers the shutdown of a host. Host machine crash might > > > eliminates everything including the potential clues for debugging. > > > > > > From cloud computing service perspective, BUG() or BUG_ON() is probably no > > > longer appropriate as the host reliability is top priority. Crashing the > > > physical machine is almost never a good option as it eliminates innocent > > > VMs and cause service outage in a larger scope. Even worse, if attacker can > > > reliably triggers this code by diverting the control flow or corrupting the > > > memory, then this becomes vm-of-death attack. This is a huge attack vector > > > to cloud providers, as the death of one single host machine is not the end > > > of the story. Without manual interferences, a failed cloud job may be > > > dispatched to other hosts and continue host crashes until all of them are > > > dead. > > > > My only concern with using KVM_BUG() is whether the machine can keep > > running correctly after this warning has been hit. In other words, are > > we sure the damage is contained to just this VM? Hmm, good point. The counter-argument is that KVM doesn't BUG() in get_mmio_spte() when a non-MMIO SPTE has reserved bits set, and as we've seen internally in multiple splats where the reserved bits appear to be set by stack overflow, that has a much, much higher probability of being a symptom of data corruption. That said, that's more of a reason to change get_mmio_spte() than it is to ignore potential data corruption in this case. KVM arguably should kill the VM if get_mmio_spte() fails too. What about explicitly treating both get_mmio_spte() and this as potential data corruption? E.g. something like this, and then use the DATA_CORRUPTION variant in pte_list_remove()? diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 2055e04b8f89..1cb69c6d186b 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4075,6 +4075,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep) pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx", sptes[level], level, get_rsvd_bits(rsvd_check, sptes[level], level)); + KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm); } return reserved; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index f16c4689322b..5c4a06f66f46 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm) unlikely(__ret); \ }) +#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm) \ +({ \ + int __ret = (cond); \ + \ + if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION)) \ + BUG_ON(__ret); \ + else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged)) \ + kvm_vm_bugged(kvm); \ + unlikely(__ret); \ +}) + static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu) { #ifdef CONFIG_PROVE_RCU > > If, for example, the KVM_BUG() was triggered by a use-after-free, then > > there might be corrupted memory floating around in the machine. > > > > David, > > Your concern is quite reasonable. But given that both rmap and spte are > pointers/data structures managed by individual VMs, i.e., none of them > are global pointers, use-after-free is unlikely happening on cross-VM > cases. Being per-VM allocations doesn't change the behavior/impact of use-after-free. E.g. if there is no rmap found for a SPTE then there's a non-zero chance KVM has previously zapped the SPTE and freed the memory the SPTE pointed at, and thus KVM might be reading/writing memory that is now owned by something else in the kernel. > Even if there is, then shuting down those corrupted VMs is feasible > here, since pte_list_remove() basically does the checking. But the damage may already be done. And the KVM_REQ_VM_DEAD request wont't be recognized until the next vcpu_enter_enter_guest(), e.g. it won't prevent vCPUs (or even this vCPU) from processing more