On Mon, Dec 12, 2022 at 8:46 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > On Mon, Dec 12, 2022, Mingwei Zhang wrote: > > On Fri, Dec 09, 2022, David Matlack wrote: > > > On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote: > > > > Deprecate BUG() in pte_list_remove() in shadow mmu to avoid crashing a > > > > physical machine. There are several reasons and motivations to do so: > > > > > > > > MMU bug is difficult to discover due to various racing conditions and > > > > corner cases and thus it extremely hard to debug. The situation gets much > > > > worse when it triggers the shutdown of a host. Host machine crash might > > > > eliminates everything including the potential clues for debugging. > > > > > > > > From cloud computing service perspective, BUG() or BUG_ON() is probably no > > > > longer appropriate as the host reliability is top priority. Crashing the > > > > physical machine is almost never a good option as it eliminates innocent > > > > VMs and cause service outage in a larger scope. Even worse, if attacker can > > > > reliably triggers this code by diverting the control flow or corrupting the > > > > memory, then this becomes vm-of-death attack. This is a huge attack vector > > > > to cloud providers, as the death of one single host machine is not the end > > > > of the story. Without manual interferences, a failed cloud job may be > > > > dispatched to other hosts and continue host crashes until all of them are > > > > dead. > > > > > > My only concern with using KVM_BUG() is whether the machine can keep > > > running correctly after this warning has been hit. In other words, are > > > we sure the damage is contained to just this VM? > > Hmm, good point. The counter-argument is that KVM doesn't BUG() in get_mmio_spte() > when a non-MMIO SPTE has reserved bits set, and as we've seen internally in multiple > splats where the reserved bits appear to be set by stack overflow, that has a much, > much higher probability of being a symptom of data corruption. > > That said, that's more of a reason to change get_mmio_spte() than it is to ignore > potential data corruption in this case. KVM arguably should kill the VM if > get_mmio_spte() fails too. > > What about explicitly treating both get_mmio_spte() and this as potential data > corruption? E.g. something like this, and then use the DATA_CORRUPTION variant > in pte_list_remove()? > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 2055e04b8f89..1cb69c6d186b 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4075,6 +4075,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep) > pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx", > sptes[level], level, > get_rsvd_bits(rsvd_check, sptes[level], level)); > + KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm); > } > > return reserved; > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index f16c4689322b..5c4a06f66f46 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm) > unlikely(__ret); \ > }) > > +#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm) \ > +({ \ > + int __ret = (cond); \ > + \ > + if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION)) \ > + BUG_ON(__ret); \ > + else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged)) \ > + kvm_vm_bugged(kvm); \ > + unlikely(__ret); \ > +}) > + > static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu) > { > #ifdef CONFIG_PROVE_RCU That sounds reasonable to me.