Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu

David Matlack <dmatlack@xxxxxxxxxx> · Mon, 12 Dec 2022 16:09:28 -0800

On Mon, Dec 12, 2022 at 8:46 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Dec 12, 2022, Mingwei Zhang wrote:
> > On Fri, Dec 09, 2022, David Matlack wrote:
> > > On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> > > > Deprecate BUG() in pte_list_remove() in shadow mmu to avoid crashing a
> > > > physical machine. There are several reasons and motivations to do so:
> > > >
> > > > MMU bug is difficult to discover due to various racing conditions and
> > > > corner cases and thus it extremely hard to debug. The situation gets much
> > > > worse when it triggers the shutdown of a host. Host machine crash might
> > > > eliminates everything including the potential clues for debugging.
> > > >
> > > > From cloud computing service perspective, BUG() or BUG_ON() is probably no
> > > > longer appropriate as the host reliability is top priority. Crashing the
> > > > physical machine is almost never a good option as it eliminates innocent
> > > > VMs and cause service outage in a larger scope. Even worse, if attacker can
> > > > reliably triggers this code by diverting the control flow or corrupting the
> > > > memory, then this becomes vm-of-death attack. This is a huge attack vector
> > > > to cloud providers, as the death of one single host machine is not the end
> > > > of the story. Without manual interferences, a failed cloud job may be
> > > > dispatched to other hosts and continue host crashes until all of them are
> > > > dead.
> > >
> > > My only concern with using KVM_BUG() is whether the machine can keep
> > > running correctly after this warning has been hit. In other words, are
> > > we sure the damage is contained to just this VM?
>
> Hmm, good point.  The counter-argument is that KVM doesn't BUG() in get_mmio_spte()
> when a non-MMIO SPTE has reserved bits set, and as we've seen internally in multiple
> splats where the reserved bits appear to be set by stack overflow, that has a much,
> much higher probability of being a symptom of data corruption.
>
> That said, that's more of a reason to change get_mmio_spte() than it is to ignore
> potential data corruption in this case.  KVM arguably should kill the VM if
> get_mmio_spte() fails too.
>
> What about explicitly treating both get_mmio_spte() and this as potential data
> corruption?  E.g. something like this, and then use the DATA_CORRUPTION variant
> in pte_list_remove()?
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2055e04b8f89..1cb69c6d186b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4075,6 +4075,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
>                         pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
>                                sptes[level], level,
>                                get_rsvd_bits(rsvd_check, sptes[level], level));
> +               KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm);
>         }
>
>         return reserved;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f16c4689322b..5c4a06f66f46 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm)
>         unlikely(__ret);                                        \
>  })
>
> +#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm)                  \
> +({                                                             \
> +       int __ret = (cond);                                     \
> +                                                               \
> +       if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION))          \
> +               BUG_ON(__ret);                                  \
> +       else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))      \
> +               kvm_vm_bugged(kvm);                             \
> +       unlikely(__ret);                                        \
> +})
> +
>  static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
>  {
>  #ifdef CONFIG_PROVE_RCU

That sounds reasonable to me.