Re: [PATCH] KVM: VMX: Enable Notify VM exit

Andy Lutomirski <luto@xxxxxxxxxx> · Mon, 2 Nov 2020 10:01:16 -0800

On Mon, Nov 2, 2020 at 9:31 AM Sean Christopherson
<sean.j.christopherson@xxxxxxxxx> wrote:
>
> On Mon, Nov 02, 2020 at 08:43:30AM -0800, Andy Lutomirski wrote:
> > On Sun, Nov 1, 2020 at 10:14 PM Tao Xu <tao3.xu@xxxxxxxxx> wrote:
> > > 2. Another patch to disable interception of #DB and #AC when notify
> > > VM-Exiting is enabled.
> >
> > Whoa there.
> >
> > A VM control that says "hey, CPU, if you messed up and livelocked for
> > a long time, please break out of the loop" is not a substitute for
> > fixing the livelocks.  So I don't think you get to disable
> > interception of #DB and #AC.
>
> I think that can be incorporated into a module param, i.e. let the platform
> owner decide which tool(s) they want to use to mitigate the legacy architecture
> flaws.

What's the point?  Surely the kernel should reliably mitigate the
flaw, and the kernel should decide how to do so.

>
> > I also think you should print a loud warning
>
> I'm not so sure on this one, e.g. userspace could just spin up a new instance
> of its malicious guest and spam the kernel log.

pr_warn_once()?  If this triggers, it's a *bug*, right?  Kernel or CPU.

>
> > and have some intelligent handling when this new exit triggers.
>
> We discussed something similar in the context of the new bus lock VM-Exit.  I
> don't know that it makes sense to try and add intelligence into the kernel.
> In many use cases, e.g. clouds, the userspace VMM is trusted (inasmuch as
> userspace can be trusted), while the guest is completely untrusted.  Reporting
> the error to userspace and letting the userspace stack take action is likely
> preferable to doing something fancy in the kernel.
>
>
> Tao, this patch should probably be tagged RFC, at least until we can experiment
> with the threshold on real silicon.  KVM and kernel behavior may depend on the
> accuracy of detecting actual attacks, e.g. if we can set a threshold that has
> zero false negatives and near-zero false positives, then it probably makes sense
> to be more assertive in how such VM-Exits are reported and logged.

If you can actually find a threshold that reliably mitigates the bug
and does not allow a guest to cause undesirably large latency in the
host, then fine.  1/10 of a tick is way too long, I think.
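
To put rough numbers on that, here is a small sketch. It assumes "tick" means
the 1/HZ timer tick and uses the commonly offered CONFIG_HZ values; the helper
name tenth_of_tick_us is made up for illustration, and the right HZ depends on
how the host kernel was built.

```python
# Commonly offered CONFIG_HZ choices (assumption: the host uses one of these).
COMMON_HZ = (100, 250, 300, 1000)


def tenth_of_tick_us(hz: int) -> float:
    """Return one tenth of a 1/HZ timer tick, in microseconds."""
    tick_us = 1_000_000 / hz  # length of one full tick in microseconds
    return tick_us / 10


if __name__ == "__main__":
    for hz in COMMON_HZ:
        print(f"HZ={hz}: 1/10 tick = {tenth_of_tick_us(hz):.1f} us")
```

Under those assumptions, a guest could hold the CPU for somewhere between
100 us and 1 ms before the exit fires.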