On Mon, Nov 2, 2020 at 9:31 AM Sean Christopherson
<sean.j.christopherson@xxxxxxxxx> wrote:
>
> On Mon, Nov 02, 2020 at 08:43:30AM -0800, Andy Lutomirski wrote:
> > On Sun, Nov 1, 2020 at 10:14 PM Tao Xu <tao3.xu@xxxxxxxxx> wrote:
> > > 2. Another patch to disable interception of #DB and #AC when notify
> > > VM-Exiting is enabled.
> >
> > Whoa there.
> >
> > A VM control that says "hey, CPU, if you messed up and livelocked for
> > a long time, please break out of the loop" is not a substitute for
> > fixing the livelocks. So I don't think you get to disable
> > interception of #DB and #AC.
>
> I think that can be incorporated into a module param, i.e. let the platform
> owner decide which tool(s) they want to use to mitigate the legacy architecture
> flaws.

What's the point? Surely the kernel should reliably mitigate the flaw,
and the kernel should decide how to do so.

>
> > I also think you should print a loud warning
>
> I'm not so sure on this one, e.g. userspace could just spin up a new instance
> if its malicious guest and spam the kernel log.

pr_warn_once()? If this triggers, it's a *bug*, right? Kernel or CPU.

>
> > and have some intelligent handling when this new exit triggers.
>
> We discussed something similar in the context of the new bus lock VM-Exit. I
> don't know that it makes sense to try and add intelligence into the kernel.
> In many use cases, e.g. clouds, the userspace VMM is trusted (inasmuch as
> userspace can be trusted), while the guest is completely untrusted. Reporting
> the error to userspace and letting the userspace stack take action is likely
> preferable to doing something fancy in the kernel.
>
>
> Tao, this patch should probably be tagged RFC, at least until we can experiment
> with the threshold on real silicon. KVM and kernel behavior may depend on the
> accuracy of detecting actual attacks, e.g. if we can set a threshold that has
> zero false negatives and near-zero false positives, then it probably makes sense
> to be more assertive in how such VM-Exits are reported and logged.

If you can actually find a threshold that reliably mitigates the bug
and does not allow a guest to cause undesirably large latency in the
host, then fine. 1/10 of a tick is way too long, I think.