Re: Interaction between host-side mprotect() and KVM MMU

Martin Lucina <martin@xxxxxxxxxx> · Fri, 24 May 2019 14:03:28 +0200

On Thursday, 23.05.2019 at 07:53, Sean Christopherson wrote:
> > > > c. In order to enforce W^X both ways I'd like to have case (2) also fail
> > > > with EFAULT, is this possible?
> > > 
> > > Not without modifying KVM and the kernel (if you want to do it through
> > > mprotect()).
> > 
> > Hooking up the full EPT protection bits available to KVM via mprotect()
> > would be the best solution for us, and could also give us the ability to
> > have execute-only pages on x86, which is a nice defence against ROP attacks
> > in the guest. However, I can see now that this is not a trivial
> > undertaking, especially across the various MMU models (tdp, softmmu) and
> > architectures dealt with by the core KVM code.
> > 
> > N.B. We also have tender implementations for bhyve and OpenBSD vmm, and at
> > least in the OpenBSD case some community contributors are looking into
> > developing an "ept_mprotect" for precisely this use-case, though their vmm
> > code is much simpler (and does less) compared to KVM.
> > 
> > I take it there's no other way to mark a range of pages as NX by the guest
> > from the host side, so if we want this without modifying KVM and the
> > kernel, the only way to get it would be to set up "real" page tables inside
> > the guest ...?
> 
> Correct, KVM does not currently support marking pages NX from the host.  But
> note that when EPT is enabled, KVM does not intercept writes to CR3, i.e.
> the guest can configure and load its own page tables to bypass the
> restrictions of the tender, which may or may not be an issue.

I'm aware of that. I've considered various options over time, including
running untrusted guest code in Ring 3, but that would require quite a bit
more work on the loader side to provide Ring 0 infrastructure in the
guest (e.g. exception reporting), which complicates the architecture and
"supply chain".
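
For reference, the "real" page tables route would come down to something
like the following when the loader builds the guest's x86-64 page tables. This
is only a rough sketch: the helper name and macros are made up for
illustration and do not come from our code, and the NX bit only takes effect
once EFER.NXE is set in the guest.

```c
#include <stdint.h>

/* x86-64 page table entry bits (4 KiB pages). */
#define PTE_PRESENT   (1ULL << 0)
#define PTE_WRITABLE  (1ULL << 1)
#define PTE_NX        (1ULL << 63)  /* honoured only when EFER.NXE is set */
#define PTE_ADDR_MASK 0x000ffffffffff000ULL

/*
 * Hypothetical helper: build a leaf PTE that enforces W^X.
 * A request for a page that is both writable and executable is refused
 * by returning 0, i.e. a not-present entry.
 */
uint64_t wx_pte(uint64_t paddr, int writable, int executable)
{
    uint64_t pte;

    if (writable && executable)
        return 0;

    pte = (paddr & PTE_ADDR_MASK) | PTE_PRESENT;
    if (writable)
        pte |= PTE_WRITABLE;
    if (!executable)
        pte |= PTE_NX;
    return pte;
}
```

As noted above, none of this stops a Ring 0 guest from loading its own CR3,
so it only helps against accidents, not against a malicious guest.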

> On the other hand, modifying KVM to support NX via mprotect() in a limited
> capacity might be a relatively low effort option, e.g. support it as a
> per-module opt-in feature only when using TDP (EPT or NPT).

That would be an interesting feature, especially if it would also enable
marking guest pages as execute-only on a TDP host. Why the opt-in? To avoid
breaking existing userspace relying on the existing mprotect() behaviour?
Do you think it could be implemented as a run-time opt-in, e.g. via a
new KVM_CAP_*?
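
To make the question concrete, this is roughly how I would expect userspace
to consume such an opt-in, using the existing KVM_CHECK_EXTENSION and
KVM_ENABLE_CAP ioctls. It is a sketch only: no capability for NX or
execute-only via mprotect() exists today, so the capability number is passed
in as a placeholder, and the function name is invented. It also assumes the
kernel supports per-VM KVM_ENABLE_CAP (KVM_CAP_ENABLE_CAP_VM).

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Try to opt a VM in to a capability at run time.
 *
 * `cap` is a placeholder: the caller supplies whatever KVM_CAP_* number a
 * future kernel might define for this feature.
 *
 * Returns 0 if the capability was enabled, 1 if the kernel does not
 * advertise it, and -1 on error (errno is set by ioctl()).
 */
int vm_opt_in(int kvm_fd, int vm_fd, unsigned int cap)
{
    struct kvm_enable_cap enable;
    int ret;

    /* Ask the kernel whether it knows about the capability at all. */
    ret = ioctl(kvm_fd, KVM_CHECK_EXTENSION, (unsigned long)cap);
    if (ret < 0)
        return -1;
    if (ret == 0)
        return 1;

    /* Enable it for this VM only; other VMs keep the old behaviour. */
    memset(&enable, 0, sizeof(enable));
    enable.cap = cap;
    if (ioctl(vm_fd, KVM_ENABLE_CAP, &enable) < 0)
        return -1;
    return 0;
}
```

A tender would call this once after KVM_CREATE_VM and fall back to the
current behaviour if it returns 1.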

Martin