Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Thu, 29 Sep 2022 17:28:54 +0200

On 9/28/22 22:41, Sean Christopherson wrote:
On Wed, Sep 28, 2022, Paolo Bonzini wrote:
On 9/28/22 17:58, Sean Christopherson wrote:
I don't disagree that the memslots API is lacking, but IMO that is somewhat
orthogonal to fixing KVM x86's "code fetch to MMIO" mess.  Such a massive new API
should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
like to have the luxury of being able to explore ideas beyond "let userspace
batch memslot updates", and I really don't want to feel pressured to get this
code reviewed and merge.

I absolutely agree that this is not a bugfix.  Most new features for KVM can
be seen as bug fixes if you squint hard enough, but they're still features.

I guess I'm complaining that there isn't sufficient justification for this new
feature.  The cover letter provides a bug that would be fixed by having batched
updates, but as above, that's really due to deficiencies in a different KVM ABI.

I disagree.  Failure to fetch should be fixed but is otherwise a red 
herring.  It's the whole memslot API (including dirty logging) that is a 
mess.

If you think we should overhaul it even more than just providing atomic 
batched updates, that's fine.  But still, the impossibility to perform 
atomic updates in batches *is* a suboptimal part of the KVM API.

   - Why can't this be solved in userspace?

I don't think *can't* is the right word.  If the metric of choice was 
"what can be solved in userspace", we'd all be using microkernels.  The 
question is why userspace would be a better place to solve it.

The only reason to do it in userspace would be if failure to fetch is 
something that is interesting to userspace, other than between two 
KVM_SET_USER_MEMORY_REGION.  Unless you provide an API to pass failures 
to fetch down to userspace, the locking in userspace is going to be 
inferior, because it would have to be unconditional.  This means worse 
performance and more complication, not to mention having to do it N 
times instead of 1 for N implementations.

Not forcing userspace to do "tricks" is in my opinion a strong part of 
deciding whether an API belongs in KVM.

   - What operations does userspace truly need?  E.g. if the only use case is to
     split/truncate/hole punch an existing memslot, can KVM instead provide a
     memslot flag and exit reason that allows kicking vCPUs to userspace if the
     memslot is accessed?  E.g. KVM_MEM_DISABLED that acts like an invalid memslot,
     but KVM exists with a dedicated exit reason instead of generating MMIO semantics.

The main cases are:

- for the boot case, splitting and merging existing memslots.  QEMU 
likes to merge adjacent memory regions into a single memslot, so if 
something goes from read-write to read-only it has to be split and vice 
versa.  I guess a "don't merge this memory region" flag would be the 
less hideous way to solve it in userspace.

- however, there is also the case of resizing an existing memslot which 
is what David would like to have for virtio-mem.  This is not really 
fixable because part of the appeal of virtio-mem is to have a single 
huge memslot instead of many smaller ones, in order to reduce the 
granularity of add/remove (David, correct me if I'm wrong).

(The latter _would_ be needed by other VMMs).

If updates only need to be "atomic" for an address space, does the API allowing
mixing non-SMM and SMM memslots?

I agree that the address space should be moved out of the single entries 
and into the header if we follow through with this approach.

The update needs to be "atomic", i.e. vCPUs
must never see an invalid/deleted memslot, but if the memslot is writable,
how does KVM prevent some writes from hitting the old HVA and some from hitting
the new HVA without a quiescent period?

(Heh, and I forgot likewise that non-x86 does not retry on 
KVM_MEMSLOT_INVALID.  Yes, that would be treated as a bug on other 
architectures).

Wholesale replacement was my first idea when I looked at the issue, I think
at the end of 2020.  I never got to a full implementation, but my impression
was that allocating/deallocating dirty bitmaps, rmaps etc. would make it any
easier than arbitrary batch updates.

It's not obvious to me that the memslot metadata is going to be easy to handle
regardless of what we do.  E.g. I'm pretty sure that batching updates will "corrupt"
the dirty bitmap if a hole is punched in a memslot that's being dirty logged.

Indeed; I would have thought that it is clear with the batch updates API 
(which requires the update to be split into delete and insert), but 
apparently it's not and it's by no means optimal.

Paolo