Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase

Emanuele Giuseppe Esposito <eesposit@xxxxxxxxxx> · Fri, 9 Sep 2022 10:00:47 +0200

Hi Peter,

Apologies for the delay to answer you.

[...]
>>
>> - Doing the above is still not enough, as KVM figures what operation to
>> do depending on the current state of the memslots.
>> Assuming we already have an already existing MR y, and now we get the
>> list DELETE(y) CREATE(y/2) (ie reducing y to half its size).
>> In this case the interval tree can't do anything, but it's very hard to
>> understand that the second request in the list is a CREATE, because when
>> KVM performs the check to understand which type of operation it is
>> (before doing any modification at all in the memslot list), it finds
>> that y (memslot with same id) exist, but has a different size than what
>> provided from userspace, therefore it could either fail, or misinterpret
>> it as another operation (currently fails -EINVALID).
> 
> Another good question..  I think that can be partly solved by not allowing
> ID duplication in the same batched ioctl, or maybe we can fail it.  From
> qemu perspective, we may need to change the memslot id allocation to not
> reuse IDs of being-deleted memslots until the batch is processed.
> 
> Something like adding similar INVALID tags to kvm memslots in QEMU when we
> are preparing the batch, then we should only reset memory_size==0 and clear
> INVALID tag after the ioctl returns.  Then during the batch, any new slots
> to be added from kvm_get_free_slot() will not return any duplication of a
> deleting one.

First of all, you're right. No interval tree is needed.

I think the approach I am currently using is something like what you
described above: when a DELETE operation is created in QEMU (there is no
MOVE), I set the invalid_slot flag to 1.
Then KVM will firstly process the requests with invalid_slot == 1, mark
the memslot invalid, and then process all the others (invalid included,
as they need the actual DELETE/MOVE operation).

> 
>> If we instead already provide the labels, then we can:
>> 	1. substitute the memslots pointed by DELETE/MOVE with invalid & swap
>> (so it is visible, non-atomic. But we don't care, as we are not deleting
>> anything)
>> 	2. delete the invalid memslot (in the inactive memslot list)
>> 	3. process the other requests (in the inactive memslot list)
>> 	4. single and atomic swap (step 2 and 3 are now visible).
>>
>> What do you think?
> 
> Adding some limitation to the new ioctl makes sense to me, especially
> moving the DELETE/MOVE to be handled earlier, rather than a complete
> mixture of all of them.
> 
> I'm wondering whether the batch ioctl can be layed out separately on the
> operations, then it can be clearly documented on the sequence of handling
> each op:
> 
>   struct kvm_userspace_memory_region_batch {
>          struct kvm_userspace_memory_region deletes[];
>          struct kvm_userspace_memory_region moves[];
>          struct kvm_userspace_memory_region creates[];
>          struct kvm_userspace_memory_region flags_only[];
>   };
> 
> So that the ordering can be very obvious.  But I didn't really think deeper
> than that.

No, I think it doesn't work. Oder needs to be preserved, I see many
DELETE+CREATE operations on the same slot id.
> 
> Side note: do we use MOVE at all in QEMU?

Nope :)

> 
>>
>> Bonus question: with this atomic operation, do we really need the
>> invalid memslot logic for MOVE/DELETE?
> 
> I think so.  Not an expert on that, but iiuc that's to make sure we'll zap
> all shadow paging that maps to the slots being deleted/moved.
> 
> Paolo can always help to keep me honest above.

Yes, we need to keep that logic.

v2 is coming soon.

Thank you,
Emanuele