On 2/16/23 19:18, Sean Christopherson wrote:
On Thu, Feb 16, 2023, Peter Gonda wrote:
From what we are seeing, there are at least 2 IOCTLs that VMM is
issuing on the source VM after the migration is completed. The first
one is KVM_IOEVENTFD for unwiring an eventfd used for the NVMe admin
queue during the NVMe device unplug sequence. The second IOCTL is
KVM_SET_USER_MEMORY_REGION for removing the memslots during VM
destruction. Failing any of these IOCTLs will cause the migration to
fail.
Does the VMM _need_ to cleanly teardown the source VM? If so, why?
I can see why it is much easier on userspace to use the regular clean up
sequence. However, I'm not sure about allowing a specific subset of
ioctls, and in fact KVM_SET_USER_MEMORY_REGION is one of those that are
more "scary" because it interacts with page table management. This is
not a big problem for SEV, but it's worse for TDX.
1) If we want to keep the vm_dead logic as is, this means changing to
VMM code in some pretty hacky way. We will need to distinguish between
regular VM shutdown to VM shutdown after migration. We will also need
to make absolutely sure that we don't leave any dangling data in the
kernel by skipping some of the cleanup stages.
If I were doing it in the VMM, I would wrap the ioctls with something like
ret = ioctl(vm->fd, cmd, arg);
if (ret == -1 && errno == EIO && vm->is_migrated ) {
switch (cmd) {
case KVM_IOEVENTFD:
/* safe because MMIO cannot be issued by the VM. */
case KVM_SET_USER_MEMORY_REGION:
/* safe because memory cannot be accessed by the VM. */
...
return 0;
}
}
which is yucky (the list of ioctls is hard to maintain) but at least
it's self-contained and documented.
2) If we want to remove the vm_dead logic we can simply not mark the
vm as dead after migration. It looks like it will just work but might
create special cases where IOCTLs can be called on a TD which isn't
valid anymore. From what I can tell, some of these code paths are
already protected by a check if hkid is assigned so it might not be a
big issue. Not sure how this will work for SEV but I'm guessing
there's a similar mechanism there as well.
3) We can also go half way and only block certain memory encryption
related IOCTLs if the VM got migrated. This will likely require more
changes when we try to push this ustream since it will require adding
a new field for vm_mem_enc_dead (or something similar) in addition to
the current vm_bugged and vm_dead.
I agree option 2 or 3 seem preferable. Option two sounds good to me, I
am not sure why we needed to disable all IOCLTs on the source VM after
the migration. I was just taking feedback on the review.
I don't like #2. For all intents and purposes, the source VM _is_ dead, or at
least zombified. It _was_ an SEV guest, but after migration it's no longer an
SEV guest, e.g. doesn't have a dedicated ASID, etc. But the CPUID state and a
pile of register state won't be coherent, especially on SEV-ES where KVM doesn't
have visibility into guest state.
I agree that (2) is the worst option.
We have the ASID similar to the HKID in SEV. I don't think the code
paths are already protected like you mention TDX is but that seems
like a simple enough fix. Or maybe it's better to introduce a new
VM_MOVED like VM_BUGGED and VM_DEAD which allows most IOCTLs but just
disables running vCPUs.
I kinda like the idea of a VM_MOVED flag, but I'm a bit leary of it from a
a maintenance and ABI perspective. Definining and documenting what ioctls()
are/aren't allowed would get rather messy. The beauty of VM_DEAD is that it's
all or nothing.
Yes, it's a bit of a cop out but _in practice_ VM_MOVED is going to be a
whack-a-mole job where each bug can have security consequences.
As above, I'm concerned with KVM's safety, not the guest's.
+1
Depending on why the source VM needs to be cleaned up, one thought would be add
a dedicated ioctl(), e.g. KVM_DISMANTLE_VM, and make that the _only_ ioctl() that's
allowed to operate on a dead VM. The ioctl() would be defined as a best-effort
mechanism to teardown/free internal state, e.g. destroy KVM_PIO_BUS, KVM_MMIO_BUS,
and memslots, zap all SPTEs, etc...
If we have to write the code we might as well do it directly at
context-move time, couldn't we? I like the idea of minimizing the
memory cost of the zombie VM.
Paolo