Issue with "KVM: SEV: Add support for SEV intra host migration"

Sagi Shahar <sagis@xxxxxxxxxx> · Mon, 13 Feb 2023 12:07:08 -0800

TL;DR
Marking an SEV VM as dead after intra-host migration prevents cleanly tearing
down said VM.

We are testing our POC code for TDX copyless migration and notice some
issues. We are currently using a similar scheme to the one used for
SEV where the VM is marked as dead after the migration is completed
which prevents any other IOCTLs from being triggered on the VM.

>From what we are seeing, there are at least 2 IOCTLs that VMM is
issuing on the source VM after the migration is completed. The first
one is KVM_IOEVENTFD for unwiring an eventfd used for the NVMe admin
queue during the NVMe device unplug sequence. The second IOCTL is
KVM_SET_USER_MEMORY_REGION for removing the memslots during VM
destruction. Failing any of these IOCTLs will cause the migration to
fail.

I ran 2 simple experiments to test the behavior of these IOCTLs.

First experiment was removing the code which marks the VM as dead.
After a small fix to the TDX migration logic it looks like both IOCTLs
can succeed even after the VM got migrated.

Second experiment was to always return success if the VM is marked as
dead. This simulates the case where these IOCTLs never get called from
VMM after the migration.

In both cases I'm seeing the same behavior in the overall migration
process. I'm getting a cgroup related error where it fails to delete
the CGROUP_CPU file but my guess is that this is not related to the
IOCTL issue.

I can see 3 options:

1) If we want to keep the vm_dead logic as is, this means changing to
VMM code in some pretty hacky way. We will need to distinguish between
regular VM shutdown to VM shutdown after migration. We will also need
to make absolutely sure that we don't leave any dangling data in the
kernel by skipping some of the cleanup stages.

2) If we want to remove the vm_dead logic we can simply not mark the
vm as dead after migration. It looks like it will just work but might
create special cases where IOCTLs can be called on a TD which isn't
valid anymore. From what I can tell, some of these code paths are
already  protected by a check if hkid is assigned so it might not be a
big issue. Not sure how this will work for SEV but I'm guessing
there's a similar mechanism there as well.

3) We can also go half way and only block certain memory encryption
related IOCTLs if the VM got migrated. This will likely require more
changes when we try to push this ustream since it will require adding
a new field for vm_mem_enc_dead (or something similar) in addition to
the current vm_bugged and vm_dead.

Personally, I don't want to go with option (1) since it sounds quite
risky to make these kind of changes without fully understanding all
the possible side effects.

I prefer either option (2) or (3) but I don't know which one will be
more acceptable by the community.