On 1/25/23 17:47, Jim Mattson wrote:
Part of the definition of the API is that you can take
KVM_GET_SUPPORTED_CPUID and pass it to KVM_SET_CPUID2 for all vCPUs.
Returning host topology information for a random host vCPU definitely
violates the contract.
You are attempting to rewrite history. Leaf 0xB was added to > KVM_GET_SUPPORTED_CPUID in commit 0771671749b5 ("KVM: Enhance guest
cpuid management"), and the only documentation of the
KVM_GET_SUPPORTED_CPUID ioctl at that time was in the commit message:
- KVM_GET_SUPPORTED_CPUID: get all cpuid entries the host (and kvm)
supports
[...] the intention was to return the
host topology information for the current logical processor.
The handling of unknown features is so naive in that commit, that I
don't think it is possible to read anything from the implementation; and
it certainly should not be a paragon for a future-proof implementation
of KVM_GET_SUPPORTED_CPUID.
For example, it only hid _known_ CPUID leaves or features and passed the
unknown ones through, which you'll agree is completely broken. It also
didn't try to handle all leaves for which ECX might turn out to be
significant---which happened for EAX=7 so the commit returns a wrong
output for CPUID[EAX=7,ECX=0].EAX.
In other words, the only reason it handles 0xB is because it was known
to have subleaves.
We can get more information about how userspace was intended to use it
from the qemu-kvm fork, which at the time was practically the only KVM
userspace. As of 2009 it was only checking a handful of leaves:
https://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git/tree/target-i386/kvm.c?h=kvm-88#n133
so shall we say that userspace is supposed to build each CPUID leaf one
by one and use KVM_GET_SUPPORTED_CPUID2 for validation only? I think
the first committed documentation agrees: "Userspace can use the
information returned by this ioctl to construct cpuid information (for
KVM_SET_CPUID2) that is consistent with hardware, kernel, and userspace
capabilities, and with user requirements".
However, that's the theory. "Do not break userspace" also involves
looking at how userspace *really* uses the API, and make compromises to
cater to those uses; which is different from rewriting history.
And in practice, people basically stopped reading after "(for
KVM_SET_CPUID2)".
For example in kvmtool:
kvm_cpuid->nent = MAX_KVM_CPUID_ENTRIES;
if (ioctl(vcpu->kvm->sys_fd, KVM_GET_SUPPORTED_CPUID, kvm_cpuid) < 0)
die_perror("KVM_GET_SUPPORTED_CPUID failed");
filter_cpuid(kvm_cpuid, vcpu->cpu_id);
if (ioctl(vcpu->vcpu_fd, KVM_SET_CPUID2, kvm_cpuid) < 0)
die_perror("KVM_SET_CPUID2 failed");
where filter_cpuid only does minor adjustments that do not include 0xB,
0x1F and 0x8000001E. The result is a topology that makes no sense if
host #vCPUs != guest #vCPUs, and which doesn't include the correct APIC
id in EDX.
https://github.com/kvmtool/kvmtool/blob/5657dd3e48b41bc6db38fa657994bc0e030fd31f/x86/cpuid.c
crosvm does optionally attempt to pass through leaves 0xB and 0x1F, but
it fails to adjust the APIC id in EDX. On the other hand it also passes
through 0x8000001E if ctx.cpu_config.host_cpu_topology is false,
incorrectly. So on one hand this patch breaks host_cpu_topology ==
true, on the other hand it fixes host_cpu_topology == false on AMD
processors.
https://github.com/google/crosvm/blob/cc79897fc0813ee8412e6395648593898962ec82/x86_64/src/cpuid.rs#L121
The rust-vmm reference hypervisor adjusts the APIC id in EDX for 0xB but
not for 0x1F. Apart from that it passes through the host topology
leaves, again resulting in nonsensical topology for host #vCPUs != guest
#vCPUs.
https://github.com/rust-vmm/vmm-reference/blob/5cde58bc955afca8a180585a9f01c82d6277a755/src/vm-vcpu-ref/src/x86_64/cpuid.rs
Firecracker, finally, ignores KVM_GET_SUPPORTED_CPUID's output for 0xb
and 0x8000001E (good!) but fails to do the same for 0x1F, so this patch
is again a fix of sorts---having all zeroes in 0x1F is better than
having a value that is valid but inconsistent with 0xB.
https://github.com/firecracker-microvm/firecracker/blob/cdf4fef3011c51206f178bdf21ececb87caa16c1/src/cpuid/src/transformer/intel.rs#L120
https://github.com/firecracker-microvm/firecracker/blob/cdf4fef3011c51206f178bdf21ececb87caa16c1/src/cpuid/src/transformer/amd.rs#L88
So basically the only open source userspace that is penalized (but not
broken, and also partly fixed) by the patch is crosvm. QEMU doesn't
care, while firecracker/kvmtool/vmm-reference are a net positive.
Paolo
Any future changes to either the operational behavior or the
documented behavior of the ABI surely demand a version bump.