On Wed, Jan 25, 2023 at 1:46 PM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote: > > On 1/25/23 17:47, Jim Mattson wrote: > >> Part of the definition of the API is that you can take > >> KVM_GET_SUPPORTED_CPUID and pass it to KVM_SET_CPUID2 for all vCPUs. > >> Returning host topology information for a random host vCPU definitely > >> violates the contract. > > > > You are attempting to rewrite history. Leaf 0xB was added to > KVM_GET_SUPPORTED_CPUID in commit 0771671749b5 ("KVM: Enhance guest > > cpuid management"), and the only documentation of the > > KVM_GET_SUPPORTED_CPUID ioctl at that time was in the commit message: > > > > - KVM_GET_SUPPORTED_CPUID: get all cpuid entries the host (and kvm) > > supports > > > > [...] the intention was to return the > > host topology information for the current logical processor. > > The handling of unknown features is so naive in that commit, that I > don't think it is possible to read anything from the implementation; and > it certainly should not be a paragon for a future-proof implementation > of KVM_GET_SUPPORTED_CPUID. > > For example, it only hid _known_ CPUID leaves or features and passed the > unknown ones through, which you'll agree is completely broken. It also > didn't try to handle all leaves for which ECX might turn out to be > significant---which happened for EAX=7 so the commit returns a wrong > output for CPUID[EAX=7,ECX=0].EAX. > > In other words, the only reason it handles 0xB is because it was known > to have subleaves. > > We can get more information about how userspace was intended to use it > from the qemu-kvm fork, which at the time was practically the only KVM > userspace. As of 2009 it was only checking a handful of leaves: > > https://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git/tree/target-i386/kvm.c?h=kvm-88#n133 > > so shall we say that userspace is supposed to build each CPUID leaf one > by one and use KVM_GET_SUPPORTED_CPUID2 for validation only? I think > the first committed documentation agrees: "Userspace can use the > information returned by this ioctl to construct cpuid information (for > KVM_SET_CPUID2) that is consistent with hardware, kernel, and userspace > capabilities, and with user requirements". > > However, that's the theory. "Do not break userspace" also involves > looking at how userspace *really* uses the API, and make compromises to > cater to those uses; which is different from rewriting history. > > And in practice, people basically stopped reading after "(for > KVM_SET_CPUID2)". > > For example in kvmtool: > > kvm_cpuid->nent = MAX_KVM_CPUID_ENTRIES; > if (ioctl(vcpu->kvm->sys_fd, KVM_GET_SUPPORTED_CPUID, kvm_cpuid) < 0) > die_perror("KVM_GET_SUPPORTED_CPUID failed"); > > filter_cpuid(kvm_cpuid, vcpu->cpu_id); > > if (ioctl(vcpu->vcpu_fd, KVM_SET_CPUID2, kvm_cpuid) < 0) > die_perror("KVM_SET_CPUID2 failed"); > > where filter_cpuid only does minor adjustments that do not include 0xB, > 0x1F and 0x8000001E. The result is a topology that makes no sense if > host #vCPUs != guest #vCPUs, and which doesn't include the correct APIC > id in EDX. > > https://github.com/kvmtool/kvmtool/blob/5657dd3e48b41bc6db38fa657994bc0e030fd31f/x86/cpuid.c > > > crosvm does optionally attempt to pass through leaves 0xB and 0x1F, but > it fails to adjust the APIC id in EDX. On the other hand it also passes > through 0x8000001E if ctx.cpu_config.host_cpu_topology is false, > incorrectly. So on one hand this patch breaks host_cpu_topology == > true, on the other hand it fixes host_cpu_topology == false on AMD > processors. > > https://github.com/google/crosvm/blob/cc79897fc0813ee8412e6395648593898962ec82/x86_64/src/cpuid.rs#L121 > > > The rust-vmm reference hypervisor adjusts the APIC id in EDX for 0xB but > not for 0x1F. Apart from that it passes through the host topology > leaves, again resulting in nonsensical topology for host #vCPUs != guest > #vCPUs. > > https://github.com/rust-vmm/vmm-reference/blob/5cde58bc955afca8a180585a9f01c82d6277a755/src/vm-vcpu-ref/src/x86_64/cpuid.rs > > > Firecracker, finally, ignores KVM_GET_SUPPORTED_CPUID's output for 0xb > and 0x8000001E (good!) but fails to do the same for 0x1F, so this patch > is again a fix of sorts---having all zeroes in 0x1F is better than > having a value that is valid but inconsistent with 0xB. > > https://github.com/firecracker-microvm/firecracker/blob/cdf4fef3011c51206f178bdf21ececb87caa16c1/src/cpuid/src/transformer/intel.rs#L120 > https://github.com/firecracker-microvm/firecracker/blob/cdf4fef3011c51206f178bdf21ececb87caa16c1/src/cpuid/src/transformer/amd.rs#L88 > > > So basically the only open source userspace that is penalized (but not > broken, and also partly fixed) by the patch is crosvm. QEMU doesn't > care, while firecracker/kvmtool/vmm-reference are a net positive. > > Paolo The topology leaves returned by KVM_GET_SUPPORTED_CPUID *for over a decade* have been passed through unmodified from the host. They have never made sense for KVM_SET_CPUID2, with the unlikely exception of a whole-host VM. Our VMM populates the topology of the guest CPUID table on its own, as any VMM must. However, it uses the host topology (which KVM_GET_SUPPORTED_CPUID has been providing pass-through *for over a decade*) to see if the requested guest topology is possible. Changing a long-established ABI in a way that breaks userspace applications is a bad practice. I didn't think we, as a community, did that. I didn't realize that we were only catering to open source implementations here.