On 3/13/2024 6:55 AM, Sean Christopherson wrote:
-non-KVM people, +Mingwei, Aaron, Oliver, and Jim
On Sun, Feb 18, 2024, Yang Weijiang wrote:
case MSR_IA32_PERF_CAPABILITIES:
if (data && !vcpu_to_pmu(vcpu)->version)
return 1;
Ha, perfect, this is already in the diff context.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c0ed69353674..281c3fe728c5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1849,6 +1849,36 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
}
EXPORT_SYMBOL_GPL(kvm_msr_allowed);
+#define CET_US_RESERVED_BITS GENMASK(9, 6)
+#define CET_US_SHSTK_MASK_BITS GENMASK(1, 0)
+#define CET_US_IBT_MASK_BITS (GENMASK_ULL(5, 2) | GENMASK_ULL(63, 10))
+#define CET_US_LEGACY_BITMAP_BASE(data) ((data) >> 12)
+
+static bool is_set_cet_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u64 data,
+ bool host_initiated)
+{
...
+ /*
+ * If KVM supports the MSR, i.e. has enumerated the MSR existence to
+ * userspace, then userspace is allowed to write '0' irrespective of
+ * whether or not the MSR is exposed to the guest.
+ */
+ if (!host_initiated || data)
+ return false;
...
@@ -1951,6 +2017,20 @@ static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
!guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
return 1;
break;
+ case MSR_IA32_U_CET:
+ case MSR_IA32_S_CET:
+ if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
+ !guest_can_use(vcpu, X86_FEATURE_IBT))
+ return 1;
As pointed out by Mingwei in a conversation about PERF_CAPABILITIES, rejecting
host *reads* while allowing host writes of '0' is inconsistent. Which, while
arguably par for the course for KVM's ABI, will likely result in the exact problem
we're trying to avoid: killing userspace because it attempts to access an MSR KVM
has said exists.
Thank you for the notification!
Agree on it.
PERF_CAPABILITIES has a similar, but opposite, problem where KVM returns a non-zero
value on reads, but rejects that same non-zero value on write. PERF_CAPABILITIES
is even more complicated because KVM stuffs a non-zero value at vCPU creation, but
that's not really relevant to this discussion, just another data point for how
messed up this all is.
Also relevant to this discussion are KVM's PV MSRs, e.g. MSR_KVM_ASYNC_PF_ACK,
as KVM rejects attempts to write '0' if the guest doesn't support the MSR, but
if and only if userspace has enabled KVM_CAP_ENFORCE_PV_FEATURE_CPUID.
Coming to the point, this mess is getting too hard to maintain, both from a code
perspective and "what is KVM's ABI?" perspective.
Rather than play whack-a-mole and inevitably end up with bugs and/or inconsistencies,
what if we (a) return KVM_MSR_RET_INVALID when an MSR access is denied based on
guest CPUID,
Can we define a new return value, KVM_MSR_RET_REJECTED, for this case, to distinguish it from KVM_MSR_RET_INVALID, which means the MSR index doesn't exist?
(b) wrap userspace MSR accesses at the very top level and convert
KVM_MSR_RET_INVALID to "success" when KVM reported the MSR as savable and userspace
is reading or writing '0',
Yes, this can limit the change on KVM side.
and (c) drop all of the host_initiated checks that
exist purely to exempt userspace access from guest CPUID checks.
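On the KVM_MSR_RET_REJECTED question above, a minimal sketch of the distinction, written as a standalone userspace model rather than kernel code. KVM_MSR_RET_REJECTED does not exist today; its value, the other enum values, struct msr_model and classify_msr_access() are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values; KVM_MSR_RET_REJECTED is hypothetical (proposed above). */
enum {
	MSR_RET_OK = 0,
	MSR_RET_FAULT,			/* generic failure, i.e. "return 1" */
	KVM_MSR_RET_INVALID,		/* MSR index is unknown to KVM */
	KVM_MSR_RET_REJECTED,		/* MSR is known, but denied by guest CPUID */
};

/* Hypothetical model of one MSR from KVM's point of view. */
struct msr_model {
	uint32_t index;
	bool known_to_kvm;
	bool exposed_by_guest_cpuid;
};

/* Report *why* an access is denied, not just that it was denied. */
static int classify_msr_access(const struct msr_model *msr)
{
	assert(msr != NULL);

	if (!msr->known_to_kvm)
		return KVM_MSR_RET_INVALID;
	if (!msr->exposed_by_guest_cpuid)
		return KVM_MSR_RET_REJECTED;
	return MSR_RET_OK;
}
```

With two codes, a top-level wrapper could convert only the "denied by guest CPUID" case to success for userspace, while an unknown index keeps failing.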
The only possible hiccup I can think of is that this could subtly break userspace
that is setting CPUID _after_ MSRs, but my understanding is that we've agreed to
draw a line and say that that's unsupported.
Yeah, it would mess things up.
And I think it's low risk, because
I don't see how code like this:
	case MSR_TSC_AUX:
		if (!kvm_is_supported_user_return_msr(MSR_TSC_AUX))
			return 1;

		if (!host_initiated &&
		    !guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP) &&
		    !guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
			return 1;

		if (guest_cpuid_is_intel(vcpu) && (data >> 32) != 0)
			return 1;
can possibly work if userspace sets MSRs first. The RDTSCP/RDPID checks are
exempt, but the vendor in guest CPUID would be '0', not Intel's magic string,
and so setting MSRs before CPUID would fail, at least if the target vCPU model
is Intel.
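The ordering dependence can be modelled in userspace. This is only a sketch of the quoted snippet: struct vcpu_model, guest_is_intel() and set_tsc_aux() are hypothetical stand-ins, and the kvm_is_supported_user_return_msr() check is left out. It shows that the Intel-only check is evaluated against whatever vendor guest CPUID holds at the time of the write, so the same write gets a different answer before and after CPUID is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the guest CPUID state of a vCPU. */
struct vcpu_model {
	char vendor[13];	/* empty until userspace sets CPUID */
	bool has_rdtscp;
	bool has_rdpid;
};

static bool guest_is_intel(const struct vcpu_model *vcpu)
{
	return strcmp(vcpu->vendor, "GenuineIntel") == 0;
}

/* Mirrors the quoted MSR_TSC_AUX checks; returns 0 on success, 1 on failure. */
static int set_tsc_aux(const struct vcpu_model *vcpu, uint64_t data,
		       bool host_initiated)
{
	assert(vcpu != NULL);

	/* Userspace (host_initiated) is exempt from the feature checks. */
	if (!host_initiated && !vcpu->has_rdtscp && !vcpu->has_rdpid)
		return 1;

	/* Not exempt: depends on the vendor string present right now. */
	if (guest_is_intel(vcpu) && (data >> 32) != 0)
		return 1;

	return 0;
}
```

In this model, a host write with bits 63:32 set returns 0 while the vendor is still empty and 1 once the vendor is "GenuineIntel", i.e. MSRs-then-CPUID and CPUID-then-MSRs do not behave the same.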
P.S. I also want to rename KVM_MSR_RET_INVALID => KVM_MSR_RET_UNSUPPORTED, because
I can never remember that "invalid" doesn't mean the value was invalid, it means
the MSR index was invalid.
So do I :-)
It'll take a few patches, but I believe we can end up with something like this:
static bool kvm_is_msr_to_save(u32 msr_index)
{
	unsigned int i;

	for (i = 0; i < num_msrs_to_save; i++) {
		if (msrs_to_save[i] == msr_index)
			return true;
	}
Should we also check emulated_msrs list here since KVM_GET_MSR_INDEX_LIST exposes it too?
	return false;
}
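If the answer to the emulated_msrs question is yes, the lookup might be extended roughly as below. This is a userspace sketch under assumptions: the list contents are placeholders rather than KVM's real lists, and msr_in_list() and is_msr_to_save_or_emulated() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Placeholder contents; these are NOT KVM's real lists. */
static const uint32_t msrs_to_save[] = { 0x6a0, 0x6a2 };
static const uint32_t emulated_msrs[] = { 0x4b564d07 };

/* Linear search, same shape as the loop in kvm_is_msr_to_save(). */
static bool msr_in_list(const uint32_t *list, size_t n, uint32_t msr_index)
{
	size_t i;

	assert(list != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (list[i] == msr_index)
			return true;
	}
	return false;
}

/* Hypothetical variant that also consults the emulated list. */
static bool is_msr_to_save_or_emulated(uint32_t msr_index)
{
	return msr_in_list(msrs_to_save, ARRAY_LEN(msrs_to_save), msr_index) ||
	       msr_in_list(emulated_msrs, ARRAY_LEN(emulated_msrs), msr_index);
}
```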

typedef int (*msr_uaccess_t)(struct kvm_vcpu *vcpu, u32 index, u64 *data,
			     bool host_initiated);

static __always_inline int kvm_do_msr_uaccess(struct kvm_vcpu *vcpu, u32 msr,
					      u64 *data, bool host_initiated,
					      enum kvm_msr_access rw,
					      msr_uaccess_t msr_uaccess_fn)
{
	const char *op = rw == MSR_TYPE_W ? "wrmsr" : "rdmsr";
	int ret;

	BUILD_BUG_ON(rw != MSR_TYPE_R && rw != MSR_TYPE_W);

	/*
	 * Zero the data on read failures to avoid leaking stack data to the
	 * guest and/or userspace, e.g. if the failure is ignored below.
	 */
	ret = msr_uaccess_fn(vcpu, msr, data, host_initiated);
	if (ret && rw == MSR_TYPE_R)
		*data = 0;

	if (ret != KVM_MSR_RET_UNSUPPORTED)
		return ret;

	/*
	 * Userspace is allowed to read MSRs, and write '0' to MSRs, that KVM
	 * reports as to-be-saved, even if an MSR isn't fully supported.
	 * Simply check that @data is '0', which covers both the write '0' case
	 * and all reads (in which case @data is zeroed on failure; see above).
	 */
	if (kvm_is_msr_to_save(msr) && !*data)
		return 0;

	if (!ignore_msrs) {
		kvm_debug_ratelimited("unhandled %s: 0x%x data 0x%llx\n",
				      op, msr, *data);
		return ret;
	}

	if (report_ignored_msrs)
		kvm_pr_unimpl("ignored %s: 0x%x data 0x%llx\n", op, msr, *data);

	return 0;
}
The handling flow looks good to me. Thanks a lot!
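To make the resulting behaviour concrete, here is a userspace model of that flow. It is a sketch under assumptions, not the proposed kernel code: the return values are placeholders, struct uaccess_cfg stands in for kvm_is_msr_to_save() and ignore_msrs, the logging is dropped, and deny_all() is a hypothetical backend that rejects every access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { MSR_OK = 0, MSR_FAULT = 1, MSR_UNSUPPORTED = 2 };	/* placeholder values */
enum msr_rw { MSR_READ, MSR_WRITE };

/* Hypothetical knobs standing in for kvm_is_msr_to_save() and ignore_msrs. */
struct uaccess_cfg {
	bool msr_is_to_save;
	bool ignore_msrs;
};

typedef int (*msr_fn_t)(uint32_t msr, uint64_t *data);

/* Stub backend: denies every access and leaves *data untouched. */
static int deny_all(uint32_t msr, uint64_t *data)
{
	(void)msr;
	(void)data;
	return MSR_UNSUPPORTED;
}

static int do_msr_uaccess(const struct uaccess_cfg *cfg, uint32_t msr,
			  uint64_t *data, enum msr_rw rw, msr_fn_t fn)
{
	int ret;

	assert(cfg != NULL && data != NULL && fn != NULL);

	ret = fn(msr, data);
	if (ret && rw == MSR_READ)
		*data = 0;	/* never hand back stale data on a failed read */

	if (ret != MSR_UNSUPPORTED)
		return ret;

	/* Failed reads (zeroed above) and writes of '0' succeed for saved MSRs. */
	if (cfg->msr_is_to_save && !*data)
		return MSR_OK;

	return cfg->ignore_msrs ? MSR_OK : ret;
}
```

In this model, a read of a to-be-saved but unsupported MSR succeeds and yields 0, a write of 0 succeeds, and a non-zero write still fails unless ignore_msrs is set.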