On Fri, 2025-01-24 at 16:12 -0800, Sean Christopherson wrote: > On Fri, Jan 24, 2025, Maxim Levitsky wrote: > > On Wed, 2025-01-22 at 13:02 -0800, Sean Christopherson wrote: > > > > > > Note that I read all msrs using 'rdmsr' userspace tool. > > > > > > > > > > I'm pretty sure debugging via 'rdmsr', i.e. /dev/msr, isn't going to work. I > > > > > assume perf is clobbering LBR MSRs on context switch, but I haven't tracked that > > > > > down to confirm (the code I see on inspecition is gated on at least one perf > > > > > event using LBRs). My guess is that there's a software bug somewhere in the > > > > > perf/KVM exchange. > > > > > > > > > > I confirmed that using 'rdmsr' and 'wrmsr' "loses" values, but that hacking KVM > > > > > to read/write all LBRs during initialization works with LBRs disabled. > > > > Hi! > > > > I finally got to the very bottom of this: > > > > First of all, your assumption that the kernel resets LBR related msrs on > > context switch after 'wrmsr' program finishes execution is wrong, because the > > kernel will only do this if it *itself* enables the LBR feature (that is when > > something like 'perf', uses a perf counter with a lbr call stack). > > > > Writes that 'wrmsr' tool does are not something that kernel expects so it > > doesn't do anything in this case. > > > > What is happening instead, is something completely different: Turns out that > > to shave off something like 50 nanoseconds, off the deep C-state entry/exit > > latency, some Intel CPU don't preserve LBR stack values over these C-state > > entries. > > Ugh. > > > > Ugh, but it does. On writes to any LBR, including LBR_TOS, KVM creates a "virtual" > > > LBR perf event. KVM then relies on perf to context switch LBR MSRs, i.e. relies > > > on perf to load the guest's values into hardware. At least, I think that's what > > > is supposed to happen. AFAIK, the perf-based LBR support has never been properly > > > document[*]. > > > > > > Anyways, my understanding of intel_pmu_handle_lbr_msrs_access() is that if the > > > vCPU's LBR perf event is scheduled out or can't be created, the guest's value is > > > effectively lost. Again, I don't know the "rules" for the LBR perf event, but > > > it wouldn't suprise me if your CI fails because something in the host conflicts > > > with KVM's LBR perf event. > > > > Actually you are partially wrong here too (although BIOS can be considered > > 'something on the host'). > > > > I was able to prove that the reason why the unit test fails *is* because BIOS > > left LBRs enabled: > > > > First of all, setting LBR bit manually in DEBUG_CTL does trigger this bug > > (I use a different machine now, which doesn't have the bios bug): > > ... > > > ==== Test Assertion Failure ==== > > x86_64/vmx_pmu_caps_test.c:202: r == v > > pid=8415 tid=8415 errno=0 - Success > > 1 0x0000000000404301: __suite_lbr_perf_capabilities at vmx_pmu_caps_test.c:202 > > 2 (inlined by) vmx_pmu_caps_lbr_perf_capabilities at vmx_pmu_caps_test.c:194 > > 3 (inlined by) wrapper_vmx_pmu_caps_lbr_perf_capabilities at vmx_pmu_caps_test.c:194 > > 4 0x000000000040511a: __run_test at kselftest_harness.h:1240 > > 5 0x0000000000402b95: test_harness_run at kselftest_harness.h:1310 > > 6 (inlined by) main at vmx_pmu_caps_test.c:246 > > 7 0x00007f56ba2295cf: ?? ??:0 > > 8 0x00007f56ba22967f: ?? ??:0 > > 9 0x0000000000402e44: _start at ??:? > > Set MSR_LBR_TOS to '0x7', got back '0xc' > > # lbr_perf_capabilities: Test failed > > # FAIL vmx_pmu_caps.lbr_perf_capabilities > > not ok 5 vmx_pmu_caps.lbr_perf_capabilities > > # RUN vmx_pmu_caps.perf_capabilities_unsupported ... > > # OK vmx_pmu_caps.perf_capabilities_unsupported > > ok 6 vmx_pmu_caps.perf_capabilities_unsupported > > # FAILED: 5 / 6 tests passed. > > # Totals: pass:5 fail:1 xfail:0 xpass:0 skip:0 error:0 > > > > Secondary I went over all places in the kernel and all of them take care to > > preserve DEBUG_CTL and only set/clear specific bits. > > > > __intel_pmu_lbr_enable() and __intel_pmu_lbr_enable() are practically the > > only two places where DEBUGCTLMSR_LBR bit is touched, and the test doesn't > > trigger them. Most likely because the test uses special > > 'INTEL_FIXED_VLBR_EVENT' perf event (see intel_pmu_create_guest_lbr_event) > > which is not enabled while in host mode. > > > > To double check this I traced all writes to DEBUG_CTL msr during this test > > and the only write is done during 'guest_wrmsr_perf_capabilities' subtest, by > > vmx_vcpu_run() which just restores the value that the msr had prior to VM > > entry. > > > > So, why the value that BIOS sets survives? Because as I said all code that > > touches DEBUG_CTL takes care to preserve all bits but the bit which is > > changed, LBRs are never enabled on the host, and even the guest entry > > preserves host DEBUG_CTL. Therefore the value written by BIOS survives. > > Well that's rather insane. > > > So we end up with the test writing to LBR_TOS while LBRs are unexpectedly > > enabled, so it's not a surprise that when the test reads back the value > > written, it will differ, and the test will rightfully fail. > > > > Since we have seen this in CI, and you saw it too in your CI, > > Gah, that was bad reporting on my end. The failure we saw was something else > entirely. > > > I think this BIOS bug is not that rare, and so I suggest to stick > > 'wrmsrl(MSR_IA32_DEBUGCTLMSR, 0)' somewhere early in a kernel boot code or at > > least clear the DEBUGCTLMSR_LBR bit. > > > > I haven't found a very good place to put this, in a way that I can be sure > > that x86 maintainers won't reject it, so I am open to your suggestions. > > Compile tested only, but perf's CPU online path seems appropriate, especially > since that path also explicitly clears LBRs. Ensuring LBRs are stopped before > clearing them seems logical. > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c > index 99c590da0ae2..6e898b832d75 100644 > --- a/arch/x86/events/intel/core.c > +++ b/arch/x86/events/intel/core.c > @@ -5030,8 +5030,12 @@ static void intel_pmu_cpu_starting(int cpu) > > init_debug_store_on_cpu(cpu); > /* > - * Deal with CPUs that don't clear their LBRs on power-up. > + * Deal with CPUs that don't clear their LBRs on power-up, and with > + * BIOSes that leave LBRs enabled. > */ > + if (!static_cpu_has(X86_FEATURE_ARCH_LBR) && x86_pmu.lbr_nr) > + msr_clear_bit(MSR_IA32_DEBUGCTLMSR, DEBUGCTLMSR_LBR_BIT); > + > intel_pmu_lbr_reset(); > > cpuc->lbr_sel = NULL; > diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h > index 3ae84c3b8e6d..bb7dd85aa6f2 100644 > --- a/arch/x86/include/asm/msr-index.h > +++ b/arch/x86/include/asm/msr-index.h > @@ -395,7 +395,8 @@ > #define MSR_IA32_PASID_VALID BIT_ULL(31) > > /* DEBUGCTLMSR bits (others vary by model): */ > -#define DEBUGCTLMSR_LBR (1UL << 0) /* last branch recording */ > +#define DEBUGCTLMSR_LBR_BIT 0 > +#define DEBUGCTLMSR_LBR (1UL << DEBUGCTLMSR_LBR_BIT) /* last branch recording */ > #define DEBUGCTLMSR_BTF_SHIFT 1 > #define DEBUGCTLMSR_BTF (1UL << 1) /* single-step on branches */ > #define DEBUGCTLMSR_BUS_LOCK_DETECT (1UL << 2) > I did some simulated test which sets the DEBUGCTLMSR_LBR early in the boot and this patch, and it worked just fine. I agree that intel_pmu_cpu_starting is the best place to put this workaround. You might consider refactoring the code that deals with LBR setup into a function, like init_debug_store_on_cpu, maybe something like init_lbrs_on_cpu, but I don't mind that, this patch as-is, is fine as well. If I get my hands on the machine where this originally failed, I'll test there, although most likely this just a formality. Reviewed-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx> Best regards, Maxim Levitsky