On Tue, 2024-09-17 at 11:22 +0530, Sandipan Das wrote:
> On 9/17/2024 2:11 AM, dongli.zhang@xxxxxxxxxx wrote:
> > On 9/16/24 11:54 AM, Maxim Levitsky wrote:
> > > Hi!
> > >
> > > We recently saw a failure in one of the aws VM instances that causes the following error during the guest boot:
> > >
> > > [ 0.480051] unchecked MSR access error: WRMSR to 0xc0000302 (tried to write 0x040000000000001f) at rIP: 0xffffffff96c093e2 (amd_pmu_cpu_reset.constprop.0+0x42/0x80)
> > >
> > >
> > > I investigated the issue and I see that the hypervisor does expose PerfmonV2, but not the LBRv2 support:
> > >
> > > # cpuid -1 -l 0x80000022
> > > CPU:
> > >    Extended Performance Monitoring and Debugging (0x80000022):
> > >       AMD performance monitoring V2 = true
> > >       AMD LBR V2 = false
> > >       AMD LBR stack & PMC freezing = false
> > >       number of core perf ctrs = 0x5 (5)
> > >       number of LBR stack entries = 0x0 (0)
> > >       number of avail Northbridge perf ctrs = 0x0 (0)
> > >       number of available UMC PMCs = 0x0 (0)
> > >       active UMCs bitmask = 0x0
> > >
>
> That's expected. LBRv2 is currently not available to KVM guests. However, PerfMonV2 should be the
> only feature bit required to indicate the availability of MSRs 0xc0000300..0xc0000303
>
> > > I also verified that I can write 0x1f to 0xc0000302 but not 0x040000000000001f:
> > >
> > > # wrmsr 0xc0000302 0x1f
> > > # wrmsr 0xc0000302 0x040000000000001f
> > > wrmsr: CPU 0 cannot set MSR 0xc0000302 to 0x040000000000001f
> > > #
> > >
> > > The AMD's APM is not clear on what should happen if unsupported bits are attempted to be cleared
> > > using this MSR.
> > >
> > > Also I noticed that amd_pmu_v2_handle_irq writes 0xffffffffffffffff to this msr.
> > > It has the following code:
> > >
> > >
> > >         WARN_ON(status > 0);
> > >
> > >         /* Clear overflow and freeze bits */
> > >         amd_pmu_ack_global_status(~status);
> > >
> > >
> > > This implies that it is OK to set all bits in this MSR.
> > >
>
> It is, but writes to the reserved bits are ignored.
>
> > To share my data point on QEMU+KVM: I am not able to reproduce with the most
> > recent QEMU (not AWS) + below patch.
> >
> > [PATCH v2 2/4] i386/cpu: Add PerfMonV2 feature bit
> > https://lore.kernel.org/all/69905b486218f8287b9703d1a9001175d04c2f02.1723068946.git.babu.moger@xxxxxxx/
> >
> > Both my VM and KVM are 6.10.
> >
> > vm# cpuid -1 -l 0x80000022
> > CPU:
> >    Extended Performance Monitoring and Debugging (0x80000022):
> >       AMD performance monitoring V2 = true
> >       AMD LBR V2 = false
> >       AMD LBR stack & PMC freezing = false
> >       number of core perf ctrs = 0x6 (6)
> >       number of LBR stack entries = 0x0 (0)
> >       number of avail Northbridge perf ctrs = 0x0 (0)
> >       number of available UMC PMCs = 0x0 (0)
> >       active UMCs bitmask = 0x0
> >
> >
> > Both writes are passed.
> >
> > vm# wrmsr 0xc0000302 0x1f
> > vm# wrmsr 0xc0000302 0x040000000000001f
> >
> > Here is bcc output. Both writes are good.
> >
> > kvm# /usr/share/bcc/tools/trace -t -C 'kvm_pmu_set_msr "%x", retval'
> > ... ...
> > 4.748614 19 43545 43550 CPU 0/KVM kvm_pmu_set_msr 0
> > 10.97396 19 43545 43550 CPU 0/KVM kvm_pmu_set_msr 0
> >
>
> Thanks for testing. I cannot replicate this either with an upstream kernel.

Hi,

I also tested on bare metal Zen4 system just now, and I also see that MSR 0xc0000302 can be set to any value.

So this is a hypervisor bug, I'll report it to AWS.

Best regards,
	Maxim Levitsky

>
> - Sandipan
>
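
For anyone reading the archive later, here is a small sketch (not part of the original exchange) that decodes the value from the failing WRMSR. It only does bit arithmetic on numbers quoted above: the counter count of 5 comes from the AWS guest's CPUID output, and the names `FAILING_VALUE`, `NUM_CORE_COUNTERS` and `set_bits` are made up for illustration. Reading bit 58 as the LBR freeze bit is an assumption based on the "Clear overflow and freeze bits" comment in the quoted kernel code; the APM should be checked for the authoritative layout.

```python
# Illustrative only: decode the value the guest kernel tried to write
# to MSR 0xc0000302 in the error message quoted in this thread.
FAILING_VALUE = 0x040000000000001F
NUM_CORE_COUNTERS = 5  # "number of core perf ctrs" reported by the AWS guest


def set_bits(value):
    """Return the positions of all set bits in a 64-bit value, lowest first."""
    return [bit for bit in range(64) if (value >> bit) & 1]


# One overflow bit per core counter: bits 0..4 for five counters.
counter_mask = (1 << NUM_CORE_COUNTERS) - 1

# Whatever is set outside the counter bits is what the hypervisor rejected.
extra_bits = set_bits(FAILING_VALUE & ~counter_mask)

print(hex(counter_mask))         # 0x1f
print(set_bits(FAILING_VALUE))   # [0, 1, 2, 3, 4, 58]
print(extra_bits)                # [58]
```

So the only difference between the write that succeeds (0x1f) and the one that fails is bit 58, which fits the observation that the affected hypervisor faults on that bit while bare metal and upstream KVM accept the write.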