On 12/10/2018 18:30, Andi Kleen wrote:
>> 4. Results
>>     - Without this optimization, the guest pmi handling time is
>>       ~4500000 ns, and the max sampling rate is reduced to 250.
>>     - With this optimization, the guest pmi handling time is ~9000 ns
>>       (i.e. 1 / 500 of the non-optimization case), and the max sampling
>>       rate remains at the original 100000.
>
> Impressive performance improvement!

Agreed!

> It's not clear to me why you're special casing PMIs here. The optimization
> should work generically, right?

Yeah, you can even just check if the counter is in the struct
cpu_hw_events guest mask, and if so always write the counter MSR directly.

>> @@ -237,9 +267,23 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>      default:
>>          if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
>>              (pmc = get_fixed_pmc(pmu, msr))) {
>> -            if (!msr_info->host_initiated)
>> -                data = (s64)(s32)data;
>> -            pmc->counter += data - pmc_read_counter(pmc);
>> +            if (pmu->in_pmi) {
>> +                /*
>> +                 * Since we are not re-allocating a perf event
>> +                 * to reconfigure the sampling time when the
>> +                 * guest pmu is in PMI, just set the value to
>> +                 * the hardware perf counter. Counting will
>> +                 * continue after the guest enables the
>> +                 * counter bit in MSR_CORE_PERF_GLOBAL_CTRL.
>> +                 */
>> +                struct hw_perf_event *hwc =
>> +                        &pmc->perf_event->hw;
>> +                wrmsrl(hwc->event_base, data);
>
> Is that guaranteed to be always called on the right CPU that will run the vcpu?
>
> AFAIK there's an ioctl to set MSRs in the guest from qemu, I'm pretty sure
> it won't handle that.

How much of the performance improvement comes from here? In theory
pmc_read_counter() should always hit a relatively fast path, because the
smp_call_function_single in perf_event_read doesn't need an IPI.

In any case, this should be a separate patch.

Paolo

> May need to be delayed to entry time.
>
> -Andi
>