On Wed, Sep 27, 2023 at 9:10 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > On Tue, Sep 26, 2023, Xin Li wrote: > > On 9/26/2023 9:15 PM, Mingwei Zhang wrote: > > > ah, typo in the subject: The 2nd KVM_REQ_NMI should be KVM_REQ_PMI. > > > Sorry about that. > > > > > > On Tue, Sep 26, 2023 at 9:09 PM Mingwei Zhang <mizhang@xxxxxxxxxx> wrote: > > > > > > > > Move kvm_check_request(KVM_REQ_NMI) after kvm_check_request(KVM_REQ_NMI). > > > > Please remove it, no need to repeat the subject. > > Heh, from Documentation/process/maintainer-kvm-x86.rst: > > Changelog > ~~~~~~~~~ > Most importantly, write changelogs using imperative mood and avoid pronouns. > > See :ref:`describe_changes` for more information, with one amendment: lead with > a short blurb on the actual changes, and then follow up with the context and > background. Note! This order directly conflicts with the tip tree's preferred > approach! Please follow the tip tree's preferred style when sending patches > that primarily target arch/x86 code that is _NOT_ KVM code. > > That said, I do prefer that the changelog intro isn't just a copy+paste of the > shortlog, and the shortlog and changelog should use conversational language instead > of describing the literal code movement. > > > > > When vPMU is active use, processing each KVM_REQ_PMI will generate a > > This is not guaranteed. > > > > > KVM_REQ_NMI. Existing control flow after KVM_REQ_PMI finished will fail the > > > > guest enter, jump to kvm_x86_cancel_injection(), and re-enter > > > > vcpu_enter_guest(), this wasted lot of cycles and increase the overhead for > > > > vPMU as well as the virtualization. > > As above, use conversational language, the changelog isn't meant to be a play-by-play. > > E.g. > > KVM: x86: Service NMI requests *after* PMI requests in VM-Enter path > > Move the handling of NMI requests after PMI requests in the VM-Enter path > so that KVM doesn't need to cancel and redo VM-Enter in the likely > scenario that the vCPU has configured its LVPTC entry to generate an NMI. > > Because APIC emulation "injects" NMIs via KVM_REQ_NMI, handling PMI > requests after NMI requests means KVM won't detect the pending NMI request > until the final check for outstanding requests. Detecting requests at the > final stage is costly as KVM has already loaded guest state, potentially > queued events for injection, disabled IRQs, dropped SRCU, etc., most of > which needs to be unwound. > > > Optimization is after correctness, so please explain if this is correct > > first! > > Not first. Leading with an in-depth description of KVM requests and NMI handling > is not going to help understand *why* this change is being first. But I do agree > that this should provide an analysis of why it's ok to swap the order, specificially > why it's architecturally ok if KVM drops an NMI due to the swapped ordering, e.g. > if the PMI is coincident with two other NMIs (or one other NMI and NMIs are blocked). > > > > > So move the code snippet of kvm_check_request(KVM_REQ_NMI) to make KVM > > > > runloop more efficient with vPMU. > > > > > > > > To evaluate the effectiveness of this change, we launch a 8-vcpu QEMU VM on > > Avoid pronouns. There's no need for all the "fluff", just state the setup, the > test, and the results. > > Really getting into the nits, but the whole "8-vcpu QEMU VM" versus > "the setup of using single core, single thread" is confusing IMO. If there were > potential performance downsides and/or tradeoffs, then getting the gory details > might be necessary, but that's not the case here, and if it were really necessary > to drill down that deep, then I would want to better quantify the impact, e.g. in > terms latency. > > E.g. on Intel SPR running SPEC2017 benchmark and Intel vtune in the guest, > handling PMI requests before NMI requests reduces the number of canceled > runs by ~1500 per second, per vCPU (counted by probing calls to > vmx_cancel_injection()). > > > > > an Intel SPR CPU. In the VM, we run perf with all 48 events Intel vtune > > > > uses. In addition, we use SPEC2017 benchmark programs as the workload with > > > > the setup of using single core, single thread. > > > > > > > > At the host level, we probe the invocations to vmx_cancel_injection() with > > > > the following command: > > > > > > > > $ perf probe -a vmx_cancel_injection > > > > $ perf stat -a -e probe:vmx_cancel_injection -I 10000 # per 10 seconds > > > > > > > > The following is the result that we collected at beginning of the spec2017 > > > > benchmark run (so mostly for 500.perlbench_r in spec2017). Kindly forgive > > > > the incompleteness. > > > > > > > > On kernel without the change: > > > > 10.010018010 14254 probe:vmx_cancel_injection > > > > 20.037646388 15207 probe:vmx_cancel_injection > > > > 30.078739816 15261 probe:vmx_cancel_injection > > > > 40.114033258 15085 probe:vmx_cancel_injection > > > > 50.149297460 15112 probe:vmx_cancel_injection > > > > 60.185103088 15104 probe:vmx_cancel_injection > > > > > > > > On kernel with the change: > > > > 10.003595390 40 probe:vmx_cancel_injection > > > > 20.017855682 31 probe:vmx_cancel_injection > > > > 30.028355883 34 probe:vmx_cancel_injection > > > > 40.038686298 31 probe:vmx_cancel_injection > > > > 50.048795162 20 probe:vmx_cancel_injection > > > > 60.069057747 19 probe:vmx_cancel_injection > > > > > > > > From the above, it is clear that we save 1500 invocations per vcpu per > > > > second to vmx_cancel_injection() for workloads like perlbench. > > Nit, this really should have: > > Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx> > > I personally don't care about the attribution, but (a) others often do care and > (b) the added context is helpful. E.g. for bad/questionable suggestsions/ideas, > knowing that person X was also involved helps direct and/or curate questions/comments > accordingly. For sure! I will also pay more attention to that in the future. Thanks. -Mingwei