On Mon, Jul 20, 2020 at 12:08:05PM -0700, Jacob Xu wrote:
> Hi all,
>
> I wrote a kvm selftest to enter a 1-vcpu guest VM running a nop in one
> thread and check the VM's debugfs entry for # of tlb flushes while that
> was running. Installing the latest upstream kernel and running this on
> an intel host showed a tlb_flush count of 30, while running it on an
> amd host shows the tlb_flush count at 0.
>
> Do we have an inconsistency between Intel and AMD in how VCPU_STAT is
> incremented?

Yes, you can even drop "between Intel and AMD" and just state "we have
inconsistencies in how VCPU_STAT is incremented".

> From browsing the code, we see that the stat only gets incremented in
> the kvm_ wrappers of the x86_ops functions tlb_flush_all,
> tlb_flush_current, and tlb_flush_guest. These wrappers are only called
> via the KVM request API (referred to as deferred flushes in some other
> threads), and there are other instances of calling the x86_ops
> tlb_flush methods directly (immediate flush).
>
> It looks like most of the tlb flush calls are deferred, but there are a
> few instances using the immediate flush where it's not counted
> (kvm_mmu_load, svm_set_cr4, vmx_set_apic_access_page,
> nested_prepare_vmcb_control). Is there a guideline on when to use a
> deferred vs. immediate tlb_flush?

The rule of thumb is to defer the flush whenever possible so that KVM
doesn't flush multiple times in a single VM-Exit. Of the paths listed
above, svm_set_cr4() is the only one whose flush can be deferred;
kvm_mmu_load() and vmx_set_apic_access_page() are reached after
KVM_REQ_TLB_FLUSH_CURRENT has already been processed in
vcpu_enter_guest(), so a flush requested from those paths wouldn't be
serviced before entering the guest. The VMX APIC flush could be
deferred by hoisting KVM_REQ_APIC_PAGE_RELOAD up. I think that's safe?
But it's a very infrequent operation so I'm not exactly chomping at the
bit to get it fixed.

> Could this be a cause for the lower tlb_flush stat seen on an AMD host?
> Or perhaps there's another reason for the difference due to the (too)
> simple selftest?

Yes, but it's likely not due to any of the paths listed above.
Obviously accounting the flush in kvm_mmu_load() will elevate the count,
but it will affect VMX and SVM identically. Given that you see 0 on SVM
and a low number on VMX, my money is on the difference being that VMX
accounts the TLB flush that occurs on vCPU migration:
vmx_vcpu_load_vmcs() makes a KVM_REQ_TLB_FLUSH request, whereas
svm_vcpu_load() resets asid_generation but doesn't increment the stat.

> In the case of svm_tlb_flush, it seems like the tlb flush is deferred
> anyway, since setting a tlb flush control bit in the VMCB is not acted
> upon until entering the guest. So it seems we could count tlb flushes
> on svm more easily by checking the control bit before KVM_RUN and
> incrementing the counter there. Though perhaps there's another case
> we'd like to count as a tlb flush: when the guest switches ASID (where
> would we track this?).
>
> Would switching to this alternative for incrementing the tlb_flush stat
> in svm be much different than what we do right now?

I think a better option is to keep the current accounting, defer flushes
when possible to naturally fix accounting, and then fix the remaining
one-off cases, e.g. kvm_mmu_load() and svm_vcpu_load(). I can prep a
small series unless you want the honors?
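
To make the deferred vs. immediate distinction concrete, here's a rough
sketch of why the immediate callers never bump the stat.
kvm_vcpu_flush_tlb_current() is paraphrased from x86.c; the two callers
below are made-up names purely for illustration, not real KVM functions:

/* The stat is only bumped in the kvm_ wrapper around the x86_ops hook. */
static void kvm_vcpu_flush_tlb_current(struct kvm_vcpu *vcpu)
{
        ++vcpu->stat.tlb_flush;
        kvm_x86_ops.tlb_flush_current(vcpu);
}

/* Deferred: the wrapper runs (and counts) when vcpu_enter_guest()
 * services the request.
 */
static void example_deferred_flush(struct kvm_vcpu *vcpu)
{
        kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
}

/* Immediate: calls the hook directly, so the TLB gets flushed but the
 * stat never moves.
 */
static void example_immediate_flush(struct kvm_vcpu *vcpu)
{
        kvm_x86_ops.tlb_flush_current(vcpu);
}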
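
And a hand-wavy illustration of where I suspect your 30 vs. 0 delta
comes from; these are heavily trimmed paraphrases of the two vcpu_load
paths, not the real signatures or bodies:

/* VMX: the flush on vCPU migration goes through the request API, so it
 * lands in the kvm_ wrapper and shows up in the stat.
 */
static void vmx_vcpu_load_vmcs_trimmed(struct kvm_vcpu *vcpu, int cpu)
{
        if (cpu != vcpu->cpu)
                kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
}

/* SVM: the "flush" on migration is just a new ASID; nothing touches
 * vcpu->stat.tlb_flush.
 */
static void svm_vcpu_load_trimmed(struct kvm_vcpu *vcpu, int cpu)
{
        if (unlikely(cpu != vcpu->cpu))
                to_svm(vcpu)->asid_generation = 0;
}

The svm_vcpu_load() part of the series would presumably be as simple as
either bumping the stat there or routing the migration flush through
KVM_REQ_TLB_FLUSH the same way VMX does.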