On Thu, Aug 17, 2023 at 10:53:25AM -0700, Sean Christopherson wrote: ... > > What I observed is that each vCPU write to a COW page in primary MMU > > will lead to twice TDP page faults. > > Then, I just update the secondary MMU during the first TDP page fault > > to avoid the second one. > > It's not a blind prefetch (I checked the vCPU to ensure it's triggered > > by a vCPU operation as much as possible) > > Yes, that's part of the complexity I don't like. > > > and it can benefit guests who doesn't explicitly request a prefault memory as > > write. > > Yes, I'm arguing that the benefit isn't significant, and that the use cases it > might benefit aren't things people care about optimizing. > > I'm very skeptical that shaving those 8000 VM-Exits will translate to a meaningful > reduction in guest boot time, let alone scale beyond very specific scenarios and > configurations, which again, are likely suboptimal in nature. Actually, they most > definitely are suboptimal, because the fact that this provides any benefit > whatsoever means that either your VM isn't being backed with hugepages, or it's > being backed with THP and transparent_hugepage/use_zero_page is enabled (and thus > is generating CoW behavior). Yes, it's being backed with THP and transparent_hugepage/use_zero_page is enabled. > > Enabling THP or using HugeTLB (which again can be done on a subset of guest memory) > will have a far, far bigger impact on guest performance. Ditto for disabling > using the huge zero_page when backing VMs with THP (any page touched by the guest > is all but guaranteed to be written sooner than later, so using the zero_page > doesn't make a whole lot of sense). > > E.g. a single CoW operation will take mmu_lock for write three times: > invalidate_range_start(), change_pte(), and invalidate_range_end(), not to mention > the THP zero_page CoW will first fault-in a read-only mapping, then split that > mapping, and then do CoW on the 4KiB PTEs, which is *really* suboptimal. > > Actually, I don't even completely understand how you're seeing CoW behavior in > the first place. No sane guest should blindly read (or execute) uninitialized > memory. IIUC, you're not running a Windows guest, and even if you are, AFAIK > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has > been zeroed by the hypervisor. If KSM is to blame, then my answer it to turn off > KSM, because turning on KSM is antithetical to guest performance (not to mention > that KSM is wildly insecure for the guest, especially given the number of speculative > execution attacks these days). I'm running a linux guest. KSM is not turned on both in guest and host. Both guest and host have turned on transparent huge page. The guest first reads a GFN in a writable memslot (which is for "pc.ram"), which will cause (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn mapped. (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable, which will fail The guest then writes the GFN. This step will trigger (huge pmd split for huge page case) and .change_pte(). My guest is surely a sane guest. But currently I can't find out why certain pages are read before write. Will return back to you the reason after figuring it out after my long vacation. > > If there's something else going on, i.e. if your VM really is somehow generating > reads before writes, and if we really want to optimize use cases that can't use > hugepages for whatever reason, I would much prefer to do something like add a > memslot flag to state that the memslot should *always* be mapped writable. Because Will check if this flag is necessary after figuring out the reason. > outside of setups that use KSM, the only reason I can think of to not map memory > writable straightaway is if userspace somehow knows the guest isn't going to write > that memory. > > If it weren't for KSM, and if it wouldn't potentially be a breaking change, I > would even go so far as to say that KVM should always map writable memslots as > writable in the guest. > > E.g. minus the uAPI, this is a lot simpler to implement and maintain. > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index dfbaafbe3a00..6c4640483881 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -2727,10 +2727,14 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn, > return KVM_PFN_NOSLOT; > } > > - /* Do not map writable pfn in the readonly memslot. */ > - if (writable && memslot_is_readonly(slot)) { > - *writable = false; > - writable = NULL; > + if (writable) { > + if (memslot_is_readonly(slot)) { > + *writable = false; > + writable = NULL; > + } else if (memslot_is_always_writable(slot)) { > + *writable = true; > + write_fault = true; > + } > } > > return hva_to_pfn(addr, atomic, interruptible, async, write_fault, > > > And FWIW, removing .change_pte() entirely, even without any other optimizations, > will also benefit those guests, as it will remove a source of mmu_lock contention > along with all of the overhead of invoking callbacks, walking memslots, etc. And > removing .change_pte() will benefit *all* guests by eliminating unrelated callbacks, > i.e. callbacks when memory for the VMM takes a CoW fault. > If with above "always write_fault = true" solution, I think it's better. > So yeah, unless I'm misunderstanding the bigger picture, the more I look at this, > the more I'm against it.