Re: [PATCH 0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Sep 05, 2023 at 01:18:23PM -0700, Sean Christopherson wrote:
> On Mon, Sep 04, 2023, Yan Zhao wrote:
> > ...
> > > > Actually, I don't even completely understand how you're seeing CoW behavior in
> > > > the first place.  No sane guest should blindly read (or execute) uninitialized
> > > > memory.  IIUC, you're not running a Windows guest, and even if you are, AFAIK
> > > > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has
> > > > been zeroed by the hypervisor.  If KSM is to blame, then my answer it to turn off
> > > > KSM, because turning on KSM is antithetical to guest performance (not to mention
> > > > that KSM is wildly insecure for the guest, especially given the number of speculative
> > > > execution attacks these days).
> > > I'm running a linux guest.
> > > KSM is not turned on both in guest and host.
> > > Both guest and host have turned on transparent huge page.
> > > 
> > > The guest first reads a GFN in a writable memslot (which is for "pc.ram"),
> > > which will cause
> > >     (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn
> > >         mapped.
> > >     (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable,
> > >         which will fail
> > > 
> > > The guest then writes the GFN.
> > > This step will trigger (huge pmd split for huge page case) and .change_pte().
> > > 
> > > My guest is surely a sane guest. But currently I can't find out why
> > > certain pages are read before write.
> > > Will return back to you the reason after figuring it out after my long vacation.
> > Finally I figured out the reason.
> > 
> > Except 4 pages were read before written from vBIOS (I just want to skip finding
> > out why vBIOS does this), the remaining thousands of pages were read before
> > written from the guest Linux kernel.
> > 
> > If the guest kernel were configured with "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" or
> > "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", or booted with param "init_on_alloc=1" or
> > "init_on_free=1", this read before written problem goes away.
> > 
> > However, turning on those configs has side effects as said in kernel config
> > message:
> > "all page allocator and slab allocator memory will be zeroed when allocated,
> > eliminating many kinds of "uninitialized heap memory" flaws, especially
> > heap content exposures. The performance impact varies by workload, but most
> > cases see <1% impact. Some synthetic workloads have measured as high as 7%."
> > 
> > If without the above two configs, or if with init_on_alloc=0 && init_on_free=0,
> > the root cause for all the reads of uninitialized heap memory are related to
> 
> Yeah, forcing the guest to pre-initialize all memory is a hack-a-fix and not a
> real solution.
> 
> > page cache pages of the guest virtual devices (specifically the virtual IDE
> > device in my case).
> 
> Why are you using IDE?  IDE is comically slow compared to VirtIO, and VirtIO has
> been broadly supported for something like 15 years, even on Windows.

I don't know why I'm using IDE.
Maybe just because it's of my default settings for years :)

And I sent this series was just because I think each guest write in this case has
to cause two EPT violations is wasted.

BTW, not only for IDE devices, I think any devices with DMA mask less than max GPA
width will cause the same problem. And also true when "swiotlb=force" is enabled.

> 
> > The reason for this unconditional read of page into bounce buffer
> > (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)")
> > is explained in the code:
> > 
> > /*
> >  * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
> >  * to the tlb buffer, if we knew for sure the device will
> >  * overwrite the entire current content. But we don't. Thus
> >  * unconditional bounce may prevent leaking swiotlb content (i.e.
> >  * kernel memory) to user-space.
> >  */
> > 
> > If we neglect this risk and do changes like
> > -       swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> > +       if (dir != DMA_FROM_DEVICE)
> > +               swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> > 
> > the issue of pages read before written from guest kernel just went away.
> > 
> > I don't think it's a swiotlb bug, because to prevent leaking swiotlb
> > content, if target page content is not copied firstly to the swiotlb's
> > bounce buffer, then the bounce buffer needs to be initialized to 0.
> > However, swiotlb_tbl_map_single() does not know whether the target page
> > is initialized or not. Then, it would cause page content to be trimmed
> > if device does not overwrite the entire memory.
> > 
> > > 
> > > > 
> > > > If there's something else going on, i.e. if your VM really is somehow generating
> > > > reads before writes, and if we really want to optimize use cases that can't use
> > > > hugepages for whatever reason, I would much prefer to do something like add a
> > > > memslot flag to state that the memslot should *always* be mapped writable.  Because
> > > Will check if this flag is necessary after figuring out the reason.
> > As explained above, I think it's a valid and non-rare practice in guest kernel to
> > cause read of uninitialized heap memory.
> 
> Heh, for some definitions of valid.  
> 
> > And the host admin may not know exactly when it's appropriate to apply the
> > memslot flag.
> 
> Yeah, a memslot flag is too fine-grained.
> 
> > Do you think it's good to make the "always write_fault = true" solution enabled
> > by default?
> 
> Sadly, probably not, because that would regress setups that do want to utilize
> CoW, e.g. I'm pretty sure requesting everything to be writable would be a big
> negative for KSM.
> 
> I do think we should add a KVM knob though.  Regardless of the validity or frequency
> of the guest behavior, and even though userspace can also workaround this by
> preallocating guest memory, I am struggling to think of any reason outside of KSM
> where CoW semantics are desirable.
> 
> Ooh, actually, maybe we could do
> 
> 	static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM);
> 
> and then cross our fingers that that doesn't regress some other funky setups.
Oh, this "always write_fault" solution may be not friendly to UFFD write protected pages too.



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux