On Tue, Sep 05, 2023 at 01:18:23PM -0700, Sean Christopherson wrote: > On Mon, Sep 04, 2023, Yan Zhao wrote: > > ... > > > > Actually, I don't even completely understand how you're seeing CoW behavior in > > > > the first place. No sane guest should blindly read (or execute) uninitialized > > > > memory. IIUC, you're not running a Windows guest, and even if you are, AFAIK > > > > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has > > > > been zeroed by the hypervisor. If KSM is to blame, then my answer it to turn off > > > > KSM, because turning on KSM is antithetical to guest performance (not to mention > > > > that KSM is wildly insecure for the guest, especially given the number of speculative > > > > execution attacks these days). > > > I'm running a linux guest. > > > KSM is not turned on both in guest and host. > > > Both guest and host have turned on transparent huge page. > > > > > > The guest first reads a GFN in a writable memslot (which is for "pc.ram"), > > > which will cause > > > (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn > > > mapped. > > > (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable, > > > which will fail > > > > > > The guest then writes the GFN. > > > This step will trigger (huge pmd split for huge page case) and .change_pte(). > > > > > > My guest is surely a sane guest. But currently I can't find out why > > > certain pages are read before write. > > > Will return back to you the reason after figuring it out after my long vacation. > > Finally I figured out the reason. > > > > Except 4 pages were read before written from vBIOS (I just want to skip finding > > out why vBIOS does this), the remaining thousands of pages were read before > > written from the guest Linux kernel. > > > > If the guest kernel were configured with "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" or > > "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", or booted with param "init_on_alloc=1" or > > "init_on_free=1", this read before written problem goes away. > > > > However, turning on those configs has side effects as said in kernel config > > message: > > "all page allocator and slab allocator memory will be zeroed when allocated, > > eliminating many kinds of "uninitialized heap memory" flaws, especially > > heap content exposures. The performance impact varies by workload, but most > > cases see <1% impact. Some synthetic workloads have measured as high as 7%." > > > > If without the above two configs, or if with init_on_alloc=0 && init_on_free=0, > > the root cause for all the reads of uninitialized heap memory are related to > > Yeah, forcing the guest to pre-initialize all memory is a hack-a-fix and not a > real solution. > > > page cache pages of the guest virtual devices (specifically the virtual IDE > > device in my case). > > Why are you using IDE? IDE is comically slow compared to VirtIO, and VirtIO has > been broadly supported for something like 15 years, even on Windows. I don't know why I'm using IDE. Maybe just because it's of my default settings for years :) And I sent this series was just because I think each guest write in this case has to cause two EPT violations is wasted. BTW, not only for IDE devices, I think any devices with DMA mask less than max GPA width will cause the same problem. And also true when "swiotlb=force" is enabled. > > > The reason for this unconditional read of page into bounce buffer > > (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)") > > is explained in the code: > > > > /* > > * When dir == DMA_FROM_DEVICE we could omit the copy from the orig > > * to the tlb buffer, if we knew for sure the device will > > * overwrite the entire current content. But we don't. Thus > > * unconditional bounce may prevent leaking swiotlb content (i.e. > > * kernel memory) to user-space. > > */ > > > > If we neglect this risk and do changes like > > - swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); > > + if (dir != DMA_FROM_DEVICE) > > + swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); > > > > the issue of pages read before written from guest kernel just went away. > > > > I don't think it's a swiotlb bug, because to prevent leaking swiotlb > > content, if target page content is not copied firstly to the swiotlb's > > bounce buffer, then the bounce buffer needs to be initialized to 0. > > However, swiotlb_tbl_map_single() does not know whether the target page > > is initialized or not. Then, it would cause page content to be trimmed > > if device does not overwrite the entire memory. > > > > > > > > > > > > > If there's something else going on, i.e. if your VM really is somehow generating > > > > reads before writes, and if we really want to optimize use cases that can't use > > > > hugepages for whatever reason, I would much prefer to do something like add a > > > > memslot flag to state that the memslot should *always* be mapped writable. Because > > > Will check if this flag is necessary after figuring out the reason. > > As explained above, I think it's a valid and non-rare practice in guest kernel to > > cause read of uninitialized heap memory. > > Heh, for some definitions of valid. > > > And the host admin may not know exactly when it's appropriate to apply the > > memslot flag. > > Yeah, a memslot flag is too fine-grained. > > > Do you think it's good to make the "always write_fault = true" solution enabled > > by default? > > Sadly, probably not, because that would regress setups that do want to utilize > CoW, e.g. I'm pretty sure requesting everything to be writable would be a big > negative for KSM. > > I do think we should add a KVM knob though. Regardless of the validity or frequency > of the guest behavior, and even though userspace can also workaround this by > preallocating guest memory, I am struggling to think of any reason outside of KSM > where CoW semantics are desirable. > > Ooh, actually, maybe we could do > > static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM); > > and then cross our fingers that that doesn't regress some other funky setups. Oh, this "always write_fault" solution may be not friendly to UFFD write protected pages too.