On Mon, Sep 04, 2023, Yan Zhao wrote: > ... > > > Actually, I don't even completely understand how you're seeing CoW behavior in > > > the first place. No sane guest should blindly read (or execute) uninitialized > > > memory. IIUC, you're not running a Windows guest, and even if you are, AFAIK > > > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has > > > been zeroed by the hypervisor. If KSM is to blame, then my answer it to turn off > > > KSM, because turning on KSM is antithetical to guest performance (not to mention > > > that KSM is wildly insecure for the guest, especially given the number of speculative > > > execution attacks these days). > > I'm running a linux guest. > > KSM is not turned on both in guest and host. > > Both guest and host have turned on transparent huge page. > > > > The guest first reads a GFN in a writable memslot (which is for "pc.ram"), > > which will cause > > (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn > > mapped. > > (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable, > > which will fail > > > > The guest then writes the GFN. > > This step will trigger (huge pmd split for huge page case) and .change_pte(). > > > > My guest is surely a sane guest. But currently I can't find out why > > certain pages are read before write. > > Will return back to you the reason after figuring it out after my long vacation. > Finally I figured out the reason. > > Except 4 pages were read before written from vBIOS (I just want to skip finding > out why vBIOS does this), the remaining thousands of pages were read before > written from the guest Linux kernel. > > If the guest kernel were configured with "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" or > "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", or booted with param "init_on_alloc=1" or > "init_on_free=1", this read before written problem goes away. > > However, turning on those configs has side effects as said in kernel config > message: > "all page allocator and slab allocator memory will be zeroed when allocated, > eliminating many kinds of "uninitialized heap memory" flaws, especially > heap content exposures. The performance impact varies by workload, but most > cases see <1% impact. Some synthetic workloads have measured as high as 7%." > > If without the above two configs, or if with init_on_alloc=0 && init_on_free=0, > the root cause for all the reads of uninitialized heap memory are related to Yeah, forcing the guest to pre-initialize all memory is a hack-a-fix and not a real solution. > page cache pages of the guest virtual devices (specifically the virtual IDE > device in my case). Why are you using IDE? IDE is comically slow compared to VirtIO, and VirtIO has been broadly supported for something like 15 years, even on Windows. > The reason for this unconditional read of page into bounce buffer > (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)") > is explained in the code: > > /* > * When dir == DMA_FROM_DEVICE we could omit the copy from the orig > * to the tlb buffer, if we knew for sure the device will > * overwrite the entire current content. But we don't. Thus > * unconditional bounce may prevent leaking swiotlb content (i.e. > * kernel memory) to user-space. > */ > > If we neglect this risk and do changes like > - swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); > + if (dir != DMA_FROM_DEVICE) > + swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE); > > the issue of pages read before written from guest kernel just went away. > > I don't think it's a swiotlb bug, because to prevent leaking swiotlb > content, if target page content is not copied firstly to the swiotlb's > bounce buffer, then the bounce buffer needs to be initialized to 0. > However, swiotlb_tbl_map_single() does not know whether the target page > is initialized or not. Then, it would cause page content to be trimmed > if device does not overwrite the entire memory. > > > > > > > > > If there's something else going on, i.e. if your VM really is somehow generating > > > reads before writes, and if we really want to optimize use cases that can't use > > > hugepages for whatever reason, I would much prefer to do something like add a > > > memslot flag to state that the memslot should *always* be mapped writable. Because > > Will check if this flag is necessary after figuring out the reason. > As explained above, I think it's a valid and non-rare practice in guest kernel to > cause read of uninitialized heap memory. Heh, for some definitions of valid. > And the host admin may not know exactly when it's appropriate to apply the > memslot flag. Yeah, a memslot flag is too fine-grained. > Do you think it's good to make the "always write_fault = true" solution enabled > by default? Sadly, probably not, because that would regress setups that do want to utilize CoW, e.g. I'm pretty sure requesting everything to be writable would be a big negative for KSM. I do think we should add a KVM knob though. Regardless of the validity or frequency of the guest behavior, and even though userspace can also workaround this by preallocating guest memory, I am struggling to think of any reason outside of KSM where CoW semantics are desirable. Ooh, actually, maybe we could do static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM); and then cross our fingers that that doesn't regress some other funky setups.