On Thu, Sep 05, 2024 at 05:43:17PM +0800, Yan Zhao wrote:
> On Wed, Sep 04, 2024 at 05:41:06PM -0700, Sean Christopherson wrote:
> > On Wed, Sep 04, 2024, Yan Zhao wrote:
> > > On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote:
> > > > On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote:
> > > > > Sean Christopherson <seanjc@xxxxxxxxxx> writes:
> > > > >
> > > > > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
> > > > > >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> > > > > >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> > > > > >> Silver 4410Y".
> > > > > >
> > > > > > Has this been reproduced on any other hardware besides SPR? I.e. did we stumble
> > > > > > on another hardware issue?
> > > > >
> > > > > Very possible, as according to Yan Zhao this doesn't reproduce on at
> > > > > least "Coffee Lake-S". Let me try to grab some random hardware around
> > > > > and I'll be back with my observations.
> > > >
> > > > Update some new findings from my side:
> > > >
> > > > BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range
> > > > from 0xfd000000 to 0xfe000000.
> > > >
> > > > On "Sapphire Rapids XCC":
> > > >
> > > > 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch
> > > >    correctly.
> > > >    i.e.
> > > >    if (gfn >= 0xfd000 && gfn < 0xfe000) {
> > > >        return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> > > >    }
> > > >    return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> > > >
> > > > 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm
> > > >    restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch
> > > >    correctly in this case).
> > > >
> > > > 3.
> > > >    On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set
> > > >    this fb_map range as WC, with
> > > >    iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));
> > > >
> > > >    However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for
> > > >    this fb_map has been reserved as uc- by ioremap().
> > > >    Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-.
> > > >
> > > >    So, with KVM setting WB (no IPAT) to this fb_map range, the effective
> > > >    memory type is UC- and installer/gdm restarts endlessly.
> > > >
> > > > 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver
> > > >    to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly.
> > > >    (didn't verify the installer's case as I can't update the driver in that case).
> > > >
> > > >    The reason is that the ioremap_wc() called during starting GDM will no longer
> > > >    meet conflict and can map guest PAT as WC.
> >
> > Huh. The upside of this is that it sounds like there's nothing broken with WC
> > or self-snoop.
> Considering a different perspective, the fb_map range is used as frame buffer
> (vram), with the guest writing to this range and the host reading from it.
> If the issue were related to self-snooping, we would expect the VNC window to
> display distorted data. However, the observed behavior is that the GDM window
> shows up correctly for a sec and restarts over and over.
>
> So, do you think we can simply fix this issue by calling ioremap_wc() for the
> frame buffer/vram range in bochs driver, as is commonly done in other gpu
> drivers?
>
> --- a/drivers/gpu/drm/tiny/bochs.c
> +++ b/drivers/gpu/drm/tiny/bochs.c
> @@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev)
>  	if (pci_request_region(pdev, 0, "bochs-drm") != 0)
>  		DRM_WARN("Cannot request framebuffer, boot fb still active?\n");
>
> -	bochs->fb_map = ioremap(addr, size);
> +	bochs->fb_map = ioremap_wc(addr, size);
>  	if (bochs->fb_map == NULL) {
>  		DRM_ERROR("Cannot map framebuffer\n");
>  		return -ENOMEM;
> > > >
> > > > WIP to find out why effective UC in fb_map range will make gdm to restart
> > > > endlessly.
> > > Not sure whether it's simply because UC is too slow.
> > >
> > > T=Test execution time of a selftest in which guest writes to a GPA for
> > >   0x1000000UL times
> > >
> > >               | Sapphire Rapids XCC  | Coffee Lake-S
> > > --------------|----------------------|-----------------
> > > KVM UC+IPAT   | T=0m4.530s           | T=0m0.622s
> >
> > Woah. Have you tried testing MOVDIR64 and/or WT? E.g. to see if the problem is
> > with UC specifically, or if it occurs with any accesses that immediately write
> > through to main memory.
> >
> > > --------------|----------------------|-----------------
> > > KVM WC+IPAT   | T=0m0.149s           | T=0m0.176s
> > > --------------|----------------------|-----------------
> > > KVM WB+IPAT   | T=0m0.148s           | T=0m0.148s
> > > ------------------------------------------------------
>
> I re-run all the tests and collected an averaged data (10 times each) as
> below (previous data was just a single-run score):
>
>
> T=Test execution time of a selftest in which guest writes to a GPA for
>   0x1000000UL times with WRITE_ONCE
>
> KVM memtype  | Sapphire Rapids XCC | Coffee Lake-S
> -------------|---------------------|----------------
> WB+IPAT      | T=0.1511s           | T=0.1661s
> -------------|---------------------|----------------
> WC+IPAT      | T=0.1411s           | T=0.1656s
> -------------|---------------------|----------------
> WT+IPAT      | T=3.7527s           | T=0.6156s
> -------------|---------------------|----------------
> WP+IPAT      | T=4.4663s           | T=0.6203s
> -------------|---------------------|----------------
> UC+IPAT      | T=3.4632s           | T=0.5868s
>
>
> T=Test execution time of a selftest in which guest writes to a GPA for
>   0x1000000UL times with movdir64b.
>
> (Coffee Lake-S has no feature movdir64).
>
> KVM memtype  | Sapphire Rapids XCC | Coffee Lake-S
> -------------|---------------------|----------------
> WB+IPAT      | T=2.6142s           | /
> -------------|---------------------|----------------
> WC+IPAT      | T=2.8919s           | /
> -------------|---------------------|----------------
> WT+IPAT      | T=3.0966s           | /
> -------------|---------------------|----------------
> WP+IPAT      | T=2.4933s           | /
> -------------|---------------------|----------------
> UC+IPAT      | T=3.4606s           | /
>

Up to now, I think I have root-caused this issue.

Status before this update:
In either Ubuntu or CentOS, on "Sapphire Rapids XCC":
- gdm fails to launch gnome-shell when wayland is enabled and the effective
  memory type is UC/UC-.
- gdm is able to launch gnome-shell correctly when wayland is enabled and the
  effective memory type is WB or WC.
- gdm is able to launch gnome-shell correctly when wayland is not enabled,
  with any effective memory type.

Update:
1. I tried KVM memtype = WT + IPAT for this framebuffer range; gdm fails to
   launch gnome-shell when wayland is enabled.
   Since the only difference between WT and WB is that writes are slow with WT,
   the failure should not be a self-snoop issue.

2. The current bochs driver calls ioremap() to map the framebuffer range.
   On x86 architectures, ioremap() maps the VA with PAT=UC- and invokes
   memtype_reserve() to reserve the memory type as UC- for the physical range.
   This reservation can cause subsequent calls to ioremap_wc() in
   ttm_kmap_iter_linear_io_init() to fail to map the VA with PAT=WC to the same
   framebuffer range.
   Consequently, drm_gem_vram_bo_driver_move() becomes significantly slower on
   platforms where UC memory access is slow.

   When host KVM honors guest PAT memory types, the effective memory type for
   this framebuffer range is
   - WC when ioremap_wc() is used in the driver probing phase
   - UC- when ioremap() is used.

   I measured the data below for drm_gem_vram_bo_driver_move(), which does a
   memset to this framebuffer range with size 0x3e8000.

   -----------------------------------------------------------------
   |                               | in bochs_hw_init()            |
   |                               | ioremap()      | ioremap_wc() |
   |-------------------------------|----------------|--------------|
   | cycles of                     | 2227.4M        | 17.8M        |
   | drm_gem_vram_bo_driver_move() |                |              |
   |-------------------------------|----------------|--------------|
   | time of                       | 1.24s          | 0.01s        |
   | drm_gem_vram_bo_driver_move() |                |              |
   -----------------------------------------------------------------

   drm_gem_vram_bo_driver_move
     ttm_bo_move_memcpy()
       ttm_kmap_iter_linear_io_init()
         iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));
       ttm_move_memcpy
         memset_io or drm_memcpy_from_wc

   If I comment out the memset_io() and drm_memcpy_from_wc() calls in
   ttm_move_memcpy(), drm_gem_vram_bo_driver_move() becomes very fast and gdm
   is able to launch gnome-shell and log in successfully, though sometimes the
   screen is a little blurred.

3. I sent a fix at [1] to let the guest bochs driver map the framebuffer with
   PAT=WC for kernel access.

[1] https://lore.kernel.org/all/20240909051529.26776-1-yan.y.zhao@xxxxxxxxx/