On Mon, 23 Dec 2024 11:59:06 -0500
Peter Xu <peterx@xxxxxxxxxx> wrote:

> On Mon, Dec 23, 2024 at 07:37:46AM +0000, Athul Krishna wrote:
> > Can confirm. Reverting f9e54c3a2f5b from v6.13-rc1 fixed the problem.
>
> I suppose Alex should have some more thoughts, probably after the holidays.
> Before that, one quick question to ask..

Yeah, apologies in advance for latency over the next couple weeks.

> > -------- Original Message --------
> > On 23/12/24 04:06, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> >
> > > Forwarding since not everybody follows bugzilla. Apparently bisected
> > > to f9e54c3a2f5b ("vfio/pci: implement huge_fault support").
> > >
> > > Athul, f9e54c3a2f5b appears to revert cleanly from v6.13-rc1. Can you
> > > verify that reverting it is enough to avoid these artifacts?
> > >
> > > #regzbot introduced: f9e54c3a2f5b ("vfio/pci: implement huge_fault support")
> > >
> > > ----- Forwarded message from bugzilla-daemon@xxxxxxxxxx -----
> > >
> > > Date: Sat, 21 Dec 2024 10:10:02 +0000
> > > From: bugzilla-daemon@xxxxxxxxxx
> > > To: bjorn@xxxxxxxxxxxxxxxxxxxxxxx
> > > Subject: [Bug 219619] New: vfio-pci: screen graphics artifacts after 6.12 kernel upgrade
> > > Message-ID: <bug-219619-41252@xxxxxxxxxxxxxxxxxxxxxxxxx/>
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=219619
> > >
> > > Bug ID: 219619
> > > Summary: vfio-pci: screen graphics artifacts after 6.12 kernel
> > > upgrade
> > > Product: Drivers
> > > Version: 2.5
> > > Hardware: AMD
> > > OS: Linux
> > > Status: NEW
> > > Severity: normal
> > > Priority: P3
> > > Component: PCI
> > > Assignee: drivers_pci@xxxxxxxxxxxxxxxxxxxx
> > > Reporter: athul.krishna.kr@xxxxxxxxxxxxxx
> > > Regression: No
> > >
> > > Created attachment 307382
> > > --> https://bugzilla.kernel.org/attachment.cgi?id=307382&action=edit
> > > dmesg
>
> vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs

Is the reset recovery message seen even with the suspect commit reverted?
Timestamps here would be useful for correlation.

> pcieport 0000:00:01.1: AER: Multiple Uncorrectable (Non-Fatal) error message received from 0000:03:00.1
> vfio-pci 0000:03:00.0: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
> vfio-pci 0000:03:00.0: device [1002:73ef] error status/mask=00100000/00000000
> vfio-pci 0000:03:00.0: [20] UnsupReq (First)
> vfio-pci 0000:03:00.0: AER: TLP Header: 60001004 000000ff 0000007d fe7eb000
> vfio-pci 0000:03:00.1: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
> vfio-pci 0000:03:00.1: device [1002:ab28] error status/mask=00100000/00000000
> vfio-pci 0000:03:00.1: [20] UnsupReq (First)
> vfio-pci 0000:03:00.1: AER: TLP Header: 60001004 000000ff 0000007d fe7eb000
> vfio-pci 0000:03:00.1: AER: Error of this Agent is reported first
> pcieport 0000:02:00.0: AER: broadcast error_detected message
> pcieport 0000:02:00.0: AER: broadcast mmio_enabled message
> pcieport 0000:02:00.0: AER: broadcast resume message
> pcieport 0000:02:00.0: AER: device recovery successful
> pcieport 0000:02:00.0: AER: broadcast error_detected message
> pcieport 0000:02:00.0: AER: broadcast mmio_enabled message
> pcieport 0000:02:00.0: AER: broadcast resume message
> pcieport 0000:02:00.0: AER: device recovery successful
>
> > >
> > > Device: Asus Zephyrus GA402RJ
> > > CPU: Ryzen 7 6800HS
> > > GPU: RX 6700S
> > > Kernel: 6.13.0-rc3-g8faabc041a00
> > >
> > > Problem:
> > > Launching games or gpu bench-marking tools in qemu windows 11 vm will cause
> > > screen artifacts, ultimately qemu will pause with unrecoverable error.
>
> Is there more information on what setup can reproduce it?
>
> For example, does it only happen with Windows guests? Does the GPU
> vendor/model matter?

And the CPU vendor, this was predominately tested by me on Intel +
NVIDIA. I'm also not seeing any similar reports on r/VFIO, which is a
bit strange as there are a lot of bleeding edge users there.
The bz is reported against 6.13.0-rc3-g8faabc041a00 and a revert against
v6.13-rc1 was reported as stable. Has this actually been confirmed on
v6.12, or might something in v6.13-rc have introduced a new issue?

> > > Commit:
> > > f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101 is the first bad commit
> > > commit f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101
> > > Author: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > Date: Mon Aug 26 16:43:53 2024 -0400
> > >
> > > vfio/pci: implement huge_fault support
>
> Personally I have no clue yet on how this could affect it. I was initially
> worrying on any implicit cache mode changes on the mappings, but I don't
> think any of such was involved in this specific change.
>
> This commit majorly does two things: (1) allow 2M/1G mappings for BARs
> instead of small 4Ks always, and (2) always lazy faults rather than
> "install everything in the 1st fault". Maybe one of the two could have
> some impact in some way.

Athul, can you test reverting both f9e54c3a2f5b and d71a989cf5d9? That
would provide the faulting behavior without yet making use of huge
pfnmaps. Thanks,

Alex
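
As an aid for reading the AER log quoted above, the four "TLP Header"
dwords can be decoded by hand. What follows is only a minimal sketch,
assuming the standard PCIe memory request header layout and that the
dwords are printed in header order. The function name and the returned
field names are illustrative and do not come from the kernel or any tool.

```python
def decode_mem_tlp_header(dwords):
    """Decode the four header dwords of a PCIe memory request TLP.

    Hypothetical helper for illustration; handles only plain memory
    read/write requests (Type field 0).
    """
    dw0, dw1, dw2, dw3 = dwords
    fmt = (dw0 >> 29) & 0x7         # 3-bit Fmt field
    tlp_type = (dw0 >> 24) & 0x1F   # 5-bit Type field
    length_dw = dw0 & 0x3FF         # payload length in dwords, 0 encodes 1024
    if tlp_type != 0 or fmt > 3:
        raise ValueError("not a plain memory read/write request")

    has_data = bool(fmt & 0x2)      # set for writes
    four_dw = bool(fmt & 0x1)       # set for 4DW headers (64-bit address)
    requester = (dw1 >> 16) & 0xFFFF

    if four_dw:
        # DW2 holds address bits 63:32, DW3 holds bits 31:2
        address = (dw2 << 32) | (dw3 & 0xFFFFFFFC)
    else:
        # 3DW header: DW2 holds address bits 31:2
        address = dw2 & 0xFFFFFFFC

    return {
        "kind": ("MWr" if has_data else "MRd") + ("64" if four_dw else "32"),
        "length_dw": length_dw or 1024,
        # Requester ID is bus[15:8], device[7:3], function[2:0]
        "requester": "%02x:%02x.%x" % (
            requester >> 8, (requester >> 3) & 0x1F, requester & 0x7),
        "tag": (dw1 >> 8) & 0xFF,
        "last_be": (dw1 >> 4) & 0xF,
        "first_be": dw1 & 0xF,
        "address": address,
    }


# Header dwords copied from the AER lines in the quoted dmesg
header = [int(x, 16) for x in "60001004 000000ff 0000007d fe7eb000".split()]
decoded = decode_mem_tlp_header(header)
print(decoded["kind"], hex(decoded["address"]), decoded["requester"])
```

Under those assumptions the logged header reads as a 4-dword, 64-bit
memory write to 0x7dfe7eb000 with requester ID 00:00.0.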