Re: [bugzilla-daemon@xxxxxxxxxx: [Bug 219619] New: vfio-pci: screen graphics artifacts after 6.12 kernel upgrade]

Peter Xu <peterx@xxxxxxxxxx> · Mon, 23 Dec 2024 11:59:06 -0500

On Mon, Dec 23, 2024 at 07:37:46AM +0000, Athul Krishna wrote:
> Can confirm. Reverting f9e54c3a2f5b from v6.13-rc1 fixed the problem.

I suppose Alex will have more thoughts on this, probably after the
holidays.  Before that, one quick question.

> 
> -------- Original Message --------
> On 23/12/24 04:06, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> 
> >  Forwarding since not everybody follows bugzilla.  Apparently bisected
> >  to f9e54c3a2f5b ("vfio/pci: implement huge_fault support").
> >  
> >  Athul, f9e54c3a2f5b appears to revert cleanly from v6.13-rc1.  Can you
> >  verify that reverting it is enough to avoid these artifacts?
> >  
> >  #regzbot introduced: f9e54c3a2f5b ("vfio/pci: implement huge_fault support")
> >  
> >  ----- Forwarded message from bugzilla-daemon@xxxxxxxxxx -----
> >  
> >  Date: Sat, 21 Dec 2024 10:10:02 +0000
> >  From: bugzilla-daemon@xxxxxxxxxx
> >  To: bjorn@xxxxxxxxxxxxxxxxxxxxxxx
> >  Subject: [Bug 219619] New: vfio-pci: screen graphics artifacts after 6.12 kernel upgrade
> >  Message-ID: <bug-219619-41252@xxxxxxxxxxxxxxxxxxxxxxxxx/>
> >  
> >  https://bugzilla.kernel.org/show_bug.cgi?id=219619
> >  
> >              Bug ID: 219619
> >             Summary: vfio-pci: screen graphics artifacts after 6.12 kernel
> >                      upgrade
> >             Product: Drivers
> >             Version: 2.5
> >            Hardware: AMD
> >                  OS: Linux
> >              Status: NEW
> >            Severity: normal
> >            Priority: P3
> >           Component: PCI
> >            Assignee: drivers_pci@xxxxxxxxxxxxxxxxxxxx
> >            Reporter: athul.krishna.kr@xxxxxxxxxxxxxx
> >          Regression: No
> >  
> >  Created attachment 307382
> >    --> https://bugzilla.kernel.org/attachment.cgi?id=307382&action=edit
> >  dmesg

vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
pcieport 0000:00:01.1: AER: Multiple Uncorrectable (Non-Fatal) error message received from 0000:03:00.1
vfio-pci 0000:03:00.0: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
vfio-pci 0000:03:00.0:   device [1002:73ef] error status/mask=00100000/00000000
vfio-pci 0000:03:00.0:    [20] UnsupReq               (First)
vfio-pci 0000:03:00.0: AER:   TLP Header: 60001004 000000ff 0000007d fe7eb000
vfio-pci 0000:03:00.1: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
vfio-pci 0000:03:00.1:   device [1002:ab28] error status/mask=00100000/00000000
vfio-pci 0000:03:00.1:    [20] UnsupReq               (First)
vfio-pci 0000:03:00.1: AER:   TLP Header: 60001004 000000ff 0000007d fe7eb000
vfio-pci 0000:03:00.1: AER:   Error of this Agent is reported first
pcieport 0000:02:00.0: AER: broadcast error_detected message
pcieport 0000:02:00.0: AER: broadcast mmio_enabled message
pcieport 0000:02:00.0: AER: broadcast resume message
pcieport 0000:02:00.0: AER: device recovery successful
pcieport 0000:02:00.0: AER: broadcast error_detected message
pcieport 0000:02:00.0: AER: broadcast mmio_enabled message
pcieport 0000:02:00.0: AER: broadcast resume message
pcieport 0000:02:00.0: AER: device recovery successful

> >  
> >  Device: Asus Zephyrus GA402RJ
> >  CPU: Ryzen 7 6800HS
> >  GPU: RX 6700S
> >  Kernel: 6.13.0-rc3-g8faabc041a00
> >  
> >  Problem:
> >  Launching games or gpu bench-marking tools in qemu windows 11 vm will cause
> >  screen artifacts, ultimately qemu will pause with unrecoverable error.

Is there more information on which setups can reproduce it?

For example, does it only happen with Windows guests?  Does the GPU
vendor/model matter?

> >  
> >  Commit:
> >  f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101 is the first bad commit
> >  commit f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101
> >  Author: Alex Williamson <alex.williamson@xxxxxxxxxx>
> >  Date:   Mon Aug 26 16:43:53 2024 -0400
> >  
> >      vfio/pci: implement huge_fault support

Personally I have no clue yet how this commit could cause it.  I was
initially worried about implicit cache mode changes on the mappings, but I
don't think anything like that was involved in this specific change.

This commit mainly does two things: (1) it allows 2M/1G mappings for BARs
instead of always using small 4K mappings, and (2) it always faults lazily
rather than "install everything in the 1st fault".  Maybe one of the two
has some impact in some way.
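
To make (1) concrete, here is a rough userspace sketch of the kind of
alignment check that decides whether a fault can be served with a 2M or 1G
mapping or has to fall back to 4K.  This is not the kernel code; the
function names, the x86-64 page sizes and the example addresses are all
assumptions for illustration only.

```python
# Illustrative model only; names and constants are hypothetical.
PAGE_SHIFT = 12   # 4K base pages (x86-64 assumed)
PMD_ORDER = 9     # 2M = 4K << 9
PUD_ORDER = 18    # 1G = 4K << 18


def can_map(order, fault_addr, vma_start, vma_end, vma_pfn_base):
    """Return True if a mapping of size (4K << order) can cover fault_addr."""
    size = 1 << (PAGE_SHIFT + order)
    start = fault_addr & ~(size - 1)          # round down to mapping size
    if start < vma_start or start + size > vma_end:
        return False                          # would spill outside the VMA
    pfn = vma_pfn_base + ((start - vma_start) >> PAGE_SHIFT)
    return pfn & ((1 << order) - 1) == 0      # physical side must be aligned too


def best_order(fault_addr, vma_start, vma_end, vma_pfn_base):
    """Pick the largest usable order, falling back 1G -> 2M -> 4K."""
    for order in (PUD_ORDER, PMD_ORDER, 0):
        if can_map(order, fault_addr, vma_start, vma_end, vma_pfn_base):
            return order
    raise ValueError("fault address is outside the VMA")


if __name__ == "__main__":
    vma_start = 0x7F0000000000                # hypothetical user mapping
    vma_end = vma_start + (16 << 20)          # hypothetical 16M BAR
    fault = vma_start + 0x201000
    # BAR physically 2M-aligned: the fault can use a 2M mapping.
    print(best_order(fault, vma_start, vma_end, 0xFC000))
    # BAR offset by one 4K page: only 4K mappings are possible.
    print(best_order(fault, vma_start, vma_end, 0xFC001))
```

With (2), each such fault installs only the one mapping chosen above, so
the rest of the BAR stays unmapped until it is touched.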

IIUC the basic paths were covered and should hopefully work, so I wonder
what is special about this case.  It might be related to the questions
above about which setups reproduce it.

Thanks,

-- 
Peter Xu