Re: [bugzilla-daemon@xxxxxxxxxx: [Bug 219619] New: vfio-pci: screen graphics artifacts after 6.12 kernel upgrade]

On Mon, 23 Dec 2024 11:59:06 -0500
Peter Xu <peterx@xxxxxxxxxx> wrote:

> On Mon, Dec 23, 2024 at 07:37:46AM +0000, Athul Krishna wrote:
> > Can confirm. Reverting f9e54c3a2f5b from v6.13-rc1 fixed the problem.  
> 
> I suppose Alex should have some more thoughts, probably after the holidays.
> Before that, one quick question to ask..

Yeah, apologies in advance for latency over the next couple weeks.

> > -------- Original Message --------
> > On 23/12/24 04:06, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> >   
> > >  Forwarding since not everybody follows bugzilla.  Apparently bisected
> > >  to f9e54c3a2f5b ("vfio/pci: implement huge_fault support").
> > >  
> > >  Athul, f9e54c3a2f5b appears to revert cleanly from v6.13-rc1.  Can you
> > >  verify that reverting it is enough to avoid these artifacts?
> > >  
> > >  #regzbot introduced: f9e54c3a2f5b ("vfio/pci: implement huge_fault support")
> > >  
> > >  ----- Forwarded message from bugzilla-daemon@xxxxxxxxxx -----
> > >  
> > >  Date: Sat, 21 Dec 2024 10:10:02 +0000
> > >  From: bugzilla-daemon@xxxxxxxxxx
> > >  To: bjorn@xxxxxxxxxxxxxxxxxxxxxxx
> > >  Subject: [Bug 219619] New: vfio-pci: screen graphics artifacts after 6.12 kernel upgrade
> > >  Message-ID: <bug-219619-41252@xxxxxxxxxxxxxxxxxxxxxxxxx/>
> > >  
> > >  https://bugzilla.kernel.org/show_bug.cgi?id=219619
> > >  
> > >              Bug ID: 219619
> > >             Summary: vfio-pci: screen graphics artifacts after 6.12 kernel
> > >                      upgrade
> > >             Product: Drivers
> > >             Version: 2.5
> > >            Hardware: AMD
> > >                  OS: Linux
> > >              Status: NEW
> > >            Severity: normal
> > >            Priority: P3
> > >           Component: PCI
> > >            Assignee: drivers_pci@xxxxxxxxxxxxxxxxxxxx
> > >            Reporter: athul.krishna.kr@xxxxxxxxxxxxxx
> > >          Regression: No
> > >  
> > >  Created attachment 307382  
> > >    --> https://bugzilla.kernel.org/attachment.cgi?id=307382&action=edit  
> > >  dmesg  
> 
> vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs

Is the reset recovery message seen even with the suspect commit
reverted?  Timestamps here would be useful for correlation.

> pcieport 0000:00:01.1: AER: Multiple Uncorrectable (Non-Fatal) error message received from 0000:03:00.1
> vfio-pci 0000:03:00.0: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
> vfio-pci 0000:03:00.0:   device [1002:73ef] error status/mask=00100000/00000000
> vfio-pci 0000:03:00.0:    [20] UnsupReq               (First)
> vfio-pci 0000:03:00.0: AER:   TLP Header: 60001004 000000ff 0000007d fe7eb000
> vfio-pci 0000:03:00.1: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
> vfio-pci 0000:03:00.1:   device [1002:ab28] error status/mask=00100000/00000000
> vfio-pci 0000:03:00.1:    [20] UnsupReq               (First)
> vfio-pci 0000:03:00.1: AER:   TLP Header: 60001004 000000ff 0000007d fe7eb000
> vfio-pci 0000:03:00.1: AER:   Error of this Agent is reported first
> pcieport 0000:02:00.0: AER: broadcast error_detected message
> pcieport 0000:02:00.0: AER: broadcast mmio_enabled message
> pcieport 0000:02:00.0: AER: broadcast resume message
> pcieport 0000:02:00.0: AER: device recovery successful
> pcieport 0000:02:00.0: AER: broadcast error_detected message
> pcieport 0000:02:00.0: AER: broadcast mmio_enabled message
> pcieport 0000:02:00.0: AER: broadcast resume message
> pcieport 0000:02:00.0: AER: device recovery successful
> 
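
For anyone decoding the AER lines above by hand, here is a minimal sketch
in Python.  It assumes the standard PCIe layout of the Uncorrectable Error
Status register and of the first four TLP header DWORDs.  The bit-name
table is a partial list that mirrors the short names seen in the log, and
decode_uncor_status() and decode_tlp_header() are hypothetical helpers,
not kernel or tool APIs.

```python
# Partial map of AER Uncorrectable Error Status bits (assumed from the
# PCIe spec layout; names follow the short strings seen in the log).
UNCOR_BITS = {
    4: "DLP",
    5: "SDES",
    12: "TLP",
    13: "FCP",
    14: "CmpltTO",
    15: "CmpltAbrt",
    16: "UnxCmplt",
    17: "RxOF",
    18: "MalfTLP",
    19: "ECRC",
    20: "UnsupReq",
    21: "ACSViol",
}


def decode_uncor_status(status):
    """Return the names of all bits set in a 32-bit status word."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            names.append(UNCOR_BITS.get(bit, "bit %d" % bit))
    return names


def decode_tlp_header(dw0, dw1, dw2, dw3):
    """Decode a few fields of a memory-request TLP header (4 DWORDs)."""
    fmt = (dw0 >> 29) & 0x7          # Fmt[2:0]
    tlp_type = (dw0 >> 24) & 0x1F    # Type[4:0]
    is_4dw = bool(fmt & 0x1)         # 4DW header carries a 64-bit address
    has_data = bool(fmt & 0x2)       # set for writes
    if is_4dw:
        address = (dw2 << 32) | (dw3 & ~0x3)
    else:
        address = dw2 & ~0x3
    return {
        "fmt": fmt,
        "type": tlp_type,
        "is_4dw": is_4dw,
        "has_data": has_data,
        "length_dw": dw0 & 0x3FF,
        "requester_id": (dw1 >> 16) & 0xFFFF,
        "address": address,
    }


# Values copied from the log lines above.
print(decode_uncor_status(0x00100000))
hdr = decode_tlp_header(0x60001004, 0x000000FF, 0x0000007D, 0xFE7EB000)
print(hex(hdr["address"]), hdr)
```

Under those assumptions the status word has only bit 20 (UnsupReq) set,
and the logged header decodes as a 4DW memory request with data (a 64-bit
memory write) of 4 DWORDs to address 0x7dfe7eb000.
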
> > >  
> > >  Device: Asus Zephyrus GA402RJ
> > >  CPU: Ryzen 7 6800HS
> > >  GPU: RX 6700S
> > >  Kernel: 6.13.0-rc3-g8faabc041a00
> > >  
> > >  Problem:
> > >  Launching games or gpu bench-marking tools in qemu windows 11 vm will cause
> > >  screen artifacts, ultimately qemu will pause with unrecoverable error.  
> 
> Is there more information on what setup can reproduce it?
> 
> For example, does it only happen with Windows guests?  Does the GPU
> vendor/model matter?

And the CPU vendor?  I tested this predominantly on Intel + NVIDIA.  I'm
also not seeing any similar reports on r/VFIO, which is a bit strange as
there are a lot of bleeding-edge users there.  The bz is reported against
6.13.0-rc3-g8faabc041a00, and a revert against v6.13-rc1 was reported as
stable.  Has this actually been confirmed on v6.12, or might something in
v6.13-rc have introduced a new issue?

> > >  Commit:
> > >  f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101 is the first bad commit
> > >  commit f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101
> > >  Author: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > >  Date:   Mon Aug 26 16:43:53 2024 -0400
> > >  
> > >      vfio/pci: implement huge_fault support  
> 
> Personally I have no clue yet on how this could affect it.  I was initially
> worrying on any implicit cache mode changes on the mappings, but I don't
> think any of such was involved in this specific change.
> 
> This commit majorly does two things: (1) allow 2M/1G mappings for BARs
> instead of small 4Ks always, and (2) always lazy faults rather than
> "install everything in the 1st fault".  Maybe one of the two could have
> some impact in some way.
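
To make point (1) above concrete, here is a rough model in Python of when
a fault could be satisfied with a 2M or 1G mapping instead of a 4K one.
This is a simplified sketch, not the kernel code: it assumes 4K base pages
with x86-64 style PMD/PUD orders, and can_map_huge() with all of its
parameters is a hypothetical name chosen for illustration.

```python
PAGE_SHIFT = 12                 # assumed 4K base pages
PAGE_SIZE = 1 << PAGE_SHIFT
PMD_ORDER = 9                   # 2M = 4K << 9 (assumed x86-64 layout)
PUD_ORDER = 18                  # 1G = 4K << 18


def can_map_huge(fault_addr, vm_start, vm_end, start_pfn, order):
    """Model: can a mapping of the given order cover this fault?

    start_pfn is the page frame number backing vm_start.
    """
    size = PAGE_SIZE << order
    addr = fault_addr & ~(size - 1)          # align the fault down
    # The whole huge page must lie inside the mapped range.
    if addr < vm_start or addr + size > vm_end:
        return False
    # The backing pfn must be aligned to the same order.
    pfn = start_pfn + ((addr - vm_start) >> PAGE_SHIFT)
    return (pfn & ((1 << order) - 1)) == 0


# Illustrative numbers only: a 16M BAR mapped at a 1G-aligned address.
vm_start = 0x7F0000000000
vm_end = vm_start + (16 << 20)
start_pfn = 0xFC00000

print(can_map_huge(vm_start + 0x201000, vm_start, vm_end, start_pfn, PMD_ORDER))
print(can_map_huge(vm_start + 0x201000, vm_start, vm_end, start_pfn, PUD_ORDER))
```

In this model the 2M case succeeds and the 1G case falls back because the
range is smaller than 1G.  Shifting vm_start by one page while keeping the
same start_pfn would break the pfn alignment and force 4K mappings.
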

Athul, can you test reverting both f9e54c3a2f5b and d71a989cf5d9?  That
would provide the faulting behavior without yet making use of huge
pfnmaps.  Thanks,

Alex