Re: [bugzilla-daemon@xxxxxxxxxx: [Bug 219619] New: vfio-pci: screen graphics artifacts after 6.12 kernel upgrade]

On 23.12.24 17:59, Peter Xu wrote:
On Mon, Dec 23, 2024 at 07:37:46AM +0000, Athul Krishna wrote:
Can confirm. Reverting f9e54c3a2f5b from v6.13-rc1 fixed the problem.
Device: Asus Zephyrus GA402RJ
  CPU: Ryzen 7 6800HS
  GPU: RX 6700S
  Kernel: 6.13.0-rc3-g8faabc041a00
Problem:
  Launching games or GPU benchmarking tools in a QEMU Windows 11 VM causes
  screen artifacts; ultimately QEMU pauses with an unrecoverable error.

Is there more information on what setup can reproduce it?

For example, does it only happen with Windows guests?  Does the GPU
vendor/model matter?

In my case, both Windows and Linux guests fail to initialize the GPU in the first place since 6.12; QEMU does not crash. I also found commit f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101 by bisection.

CPU: AMD 7950X3D
GPU (guest): AMD RX 6700XT (12GB)
Mainboard: ASRock X670E Steel Legend
Kernel: 6.12.0-rc0 .. 6.13.0-rc2

Based on a handful of reports on the Arch forum and on r/vfio, my guess is that affected users have Resizable BAR or similar settings enabled in the firmware, which usually applies the maximum possible BAR size advertised by the GPU on startup. Non-2^n-sized VRAM buffers may be another commonality.

Some other reports I found that might match this regression:
[1] https://bbs.archlinux.org/viewtopic.php?id=301352
- Several reports (besides mine); it is not clear which of those cases are triggered by the vfio-pci commit. One case is clearly caused by a different commit in KVM. Potential candidates for the vfio-pci commit (speculation): (a) 6700XT GPU; (b) 5950X CPU, RTX 3090 GPU
[2] https://old.reddit.com/r/VFIO/comments/1hkvedq/
- Two users, 7900XT and 7900XTX, reported that reverting 6.12 or disabling ReBAR resolves Windows guest GPU initialization.

On my 6700XT (PCI address 03:00.0, 12GB of VRAM), I get the following Resizable BAR default configuration with the host firmware/UEFI setting enabled:

[root]# lspci -s 03:00.0 -vv
...
Capabilities: [200 v1] Physical Resizable BAR
	BAR 0: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB
	BAR 2: current size: 256MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB
...
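
For anyone who wants to compare their own setup against the reports above, the Resizable BAR lines can be pulled out of the lspci output with a short script. This is only a sketch: the helper name parse_rebar is made up for illustration, and the regular expression assumes the "BAR N: current size: X, supported: ..." layout shown above, which may differ between pciutils versions.

```python
import re

# Sample text copied from the lspci output above.
SAMPLE = """\
Capabilities: [200 v1] Physical Resizable BAR
	BAR 0: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB
	BAR 2: current size: 256MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB
"""

# Assumed line layout; adjust if your pciutils prints it differently.
LINE_RE = re.compile(r"BAR (\d+): current size: (\S+), supported: (.+)")


def parse_rebar(text):
    """Return {bar_index: (current_size, [supported sizes])} (hypothetical helper)."""
    bars = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            bars[int(m.group(1))] = (m.group(2), m.group(3).split())
    return bars


if __name__ == "__main__":
    for idx, (cur, sup) in sorted(parse_rebar(SAMPLE).items()):
        print(f"BAR {idx}: current {cur}, supported {' '.join(sup)}")
```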

The 16GB configuration above fails with 6.12 (unless I revert commit f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101). Now, if I change BAR 0 to 8GB (as below), which is below the GPU's VRAM size of 12GB, the Linux guest manages to initialize the GPU.

[root]# echo "0000:03:00.0" > /sys/bus/pci/drivers/vfio-pci/unbind
[root]# # 13: 8GB, 14: 16GB (value n selects 2^n MB)
[root]# echo 13 > /sys/bus/pci/devices/0000:03:00.0/resource0_resize
[root]# echo "0000:03:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
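
The values written to resource0_resize above follow, as far as I understand the sysfs interface, the convention that writing n requests a BAR size of 2^n MB, which is why 13 means 8GB and 14 means 16GB. A small sketch of that conversion (the helper names are hypothetical and not part of any tool):

```python
def resize_value_to_mb(n):
    """Size in MB requested by writing n to resourceN_resize (assumed 2^n MB)."""
    return 1 << n


def mb_to_resize_value(size_mb):
    """Inverse mapping; only power-of-two sizes (in MB) can be expressed."""
    if size_mb <= 0 or size_mb & (size_mb - 1):
        raise ValueError("BAR size must be a power of two in MB")
    return size_mb.bit_length() - 1


if __name__ == "__main__":
    for n in (8, 13, 14):
        print(n, "->", resize_value_to_mb(n), "MB")
```

A 12GB size (12288 MB) is rejected by mb_to_resize_value, which matches the observation that a 12GB card ends up with either an 8GB or a 16GB BAR 0.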

On my mainboard, 'Resizable BAR off' sets BAR 0 to 256MB, which also works with 6.12.

Only the size of BAR 0 (VRAM) appears to be relevant here. BAR 2 sizes of 2MB vs. 256MB have no effect on the outcome.


Commit:
  f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101 is the first bad commit
  commit f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101
  Author: Alex Williamson <alex.williamson@xxxxxxxxxx>
  Date:   Mon Aug 26 16:43:53 2024 -0400

      vfio/pci: implement huge_fault support

Personally I have no clue yet how this could affect it.  I was initially
worried about implicit cache mode changes on the mappings, but I don't
think anything like that was involved in this specific change.

This commit mainly does two things: (1) allow 2M/1G mappings for BARs
instead of always using small 4K ones, and (2) always fault lazily rather
than "install everything in the 1st fault".  Maybe one of the two could
have some impact in some way.

In my case, commenting out (1) the huge_fault callback assignment from f9e54c3a2f5b suffices for GPU initialization in the guest, even if (2) the 'install everything' loop is still removed.
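
To put the mapping granularities from (1) in perspective, here is a back-of-the-envelope count of how many mappings it takes to cover a BAR at each page size. It assumes the usual x86-64 sizes (4K, 2M, 1G) and a perfectly aligned BAR mapped entirely at one granularity, which the real fault path need not achieve. It says nothing about where the bug is.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

# Assumed x86-64 mapping sizes.
PAGE_SIZES = {"4K": 4 * KIB, "2M": 2 * MIB, "1G": GIB}


def mappings_needed(bar_bytes):
    """Mappings per granularity; 0 means the BAR is smaller than that size."""
    return {name: bar_bytes // size for name, size in PAGE_SIZES.items()}


if __name__ == "__main__":
    for label, size in (("256MB", 256 * MIB), ("8GB", 8 * GIB), ("16GB", 16 * GIB)):
        print(label, mappings_needed(size))
```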

I have uploaded host kernel logs with vfio-pci-core debugging enabled (one log with stock sources, one large log with vfio-pci-core's huge_fault handler patched out):
https://bugzilla.kernel.org/show_bug.cgi?id=219619#c1
I'm not sure if the logs of handled faults say much about what specifically goes wrong here, though.

The dmesg portion attached to my mail is of a Linux guest failing to initialize the GPU (BAR 0 size 16GB with 12GB of VRAM).
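
If others want to check whether their guest sees the same "BAR larger than VRAM" situation, the relevant amdgpu line can be matched with a few lines of Python. This is a sketch that assumes the "Detected VRAM RAM=...M, BAR=...M" wording from my attached log; the function name is hypothetical and other driver versions may print the line differently.

```python
import re

# Line copied from the attached guest dmesg.
LINE = "[   10.284375] [drm] Detected VRAM RAM=12272M, BAR=16384M"


def vram_and_bar_mb(line):
    """Return (vram_mb, bar_mb) from an amdgpu 'Detected VRAM' line, or None."""
    m = re.search(r"Detected VRAM RAM=(\d+)M, BAR=(\d+)M", line)
    return (int(m.group(1)), int(m.group(2))) if m else None


if __name__ == "__main__":
    vram, bar = vram_and_bar_mb(LINE)
    print(f"VRAM {vram}M, BAR {bar}M, BAR exceeds VRAM: {bar > vram}")
```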

Thanks,
Precific
- Dmesg of a Linux guest failing amdgpu initialization. Host running kernel 6.12/6.13 with ReBAR enabled (16GB BAR 0).
[Note: some variations can occur, e.g., the error sometimes occurs at a later stage of initialization.]

[   10.245100] [drm] amdgpu kernel modesetting enabled.
[   10.245173] amdgpu: Virtual CRAT table created for CPU
[   10.245182] amdgpu: Topology: Add CPU node
[   10.245480] [drm] initializing kernel modesetting (NAVY_FLOUNDER 0x1002:0x73DF 0x1002:0x0E36 0xC1).
[   10.245492] [drm] register mmio base: 0x81A00000
[   10.245493] [drm] register mmio size: 1048576
[   10.248861] [drm] add ip block number 0 <nv_common>
[   10.248862] [drm] add ip block number 1 <gmc_v10_0>
[   10.248863] [drm] add ip block number 2 <navi10_ih>
[   10.248864] [drm] add ip block number 3 <psp>
[   10.248864] [drm] add ip block number 4 <smu>
[   10.248865] [drm] add ip block number 5 <dm>
[   10.248866] [drm] add ip block number 6 <gfx_v10_0>
[   10.248867] [drm] add ip block number 7 <sdma_v5_2>
[   10.248867] [drm] add ip block number 8 <vcn_v3_0>
[   10.248868] [drm] add ip block number 9 <jpeg_v3_0>
[   10.248877] amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from VFCT
[   10.248878] amdgpu: ATOM BIOS: 113-D5121100-101
[   10.270097] [drm] VCN(0) decode is enabled in VM mode
[   10.270099] [drm] VCN(0) encode is enabled in VM mode
[   10.284318] [drm] JPEG decode is enabled in VM mode
[   10.284320] amdgpu 0000:05:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[   10.284359] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[   10.284365] amdgpu 0000:05:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
[   10.284367] amdgpu 0000:05:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[   10.284375] [drm] Detected VRAM RAM=12272M, BAR=16384M
[   10.284376] [drm] RAM width 192bits GDDR6
[   10.284495] [drm] amdgpu: 12272M of VRAM memory ready
[   10.284496] [drm] amdgpu: 16042M of GTT memory ready.
[   10.284505] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   10.284626] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[   12.218276] amdgpu 0000:05:00.0: amdgpu: STB initialized to 2048 entries
[   12.218333] [drm] Loading DMUB firmware via PSP: version=0x02020020
[   12.218647] [drm] use_doorbell being set to: [true]
[   12.218658] [drm] use_doorbell being set to: [true]
[   12.218667] [drm] Found VCN firmware Version ENC: 1.30 DEC: 3 VEP: 0 Revision: 4
[   12.218672] amdgpu 0000:05:00.0: amdgpu: Will use PSP to load VCN firmware
[   14.390991] [drm] psp gfx command ID_LOAD_TOC(0x20) failed and response status is (0x0)
[   14.390994] [drm:psp_hw_start [amdgpu]] *ERROR* Failed to load toc
[   14.391223] [drm:psp_hw_start [amdgpu]] *ERROR* PSP tmr init failed!
[   14.411423] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[   14.411604] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[   14.411784] amdgpu 0000:05:00.0: amdgpu: amdgpu_device_ip_init failed
[   14.411785] amdgpu 0000:05:00.0: amdgpu: Fatal error during GPU init
[   14.411786] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[   14.411928] ------------[ cut here ]------------
[   14.411929] WARNING: CPU: 6 PID: 507 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:622 amdgpu_irq_put+0x46/0x70 [amdgpu]
[   14.412114] Modules linked in: amdgpu(+) video wmi amdxcp i2c_algo_bit drm_ttm_helper crct10dif_pclmul ttm crc32_pclmul crc32c_intel polyval_clmulni drm_exec polyval_generic ghash_clmulni_intel gpu_sched nvme sha512_ssse3 drm_suballoc_helper drm_buddy sha256_ssse3 drm_display_helper nvme_core sha1_ssse3 virtio_net cec nvme_auth virtio_console net_failover virtio_blk failover qemu_fw_cfg serio_raw ip6_tables ip_tables fuse
[   14.412133] CPU: 6 PID: 507 Comm: (udev-worker) Not tainted 6.8.5-201.fc39.x86_64 #1
[   14.412134] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[   14.412135] RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]
[   14.412305] Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 e9 6a 30 bc e3 e9 5a fd ff ff <0f> 0b b8 ea ff ff ff e9 59 30 bc e3 b8 ea ff ff ff e9 4f 30 bc e3
[   14.412306] RSP: 0018:ffffaae50112ba60 EFLAGS: 00010246
[   14.412308] RAX: ffff8bbcca3ed100 RBX: ffff8bbcd19987a8 RCX: 0000000000000000
[   14.412309] RDX: 0000000000000000 RSI: ffff8bbcd19a4db8 RDI: ffff8bbcd1980000
[   14.412310] RBP: ffff8bbcd19901e8 R08: 0000000000000000 R09: ffffaae50112b878
[   14.412311] R10: ffffaae50112b870 R11: 0000000000000003 R12: ffff8bbcd19905c8
[   14.412311] R13: ffff8bbcd1980010 R14: ffff8bbcd1980000 R15: ffff8bbcd19a4db8
[   14.412313] FS:  00007f5fde03e980(0000) GS:ffff8bc41fb80000(0000) knlGS:0000000000000000
[   14.412315] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   14.412316] CR2: 00005623742f1000 CR3: 000000010c1fa000 CR4: 0000000000750ef0
[   14.412318] PKRU: 55555554
[   14.412319] Call Trace:
[   14.412320]  <TASK>
[   14.412321]  ? amdgpu_irq_put+0x46/0x70 [amdgpu]
[   14.412493]  ? __warn+0x81/0x130
[   14.412497]  ? amdgpu_irq_put+0x46/0x70 [amdgpu]
[   14.412677]  ? report_bug+0x171/0x1a0
[   14.412681]  ? handle_bug+0x3c/0x80
[   14.412683]  ? exc_invalid_op+0x17/0x70
[   14.412685]  ? asm_exc_invalid_op+0x1a/0x20
[   14.412688]  ? amdgpu_irq_put+0x46/0x70 [amdgpu]
[   14.412857]  amdgpu_fence_driver_hw_fini+0xfe/0x130 [amdgpu]
[   14.413049]  amdgpu_device_fini_hw+0xa6/0x400 [amdgpu]
[   14.413233]  ? blocking_notifier_chain_unregister+0x36/0x50
[   14.413236]  amdgpu_driver_load_kms+0xec/0x190 [amdgpu]
[   14.413411]  amdgpu_pci_probe+0x18b/0x510 [amdgpu]
[   14.413586]  local_pci_probe+0x42/0xa0
[   14.413589]  pci_device_probe+0xc7/0x240
[   14.413592]  really_probe+0x19b/0x3e0
[   14.413595]  ? __pfx___driver_attach+0x10/0x10
[   14.413597]  __driver_probe_device+0x78/0x160
[   14.413599]  driver_probe_device+0x1f/0x90
[   14.413601]  __driver_attach+0xd2/0x1c0
[   14.413603]  bus_for_each_dev+0x85/0xd0
[   14.413605]  bus_add_driver+0x116/0x220
[   14.413607]  driver_register+0x59/0x100
[   14.413609]  ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
[   14.413768]  do_one_initcall+0x58/0x320
[   14.413772]  do_init_module+0x60/0x240
[   14.413775]  __do_sys_init_module+0x17f/0x1b0
[   14.413776]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413782]  do_syscall_64+0x83/0x170
[   14.413784]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413786]  ? __count_memcg_events+0x4d/0xc0
[   14.413788]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413790]  ? count_memcg_events.constprop.0+0x1a/0x30
[   14.413792]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413793]  ? handle_mm_fault+0xa2/0x360
[   14.413795]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413797]  ? do_user_addr_fault+0x304/0x670
[   14.413800]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413801]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413803]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[   14.413805] RIP: 0033:0x7f5fdea2cb9e
[   14.413808] Code: 48 8b 0d 95 12 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 62 12 0c 00 f7 d8 64 89 01 48
[   14.413809] RSP: 002b:00007ffc13be8998 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[   14.413811] RAX: ffffffffffffffda RBX: 00005623741c55a0 RCX: 00007f5fdea2cb9e
[   14.413812] RDX: 00005623741be530 RSI: 00000000019d58ce RDI: 00007f5fdb000010
[   14.413813] RBP: 00007ffc13be8a50 R08: 0000562374199010 R09: 0000000000000007
[   14.413814] R10: 0000000000000001 R11: 0000000000000246 R12: 00005623741be530
[   14.413814] R13: 0000000000020000 R14: 00005623741c0030 R15: 00005623741c9120
[   14.413817]  </TASK>
[   14.413818] ---[ end trace 0000000000000000 ]---

