[Public]

> -----Original Message-----
> From: Juergen Gross <jgross@xxxxxxxx>
> Sent: Friday, December 15, 2023 6:57 AM
> To: lkml <linux-kernel@xxxxxxxxxxxxxxx>; xen-devel@xxxxxxxxxxxxxxxxxxxx;
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian
> <Christian.Koenig@xxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>
> Subject: Crashes under Xen with Radeon graphics card
>
> Hi,
>
> I recently stumbled over a test system which showed crashes probably
> resulting from memory being overwritten randomly.
>
> The problem is occurring only in Dom0 when running under Xen. It seems to
> be present since at least kernel 6.3 (I didn't go back further yet), and it seems
> NOT to be present in kernel 5.14.
>
> I tracked the problem down to the initialization of the graphics card (the
> problem might surface only later, but at least an early initialization failure made
> the problem go away).
>
> # lspci
> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Caicos XTX [Radeon HD 8490 / R5 235X OEM]
> 01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Caicos HDMI Audio [Radeon HD 6450 / 7450/8450/8490 OEM / R5 230/235/235X OEM]
>
> I had a working .config and one which did produce the crashes, so I narrowed
> the problem down to detect that the important difference was in the area of
> firmware loading (the working .config didn't have
> CONFIG_FW_LOADER_COMPRESS_XZ set, causing firmware loading for the
> card to fail). This was of course not the real problem, but it caused the card
> initialization to fail.
>
> I manually decompressed the firmware files one by one to see whether the
> problem would be in the decompressor or probably in the driver of the card.
>
> The last step without crash was:
>
> # dmesg | grep radeon
> [ 10.106405] [drm] radeon kernel modesetting enabled.
> [ 10.106455] radeon 0000:01:00.0: vgaarb: deactivate vga console
> [ 10.222944] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
> [ 10.252921] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
> [ 10.278255] [drm] radeon: 1024M of VRAM memory ready
> [ 10.295828] [drm] radeon: 1024M of GTT memory ready.
> [ 10.295867] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_pfp.bin succeeded
> [ 10.330846] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_me.bin succeeded
> [ 10.330858] radeon 0000:01:00.0: Direct firmware load for radeon/BTC_rlc.bin succeeded
> [ 10.330870] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_mc.bin failed with error -2
> [ 10.380979] ni_cp: Failed to load firmware "radeon/CAICOS_mc.bin"
> [ 10.381006] [drm:evergreen_init [radeon]] *ERROR* Failed to load firmware!
> [ 10.405765] radeon 0000:01:00.0: Fatal error during GPU init
> [ 10.432107] [drm] radeon: finishing device.
> [ 10.439179] [drm] radeon: ttm finalized
> [ 10.463203] radeon: probe of 0000:01:00.0 failed with error -2
>
> And with decompressing radeon/CAICOS_mc.bin I got:
>
> # dmesg | grep radeon
> [ 10.266491] [drm] radeon kernel modesetting enabled.
> [ 10.266552] radeon 0000:01:00.0: vgaarb: deactivate vga console
> [ 10.456047] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
> [ 10.470270] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
> [ 10.566946] [drm] radeon: 1024M of VRAM memory ready
> [ 10.576891] [drm] radeon: 1024M of GTT memory ready.
> [ 10.586971] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_pfp.bin succeeded
> [ 10.611886] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_me.bin succeeded
> [ 10.611909] radeon 0000:01:00.0: Direct firmware load for radeon/BTC_rlc.bin succeeded
> [ 10.611938] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_mc.bin succeeded
> [ 10.660599] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_smc.bin failed with error -2
> [ 10.660601] smc: error loading firmware "radeon/CAICOS_smc.bin"

You also need to make sure CAICOS_smc.bin is available.

> [ 10.661676] [drm] radeon: power management initialized
> [ 10.713666] radeon 0000:01:00.0: Direct firmware load for radeon/SUMO_uvd.bin failed with error -2
> [ 10.713668] radeon 0000:01:00.0: radeon_uvd: Can't load firmware "radeon/SUMO_uvd.bin"
> [ 10.713669] radeon 0000:01:00.0: failed UVD (-2) init.

And SUMO_uvd.bin.

> [ 10.714787] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
> [ 10.809213] radeon 0000:01:00.0: WB enabled
> [ 10.817528] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00
> [ 10.833755] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c
> [ 10.850330] radeon 0000:01:00.0: radeon: MSI limited to 32-bit
> [ 10.862154] radeon 0000:01:00.0: radeon: using MSI.
> [ 10.871930] [drm] radeon: irq initialized.
> [ 11.062028] [drm] Initialized radeon 2.50.0 20080528 for 0000:01:00.0 on minor 0
> [ 11.119723] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
> [ 11.411370] fbcon: radeondrmfb (fb0) is primary device
> [ 11.507252] radeon 0000:01:00.0: [drm] fb0: radeondrmfb frame buffer device
> [ 11.674028] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
> [ 11.834317] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
> [ 28.313041] snd_hda_intel 0000:01:00.1: bound 0000:01:00.0 (ops radeon_audio_component_bind_ops [radeon])
> [ 44.371991] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
> [ 44.428068] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
>
> followed by a crash some seconds after the system was up.
>
> The crashes vary, but often the kernel accesses non-canonical addresses or
> tries to map illegal physical addresses. Sometimes the system is just hanging,
> either with softlockups or without any further signs of being alive.
>
> I can easily reproduce the problem, so any debug patches to narrow down the
> problem are welcome.

There are still missing firmware files required for proper operation. Please fix them up.

Alex