On Tue, Dec 27, 2022 at 10:40 AM Alex Deucher <alexdeucher@xxxxxxxxx> wrote: > > On Sun, Dec 25, 2022 at 10:31 AM Christian König > <christian.koenig@xxxxxxx> wrote: > > > > Am 24.12.22 um 10:34 schrieb Thomas Zimmermann: > > > Hi > > > > > > Am 22.12.22 um 19:30 schrieb Mario Limonciello: > > >> One of the first thing that KMS drivers do during initialization is > > >> destroy the system firmware framebuffer by means of > > >> `drm_aperture_remove_conflicting_pci_framebuffers` > > >> > > >> This means that if for any reason the GPU failed to probe the user > > >> will be stuck with at best a screen frozen at the last thing that > > >> was shown before the KMS driver continued it's probe. > > >> > > >> The problem is most pronounced when new GPU support is introduced > > >> because users will need to have a recent linux-firmware snapshot > > >> on their system when they boot a kernel with matching support. > > >> > > >> However the problem is further exaggerated in the case of amdgpu because > > >> it has migrated to "IP discovery" where amdgpu will attempt to load > > >> on "ALL" AMD GPUs even if the driver is missing support for IP blocks > > >> contained in that GPU. > > >> > > >> IP discovery requires some probing and isn't run until after the > > >> framebuffer has been destroyed. > > >> > > >> This means a situation can occur where a user purchases a new GPU not > > >> yet supported by a distribution and when booting the installer it will > > >> "freeze" even if the distribution doesn't have the matching kernel > > >> support > > >> for those IP blocks. > > >> > > >> The perfect example of this is Ubuntu 21.10 and the new dGPUs just > > >> launched by AMD. The installation media ships with kernel 5.19 (which > > >> has IP discovery) but the amdgpu support for those IP blocks landed in > > >> kernel 6.0. The matching linux-firmware was released after 21.10's > > >> launch. > > >> The screen will freeze without nomodeset. Even if a user manages to > > >> install > > >> and then upgrades to kernel 6.0 after install they'll still have the > > >> problem of missing firmware, and the same experience. > > >> > > >> This is quite jarring for users, particularly if they don't know > > >> that they have to use "nomodeset" to install. > > >> > > >> To help the situation, allow drivers to re-run the init process for the > > >> firmware framebuffer during a failed probe. As this problem is most > > >> pronounced with amdgpu, this is the only driver changed. > > >> > > >> But if this makes sense more generally for other KMS drivers, the call > > >> can be added to the cleanup routine for those too. > > > > > > Just a quick drive-by comment: as Javier noted, at some point while > > > probing, your driver has changed the device' state and the system FB > > > will be gone. you cannot reestablish the sysfb after that. > > > > I was about to note exactly that as well. This effort here is > > unfortunately pretty pointless. > > > > > > > > You are, however free to read device state at any time, as long as it > > > has no side effects. > > > > > > So why not just move the call to > > > drm_aperture_remove_conflicting_pci_framebuffers() to a later point > > > when you know that your driver supports the hardware? That's the > > > solution we always proposed to this kind of problem. It's safe and > > > won't require any changes to the aperture helpers. > > > > if I'm not completely mistaken that's a little bit tricky. Currently > > it's not possible to read the discovery table before disabling the VGA > > and/or current framebuffer. > > > > We might be able to do this, but it's probably not easy. > > > It should be possible. It's populated by the PSP/VBIOS at power up, > so all you need to do is read the right offset in vram. For > firmwares, we currently read them from the filesystem from the > relevant IP code, but we could also just read it in amdgpu_discovery.c > when we walk the IP discovery table. I think something like this would do the trick: diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 2017b3466612..45aee27ab6b1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2141,6 +2141,11 @@ static int amdgpu_device_ip_early_init(struct amdgpu_device *adev) break; } + /* Get rid of things like offb */ + r = drm_aperture_remove_conflicting_pci_framebuffers(pdev, &amdgpu_kms_driver); + if (r) + return r; + if (amdgpu_has_atpx() && (amdgpu_is_atpx_hybrid() || amdgpu_has_atpx_dgpu_power_cntl()) && diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index b8cfa48fb296..4e74d7abc3c2 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -2123,11 +2123,6 @@ static int amdgpu_pci_probe(struct pci_dev *pdev, } #endif - /* Get rid of things like offb */ - ret = drm_aperture_remove_conflicting_pci_framebuffers(pdev, &amdgpu_kms_driver); - if (ret) - return ret; - adev = devm_drm_dev_alloc(&pdev->dev, &amdgpu_kms_driver, typeof(*adev), ddev); if (IS_ERR(adev)) return PTR_ERR(adev); > > Alex > > > > > > Regards, > > Christian. > > > > > > > > > > Best regards > > > Thomas > > > > > >> > > >> Here is a sample of what happens with missing GPU firmware and this > > >> series: > > >> > > >> [ 5.950056] amdgpu 0000:63:00.0: vgaarb: deactivate vga console > > >> [ 5.950114] amdgpu 0000:63:00.0: enabling device (0006 -> 0007) > > >> [ 5.950883] [drm] initializing kernel modesetting (YELLOW_CARP > > >> 0x1002:0x1681 0x17AA:0x22F1 0xD2). > > >> [ 5.952954] [drm] register mmio base: 0xB0A00000 > > >> [ 5.952958] [drm] register mmio size: 524288 > > >> [ 5.954633] [drm] add ip block number 0 <nv_common> > > >> [ 5.954636] [drm] add ip block number 1 <gmc_v10_0> > > >> [ 5.954637] [drm] add ip block number 2 <navi10_ih> > > >> [ 5.954638] [drm] add ip block number 3 <psp> > > >> [ 5.954639] [drm] add ip block number 4 <smu> > > >> [ 5.954641] [drm] add ip block number 5 <dm> > > >> [ 5.954642] [drm] add ip block number 6 <gfx_v10_0> > > >> [ 5.954643] [drm] add ip block number 7 <sdma_v5_2> > > >> [ 5.954644] [drm] add ip block number 8 <vcn_v3_0> > > >> [ 5.954645] [drm] add ip block number 9 <jpeg_v3_0> > > >> [ 5.954663] amdgpu 0000:63:00.0: amdgpu: Fetched VBIOS from VFCT > > >> [ 5.954666] amdgpu: ATOM BIOS: 113-REMBRANDT-X37 > > >> [ 5.954677] [drm] VCN(0) decode is enabled in VM mode > > >> [ 5.954678] [drm] VCN(0) encode is enabled in VM mode > > >> [ 5.954680] [drm] JPEG decode is enabled in VM mode > > >> [ 5.954681] amdgpu 0000:63:00.0: amdgpu: Trusted Memory Zone (TMZ) > > >> feature disabled as experimental (default) > > >> [ 5.954683] amdgpu 0000:63:00.0: amdgpu: PCIE atomic ops is not > > >> supported > > >> [ 5.954724] [drm] vm size is 262144 GB, 4 levels, block size is > > >> 9-bit, fragment size is 9-bit > > >> [ 5.954732] amdgpu 0000:63:00.0: amdgpu: VRAM: 512M > > >> 0x000000F400000000 - 0x000000F41FFFFFFF (512M used) > > >> [ 5.954735] amdgpu 0000:63:00.0: amdgpu: GART: 1024M > > >> 0x0000000000000000 - 0x000000003FFFFFFF > > >> [ 5.954738] amdgpu 0000:63:00.0: amdgpu: AGP: 267419648M > > >> 0x000000F800000000 - 0x0000FFFFFFFFFFFF > > >> [ 5.954747] [drm] Detected VRAM RAM=512M, BAR=512M > > >> [ 5.954750] [drm] RAM width 256bits LPDDR5 > > >> [ 5.954834] [drm] amdgpu: 512M of VRAM memory ready > > >> [ 5.954838] [drm] amdgpu: 15680M of GTT memory ready. > > >> [ 5.954873] [drm] GART: num cpu pages 262144, num gpu pages 262144 > > >> [ 5.955333] [drm] PCIE GART of 1024M enabled (table at > > >> 0x000000F41FC00000). > > >> [ 5.955502] amdgpu 0000:63:00.0: Direct firmware load for > > >> amdgpu/yellow_carp_toc.bin failed with error -2 > > >> [ 5.955505] amdgpu 0000:63:00.0: amdgpu: fail to request/validate > > >> toc microcode > > >> [ 5.955510] [drm:psp_sw_init [amdgpu]] *ERROR* Failed to load psp > > >> firmware! > > >> [ 5.955725] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* sw_init > > >> of IP block <psp> failed -2 > > >> [ 5.955952] amdgpu 0000:63:00.0: amdgpu: amdgpu_device_ip_init failed > > >> [ 5.955954] amdgpu 0000:63:00.0: amdgpu: Fatal error during GPU init > > >> [ 5.955957] amdgpu 0000:63:00.0: amdgpu: amdgpu: finishing device. > > >> [ 5.971162] efifb: probing for efifb > > >> [ 5.971281] efifb: showing boot graphics > > >> [ 5.974803] efifb: framebuffer at 0x910000000, using 20252k, total > > >> 20250k > > >> [ 5.974805] efifb: mode is 2880x1800x32, linelength=11520, pages=1 > > >> [ 5.974807] efifb: scrolling: redraw > > >> [ 5.974807] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0 > > >> [ 5.974974] Console: switching to colour frame buffer device 180x56 > > >> [ 5.978181] fb0: EFI VGA frame buffer device > > >> [ 5.978199] amdgpu: probe of 0000:63:00.0 failed with error -2 > > >> [ 5.978285] [drm] amdgpu: ttm finalized > > >> > > >> Now if the user loads the firmware into the system they can re-load the > > >> driver or re-attach using sysfs and it gracefully recovers. > > >> > > >> [ 665.080480] [drm] Initialized amdgpu 3.49.0 20150101 for > > >> 0000:63:00.0 on minor 0 > > >> [ 665.090075] fbcon: amdgpudrmfb (fb0) is primary device > > >> [ 665.090248] [drm] DSC precompute is not needed. > > >> > > >> Mario Limonciello (2): > > >> firmware: sysfb: Allow re-creating system framebuffer after init > > >> drm/amd: Re-create firmware framebuffer on failure to probe > > >> > > >> drivers/firmware/efi/sysfb_efi.c | 6 +++--- > > >> drivers/firmware/sysfb.c | 15 ++++++++++++++- > > >> drivers/firmware/sysfb_simplefb.c | 4 ++-- > > >> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 ++ > > >> include/linux/sysfb.h | 5 +++++ > > >> 5 files changed, 26 insertions(+), 6 deletions(-) > > >> > > >> > > >> base-commit: 830b3c68c1fb1e9176028d02ef86f3cf76aa2476 > > > > >