On Mon, Jan 30, 2023 at 4:51 PM Guilherme G. Piccoli <gpiccoli@xxxxxxxxxx> wrote: > > + Luben > > (sorry, missed that in the first submission). > > On 30/01/2023 18:45, Guilherme G. Piccoli wrote: > > Currently amdgpu calls drm_sched_fini() from the fence driver sw fini > > routine - such function is expected to be called only after the > > respective init function - drm_sched_init() - was executed successfully. > > > > Happens that we faced a driver probe failure in the Steam Deck > > recently, and the function drm_sched_fini() was called even without > > its counter-part had been previously called, causing the following oops: > > > > amdgpu: probe of 0000:04:00.0 failed with error -110 > > BUG: kernel NULL pointer dereference, address: 0000000000000090 > > PGD 0 P4D 0 > > Oops: 0002 [#1] PREEMPT SMP NOPTI > > CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338 > > Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022 > > RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] > > [...] > > Call Trace: > > <TASK> > > amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu] > > amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu] > > amdgpu_driver_release_kms+0x16/0x30 [amdgpu] > > devm_drm_dev_init_release+0x49/0x70 > > [...] > > > > To prevent that, check if the drm_sched was properly initialized for a > > given ring before calling its fini counter-part. > > > > Notice ideally we'd use sched.ready for that; such field is set as the latest > > thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such > > field - in the above oops for example, it was a GFX ring causing the crash, and > > the sched.ready field was set to true in the ring init routine, regardless of > > the state of the DRM scheduler. Hence, we ended-up using another sched field. > >> > Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)") > > Cc: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> > > Cc: Guchun Chen <guchun.chen@xxxxxxx> > > Cc: Mario Limonciello <mario.limonciello@xxxxxxx> > > Signed-off-by: Guilherme G. Piccoli <gpiccoli@xxxxxxxxxx> > > --- > > > > > > Hi folks, first of all thanks in advance for reviews / comments! > > Notice that I've used the Fixes tag more in the sense to bring it > > to stable, I didn't find a good patch candidate that added the > > call to drm_sched_fini(), was reaching way too old commits...so > > 067f44c8b459 seems a good candidate - or maybe not? > > > > Now, with regards sched.ready, spent a bit of time to figure what > > was happening...would be feasible maybe to stop using that to > > mark some kind ring status? I think it should be possible to add a > > flag to the ring structure for that, and free sched.ready from > > being manipulate by the amdgpu driver, what's your thoughts on that? It's been a while, but IIRC, we used to have a ring->ready field in the driver which at some point got migrated out of the driver into the GPU scheduler and the driver side code never got cleaned up. I think we should probably just drop the driver messing with that field and leave it up to the drm scheduler. Alex > > > > I could try myself, but first of course I'd like to raise the > > "temperature" on this topic and check if somebody is already working > > on that. > > > > Cheers, > > > > Guilherme > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++- > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > > index 00444203220d..e154eb8241fb 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > > @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev) > > if (!ring || !ring->fence_drv.initialized) > > continue; > > > > - if (!ring->no_scheduler) > > + /* > > + * Notice we check for sched.name since there's some > > + * override on the meaning of sched.ready by amdgpu. > > + * The natural check would be sched.ready, which is > > + * set as drm_sched_init() finishes... > > + */ > > + if (!ring->no_scheduler && ring->sched.name) > > drm_sched_fini(&ring->sched); > > > > for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)