RE: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

"Chen, Guchun" <Guchun.Chen@xxxxxxx> · Tue, 31 Jan 2023 06:36:39 +0000

Hi Piccoli,

I agree with Alex's point, using ring->sched.name for such check is not a good way. BTW, can you please attach a full dmesg long in bad case to help me understand more?

Regards,
Guchun

-----Original Message-----
From: Alex Deucher <alexdeucher@xxxxxxxxx> 
Sent: Tuesday, January 31, 2023 6:30 AM
To: Guilherme G. Piccoli <gpiccoli@xxxxxxxxxx>
Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; kernel@xxxxxxxxxxxx; Chen, Guchun <Guchun.Chen@xxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx; Tuikov, Luben <Luben.Tuikov@xxxxxxx>; Limonciello, Mario <Mario.Limonciello@xxxxxxx>; kernel-dev@xxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

On Mon, Jan 30, 2023 at 4:51 PM Guilherme G. Piccoli <gpiccoli@xxxxxxxxxx> wrote:
>
> + Luben
>
> (sorry, missed that in the first submission).
>
> On 30/01/2023 18:45, Guilherme G. Piccoli wrote:
> > Currently amdgpu calls drm_sched_fini() from the fence driver sw 
> > fini routine - such function is expected to be called only after the 
> > respective init function - drm_sched_init() - was executed successfully.
> >
> > Happens that we faced a driver probe failure in the Steam Deck 
> > recently, and the function drm_sched_fini() was called even without 
> > its counter-part had been previously called, causing the following oops:
> >
> > amdgpu: probe of 0000:04:00.0 failed with error -110
> > BUG: kernel NULL pointer dereference, address: 0000000000000090 PGD 
> > 0 P4D 0
> > Oops: 0002 [#1] PREEMPT SMP NOPTI
> > CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli 
> > #338 Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
> > RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] [...] Call Trace:
> >  <TASK>
> >  amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
> >  amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
> >  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
> >  devm_drm_dev_init_release+0x49/0x70
> >  [...]
> >
> > To prevent that, check if the drm_sched was properly initialized for 
> > a given ring before calling its fini counter-part.
> >
> > Notice ideally we'd use sched.ready for that; such field is set as 
> > the latest thing on drm_sched_init(). But amdgpu seems to "override" 
> > the meaning of such field - in the above oops for example, it was a 
> > GFX ring causing the crash, and the sched.ready field was set to 
> > true in the ring init routine, regardless of the state of the DRM scheduler. Hence, we ended-up using another sched field.
> >> > Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence 
> >> > driver fini in s3 test (v2)")
> > Cc: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
> > Cc: Guchun Chen <guchun.chen@xxxxxxx>
> > Cc: Mario Limonciello <mario.limonciello@xxxxxxx>
> > Signed-off-by: Guilherme G. Piccoli <gpiccoli@xxxxxxxxxx>
> > ---
> >
> >
> > Hi folks, first of all thanks in advance for reviews / comments!
> > Notice that I've used the Fixes tag more in the sense to bring it to 
> > stable, I didn't find a good patch candidate that added the call to 
> > drm_sched_fini(), was reaching way too old commits...so
> > 067f44c8b459 seems a good candidate - or maybe not?
> >
> > Now, with regards sched.ready, spent a bit of time to figure what 
> > was happening...would be feasible maybe to stop using that to mark 
> > some kind ring status? I think it should be possible to add a flag 
> > to the ring structure for that, and free sched.ready from being 
> > manipulate by the amdgpu driver, what's your thoughts on that?

It's been a while, but IIRC, we used to have a ring->ready field in the driver which at some point got migrated out of the driver into the GPU scheduler and the driver side code never got cleaned up.  I think we should probably just drop the driver messing with that field and leave it up to the drm scheduler.

Alex

> >
> > I could try myself, but first of course I'd like to raise the 
> > "temperature" on this topic and check if somebody is already working 
> > on that.
> >
> > Cheers,
> >
> > Guilherme
> >
> >
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > index 00444203220d..e154eb8241fb 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
> >               if (!ring || !ring->fence_drv.initialized)
> >                       continue;
> >
> > -             if (!ring->no_scheduler)
> > +             /*
> > +              * Notice we check for sched.name since there's some
> > +              * override on the meaning of sched.ready by amdgpu.
> > +              * The natural check would be sched.ready, which is
> > +              * set as drm_sched_init() finishes...
> > +              */
> > +             if (!ring->no_scheduler && ring->sched.name)
> >                       drm_sched_fini(&ring->sched);
> >
> >               for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)