Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

Philipp Stanner <pstanner@xxxxxxxxxx> · Mon, 13 Jan 2025 09:43:12 +0100

+cc Danilo
+cc myself

On Wed, 2025-01-08 at 09:19 +0100, Christian König wrote:
> Am 07.01.25 um 16:21 schrieb Philipp Reisner:
> > [...]
> > > > The OOPS happens because the rq member of entity is NULL in
> > > > drm_sched_job_arm() after the call to
> > > > drm_sched_entity_select_rq().
> > > > 
> > > > In drm_sched_entity_select_rq(), the code considers that
> > > > drb_sched_pick_best() might return a NULL value. When NULL, it
> > > > assigns
> > > > NULL to entity->rq even if it had a non-NULL value before.
> > > > 
> > > > drm_sched_job_arm() does not deal with entities having a rq of
> > > > NULL.
> > > > 
> > > > Fix this by leaving the entity on the engine it was instead of
> > > > assigning a NULL to its run queue member.
> > > Well that is clearly not the correct approach to fixing this. So
> > > clearly
> > > a NAK from my side.
> > > 
> > > The real question is why is amdgpu_cs_ioctl() called when all of
> > > userspace should be frozen?
> > > 
> > > Regards,
> > > Christian.
> > > 
> > Could the OOPS happen at resume time? Might it be that the kernel
> > activates user-space
> > before all the components of the GPU finished their wakeup?
> > 
> > Maybe drm_sched_pick_best() returns NULL since no scheduler is
> > ready yet?
> 
> Yeah that is exactly what I meant. It looks like either the suspend
> or 
> the resume order is somehow messed up.
> 
> In other words either some application tries to submit GPU work while
> it 
> should already been stopped, or it tries to submit GPU work before it
> is 
> started.
> 
> > Apart from whether amdgpu_cs_ioctl() should run at this point, I
> > still think the
> > suggested change improves the code. drm_sched_pick_best() can
> > return NULL.
> > drm_sched_entity_select_rq() can handle the NULL (partially).
> > 
> > drm_sched_job_arm() crashes on an entity that has rq set to NULL.
> 
> Which is actually not the worst outcome :)
> 
> With your patch applied we don't immediately crash any more in the 
> submission path, but the whole system could then later deadlock
> because 
> the core memory management waits for a GPU submission which never
> returns.
> 
> That is an even worse situation because you then can't pinpoint any
> more 
> where that is coming from.
> 
> > The handling of NULL values is half-baked.
> > 
> > In my opinion, you should define if drm_sched_pick_best() may put a
> > NULL into
> > rq. If your answer is yes, it might put a NULL there; then, there
> > should be a
> > BUG_ON(!entity->rq) after the invocation of
> > drm_sched_entity_select_rq().
> > If your answer is no, the BUG_ON() should be in
> > drm_sched_pick_best().
> 
> Yeah good point.
> 
> We might not want a BUG_ON(), that is only justified when we prevent 
> further damage (e.g. random data corruption or similar).
> 
> I suggest using a WARN(!shed, "Submission without activated
> sheduler!"). 
> This way the system has at least a chance of survival should the 
> scheduler become ready later on.
> 
> On the other hand the BUG_ON() or the NULL pointer deref should only 
> kill the application thread which is submitting something before the 
> driver is resumed. So that might help to pinpoint where the actually 
> issue is.

As I see it the BUG_ON() would just be a more pretty NULL pointer
deref. If we agree that this is effectively a misuse of the scheduler
API we probably want to add it to make it more pretty, though?

@Philipp:
BTW, I only just discovered this thread by coincidence. Please use
get_maintainer. The scheduler currently has 4 maintainers, and none of
them is on CC.

Danke,
P.

> 
> Regards,
> Christian.
> 
> > 
> > That helps guys with zero domain knowledge, like me, to figure out
> > how
> > this is all
> > supposed to work.
> > 
> > best regards,
> >   Philipp
>