Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

Philipp Reisner <philipp.reisner@xxxxxxxxxx> · Tue, 7 Jan 2025 16:21:57 +0100

[...]
> > The OOPS happens because the rq member of entity is NULL in
> > drm_sched_job_arm() after the call to drm_sched_entity_select_rq().
> >
> > In drm_sched_entity_select_rq(), the code considers that
> > drb_sched_pick_best() might return a NULL value. When NULL, it assigns
> > NULL to entity->rq even if it had a non-NULL value before.
> >
> > drm_sched_job_arm() does not deal with entities having a rq of NULL.
> >
> > Fix this by leaving the entity on the engine it was instead of
> > assigning a NULL to its run queue member.
>
> Well that is clearly not the correct approach to fixing this. So clearly
> a NAK from my side.
>
> The real question is why is amdgpu_cs_ioctl() called when all of
> userspace should be frozen?
>
> Regards,
> Christian.
>

Could the OOPS happen at resume time? Might it be that the kernel
activates user-space
before all the components of the GPU finished their wakeup?

Maybe drm_sched_pick_best() returns NULL since no scheduler is ready yet?

Apart from whether amdgpu_cs_ioctl() should run at this point, I still think the
suggested change improves the code. drm_sched_pick_best() can return NULL.
drm_sched_entity_select_rq() can handle the NULL (partially).

drm_sched_job_arm() crashes on an entity that has rq set to NULL.

The handling of NULL values is half-baked.

In my opinion, you should define if drm_sched_pick_best() may put a NULL into
rq. If your answer is yes, it might put a NULL there; then, there should be a
BUG_ON(!entity->rq) after the invocation of drm_sched_entity_select_rq().
If your answer is no, the BUG_ON() should be in drm_sched_pick_best().

That helps guys with zero domain knowledge, like me, to figure out how
this is all
supposed to work.

best regards,
 Philipp