[...] > > The OOPS happens because the rq member of entity is NULL in > > drm_sched_job_arm() after the call to drm_sched_entity_select_rq(). > > > > In drm_sched_entity_select_rq(), the code considers that > > drb_sched_pick_best() might return a NULL value. When NULL, it assigns > > NULL to entity->rq even if it had a non-NULL value before. > > > > drm_sched_job_arm() does not deal with entities having a rq of NULL. > > > > Fix this by leaving the entity on the engine it was instead of > > assigning a NULL to its run queue member. > > Well that is clearly not the correct approach to fixing this. So clearly > a NAK from my side. > > The real question is why is amdgpu_cs_ioctl() called when all of > userspace should be frozen? > > Regards, > Christian. > Could the OOPS happen at resume time? Might it be that the kernel activates user-space before all the components of the GPU finished their wakeup? Maybe drm_sched_pick_best() returns NULL since no scheduler is ready yet? Apart from whether amdgpu_cs_ioctl() should run at this point, I still think the suggested change improves the code. drm_sched_pick_best() can return NULL. drm_sched_entity_select_rq() can handle the NULL (partially). drm_sched_job_arm() crashes on an entity that has rq set to NULL. The handling of NULL values is half-baked. In my opinion, you should define if drm_sched_pick_best() may put a NULL into rq. If your answer is yes, it might put a NULL there; then, there should be a BUG_ON(!entity->rq) after the invocation of drm_sched_entity_select_rq(). If your answer is no, the BUG_ON() should be in drm_sched_pick_best(). That helps guys with zero domain knowledge, like me, to figure out how this is all supposed to work. best regards, Philipp