+cc Danilo +cc myself On Wed, 2025-01-08 at 09:19 +0100, Christian König wrote: > Am 07.01.25 um 16:21 schrieb Philipp Reisner: > > [...] > > > > The OOPS happens because the rq member of entity is NULL in > > > > drm_sched_job_arm() after the call to > > > > drm_sched_entity_select_rq(). > > > > > > > > In drm_sched_entity_select_rq(), the code considers that > > > > drb_sched_pick_best() might return a NULL value. When NULL, it > > > > assigns > > > > NULL to entity->rq even if it had a non-NULL value before. > > > > > > > > drm_sched_job_arm() does not deal with entities having a rq of > > > > NULL. > > > > > > > > Fix this by leaving the entity on the engine it was instead of > > > > assigning a NULL to its run queue member. > > > Well that is clearly not the correct approach to fixing this. So > > > clearly > > > a NAK from my side. > > > > > > The real question is why is amdgpu_cs_ioctl() called when all of > > > userspace should be frozen? > > > > > > Regards, > > > Christian. > > > > > Could the OOPS happen at resume time? Might it be that the kernel > > activates user-space > > before all the components of the GPU finished their wakeup? > > > > Maybe drm_sched_pick_best() returns NULL since no scheduler is > > ready yet? > > Yeah that is exactly what I meant. It looks like either the suspend > or > the resume order is somehow messed up. > > In other words either some application tries to submit GPU work while > it > should already been stopped, or it tries to submit GPU work before it > is > started. > > > Apart from whether amdgpu_cs_ioctl() should run at this point, I > > still think the > > suggested change improves the code. drm_sched_pick_best() can > > return NULL. > > drm_sched_entity_select_rq() can handle the NULL (partially). > > > > drm_sched_job_arm() crashes on an entity that has rq set to NULL. > > Which is actually not the worst outcome :) > > With your patch applied we don't immediately crash any more in the > submission path, but the whole system could then later deadlock > because > the core memory management waits for a GPU submission which never > returns. > > That is an even worse situation because you then can't pinpoint any > more > where that is coming from. > > > The handling of NULL values is half-baked. > > > > In my opinion, you should define if drm_sched_pick_best() may put a > > NULL into > > rq. If your answer is yes, it might put a NULL there; then, there > > should be a > > BUG_ON(!entity->rq) after the invocation of > > drm_sched_entity_select_rq(). > > If your answer is no, the BUG_ON() should be in > > drm_sched_pick_best(). > > Yeah good point. > > We might not want a BUG_ON(), that is only justified when we prevent > further damage (e.g. random data corruption or similar). > > I suggest using a WARN(!shed, "Submission without activated > sheduler!"). > This way the system has at least a chance of survival should the > scheduler become ready later on. > > On the other hand the BUG_ON() or the NULL pointer deref should only > kill the application thread which is submitting something before the > driver is resumed. So that might help to pinpoint where the actually > issue is. As I see it the BUG_ON() would just be a more pretty NULL pointer deref. If we agree that this is effectively a misuse of the scheduler API we probably want to add it to make it more pretty, though? @Philipp: BTW, I only just discovered this thread by coincidence. Please use get_maintainer. The scheduler currently has 4 maintainers, and none of them is on CC. Danke, P. > > Regards, > Christian. > > > > > That helps guys with zero domain knowledge, like me, to figure out > > how > > this is all > > supposed to work. > > > > best regards, > > Philipp >