On Wed, Dec 28, 2022 at 8:27 AM Rob Clark <robdclark@xxxxxxxxx> wrote:
>
> On Thu, Nov 17, 2022 at 7:12 AM Dmitry Osipenko
> <dmitry.osipenko@xxxxxxxxxxxxx> wrote:
> >
> > On 11/17/22 18:09, Christian König wrote:
> > > Am 17.11.22 um 15:41 schrieb Dmitry Osipenko:
> > >> [SNIP]
> > >>> drm_sched_entity_flush() should be called from the flush callback of
> > >>> the file_operations structure of panfrost. See amdgpu_flush() and
> > >>> amdgpu_ctx_mgr_entity_flush(). This makes sure that we wait for all
> > >>> entities of the process/file descriptor to be flushed out.
> > >>>
> > >>> drm_sched_entity_fini() must be called before you free the memory of
> > >>> the entity structure, or otherwise we would run into a use-after-free.
> > >> Right, drm_sched_entity_destroy() invokes these two functions, and
> > >> Panfrost uses drm_sched_entity_destroy().
> > >
> > > Then I have no idea what's going wrong here.
> > >
> > > The scheduler should trivially finish with the entity and call
> > > complete(&entity->entity_idle) in its main loop. No idea why this
> > > doesn't happen. Can you investigate?
> >
> > I'll take a closer look. Hoped you might have a quick idea of what's wrong :)
> >
>
> As Jonathan mentioned, the same thing is happening on msm. I can
> reproduce this by adding an assert in mesa (in this case, triggered
> after 100 draws) and running an app under gdb. After the assert is
> hit, if I try to exit mesa, it hangs.
>
> The problem is that we somehow call drm_sched_entity_kill() twice.
> The first call completes, but afterwards the entity_idle completion is
> no longer done, so the second call hangs forever. Maybe we should:
>
> ------
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index fe09e5be79bd..3d7c671d05e3 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -222,7 +226,6 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>  long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
>  {
>  	struct drm_gpu_scheduler *sched;
> -	struct task_struct *last_user;
>  	long ret = timeout;
>  
>  	if (!entity->rq)
> @@ -244,12 +247,6 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
>  			    drm_sched_entity_is_idle(entity));
>  	}
>  
> -	/* For killed process disable any more IBs enqueue right now */
> -	last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
> -	if ((!last_user || last_user == current->group_leader) &&
> -	    (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
> -		drm_sched_entity_kill(entity);
> -
>  	return ret;
>  }
>  EXPORT_SYMBOL(drm_sched_entity_flush);
> ----
>
> Maybe there is a better fix, but the special handling for SIGKILL seems
> dubious to me (vs just relying on the drm device fd close path). I
> wonder if that code path was tested at all?
>
> BR,
> -R
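
The hang Rob describes follows directly from struct completion
semantics: complete() adds one wake-up token, wait_for_completion()
consumes one. A toy illustration of that (not code from the thread,
just the stock completion API) shows why the second
drm_sched_entity_kill() sleeps forever once nothing will signal the
entity again:

#include <linux/completion.h>

/* "idle" stands in for entity->entity_idle. */
static struct completion idle;

static void illustrate_double_wait(void)
{
	init_completion(&idle);

	complete(&idle);            /* scheduler main loop signals idle once */
	wait_for_completion(&idle); /* first kill: consumes the token, returns */

	/*
	 * Second kill: no token left, and with the entity gone from its
	 * runqueue nobody calls complete() again, so this sleeps forever.
	 */
	wait_for_completion(&idle);
}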
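
For completeness, the wiring Christian suggests at the top of the
thread would look roughly like the sketch below for panfrost. This is
untested and modeled on amdgpu_flush()/amdgpu_ctx_mgr_entity_flush();
the panfrost_file_priv, sched_entity and NUM_JOB_SLOTS names are taken
from the current driver but should be treated as assumptions, and the
timeout is just borrowed from amdgpu's one-second bound:

#include <linux/fs.h>
#include <linux/jiffies.h>
#include <drm/drm_file.h>
#include <drm/gpu_scheduler.h>

/* Bound the wait, like amdgpu's MAX_WAIT_SCHED_ENTITY_Q_EMPTY. */
#define PANFROST_ENTITY_FLUSH_TIMEOUT msecs_to_jiffies(1000)

static int panfrost_drm_flush(struct file *filp, fl_owner_t id)
{
	struct drm_file *file = filp->private_data;
	struct panfrost_file_priv *priv = file->driver_priv;
	long timeout = PANFROST_ENTITY_FLUSH_TIMEOUT;
	int i;

	/* Wait (bounded) for every entity of this fd to go idle. */
	for (i = 0; i < NUM_JOB_SLOTS && timeout; i++)
		timeout = drm_sched_entity_flush(&priv->sched_entity[i],
						 timeout);

	return 0;
}

Hooking that up would mean spelling out panfrost's file_operations with
.flush set (DEFINE_DRM_GEM_FOPS doesn't provide a .flush), the way
amdgpu does with amdgpu_flush().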