On 19/04/2023 10:44, Lucas Stach wrote:
> Hi Steven,
>
> On Wednesday, 19.04.2023 at 10:39 +0100, Steven Price wrote:
>> On 18/04/2023 11:04, Danilo Krummrich wrote:
>>> It already happend a few times that patches slipped through which
>>> implemented access to an entity through a job that was already removed
>>> from the entities queue. Since jobs and entities might have different
>>> lifecycles, this can potentially cause UAF bugs.
>>>
>>> In order to make it obvious that a jobs entity pointer shouldn't be
>>> accessed after drm_sched_entity_pop_job() was called successfully, set
>>> the jobs entity pointer to NULL once the job is removed from the entity
>>> queue.
>>>
>>> Moreover, debugging a potential NULL pointer dereference is way easier
>>> than potentially corrupted memory through a UAF.
>>>
>>> Signed-off-by: Danilo Krummrich <dakr@xxxxxxxxxx>
>>
>> This triggers a splat for me (with Panfrost driver), the cause of which
>> is the following code in drm_sched_get_cleanup_job():
>>
>> 	if (job) {
>> 		job->entity->elapsed_ns += ktime_to_ns(
>> 			ktime_sub(job->s_fence->finished.timestamp,
>> 				  job->s_fence->scheduled.timestamp));
>> 	}
>>
>> which indeed is accessing entity after the job has been returned from
>> drm_sched_entity_pop_job(). And obviously job->entity is a NULL pointer
>> with this patch.
>>
>> I'm afraid I don't fully understand the lifecycle so I'm not sure if
>> this is simply exposing an existing bug in drm_sched_get_cleanup_job()
>> or if this commit needs to be reverted.
>>
> Not sure which tree you are on. The offending commit has been reverted
> in 6.3-rc5.

This is in drm-misc-next - I'm not sure which "offending commit" you
are referring to. I'm referring to:

96c7c2f4d5bd ("drm/scheduler: set entity to NULL in drm_sched_entity_pop_job()")

which was merged yesterday to drm-misc-next (and is currently the top
commit). Is there another commit which has been reverted elsewhere
which is conflicting?

Steve

> Regards,
> Lucas
>
>> Thanks,
>>
>> Steve
>>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_entity.c | 6 ++++++
>>>  drivers/gpu/drm/scheduler/sched_main.c   | 4 ++++
>>>  2 files changed, 10 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index 15d04a0ec623..a9c6118e534b 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -448,6 +448,12 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>>  		drm_sched_rq_update_fifo(entity, next->submit_ts);
>>>  	}
>>>
>>> +	/* Jobs and entities might have different lifecycles. Since we're
>>> +	 * removing the job from the entities queue, set the jobs entity pointer
>>> +	 * to NULL to prevent any future access of the entity through this job.
>>> +	 */
>>> +	sched_job->entity = NULL;
>>> +
>>>  	return sched_job;
>>>  }
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 9b16480686f6..e89a3e469cd5 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -42,6 +42,10 @@
>>>   * the hardware.
>>>   *
>>>   * The jobs in a entity are always scheduled in the order that they were pushed.
>>> + *
>>> + * Note that once a job was taken from the entities queue and pushed to the
>>> + * hardware, i.e. the pending queue, the entity must not be referenced anymore
>>> + * through the jobs entity pointer.
>>>   */
>>>
>>>  #include <linux/kthread.h>
>>
>
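
To make the conflict concrete, here is a minimal, hypothetical sketch (not
a proposed fix) of what a defensive guard in drm_sched_get_cleanup_job()
would look like if both the per-entity elapsed_ns accounting and the
"set entity to NULL on pop" change stayed in the tree:

	/*
	 * Hypothetical guard only: job->entity is cleared by
	 * drm_sched_entity_pop_job(), so by the time the job reaches the
	 * cleanup path this check always fails; the NULL dereference is
	 * avoided, but the per-entity time accounting is silently skipped.
	 */
	if (job && job->entity) {
		job->entity->elapsed_ns += ktime_to_ns(
			ktime_sub(job->s_fence->finished.timestamp,
				  job->s_fence->scheduled.timestamp));
	}

In other words, a NULL check would paper over the splat Steven reports but
would also defeat the accounting entirely, which suggests one of the two
commits needs to be adjusted or dropped rather than simply guarding the
access.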