On Mon, 2024-09-09 at 13:29 +0200, Christian König wrote: > Am 09.09.24 um 11:44 schrieb Philipp Stanner: > > On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote: > > > From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx> > > > > > > Without the locking amdgpu currently can race > > > amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(), > > I would explicitly say "amdgpu's amdgpu_ctx_set_entity_priority() > > races > > through drm_sched_entity_modify_sched() with drm_sched_job_arm()". > > > > The actual issue then seems to be drm_sched_job_arm() calling > > drm_sched_entity_select_rq(). I would mention that, too. > > > > > > > leading to the > > > latter accesing potentially inconsitent entity->sched_list and > > > entity->num_sched_list pair. > > > > > > The comment on drm_sched_entity_modify_sched() however says: > > > > > > """ > > > * Note that this must be called under the same common lock for > > > @entity as > > > * drm_sched_job_arm() and drm_sched_entity_push_job(), or the > > > driver > > > needs to > > > * guarantee through some other means that this is never called > > > while > > > new jobs > > > * can be pushed to @entity. > > > """ > > > > > > It is unclear if that is referring to this race or something > > > else. > > That comment is indeed a bit awkward. Both > > drm_sched_entity_push_job() > > and drm_sched_job_arm() take rq_lock. But > > drm_sched_entity_modify_sched() doesn't. > > > > The comment was written in 981b04d968561. Interestingly, in > > drm_sched_entity_push_job(), this "common lock" is mentioned with > > the > > soft requirement word "should" and apparently is more about keeping > > sequence numbers in order when inserting. > > > > I tend to think that the issue discovered by you is unrelated to > > that > > comment. But if no one can make sense of the comment, should it > > maybe > > be removed? Confusing comment is arguably worse than no comment > > Agree, we probably mixed up in 981b04d968561 that submission needs a > common lock and that rq/priority needs to be protected by the > rq_lock. > > There is also the big FIXME in the drm_sched_entity documentation > pointing out that this is most likely not implemented correctly. > > I suggest to move the rq, priority and rq_lock fields together in the > drm_sched_entity structure and document that rq_lock is protecting > the two. That could also be a great opportunity for improving the lock naming: void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts) { /* * Both locks need to be grabbed, one to protect from entity->rq change * for entity from within concurrent drm_sched_entity_select_rq and the * other to update the rb tree structure. */ spin_lock(&entity->rq_lock); spin_lock(&entity->rq->lock); [...] P. > > Then audit the code if all users of rq and priority actually hold the > correct locks while reading and writing them. > > Regards, > Christian. > > > > > P. > > > > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx> > > > Fixes: b37aced31eb0 ("drm/scheduler: implement a function to > > > modify > > > sched list") > > > Cc: Christian König <christian.koenig@xxxxxxx> > > > Cc: Alex Deucher <alexander.deucher@xxxxxxx> > > > Cc: Luben Tuikov <ltuikov89@xxxxxxxxx> > > > Cc: Matthew Brost <matthew.brost@xxxxxxxxx> > > > Cc: David Airlie <airlied@xxxxxxxxx> > > > Cc: Daniel Vetter <daniel@xxxxxxxx> > > > Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx > > > Cc: <stable@xxxxxxxxxxxxxxx> # v5.7+ > > > --- > > > drivers/gpu/drm/scheduler/sched_entity.c | 2 ++ > > > 1 file changed, 2 insertions(+) > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c > > > b/drivers/gpu/drm/scheduler/sched_entity.c > > > index 58c8161289fe..ae8be30472cd 100644 > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c > > > @@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct > > > drm_sched_entity *entity, > > > { > > > WARN_ON(!num_sched_list || !sched_list); > > > > > > + spin_lock(&entity->rq_lock); > > > entity->sched_list = sched_list; > > > entity->num_sched_list = num_sched_list; > > > + spin_unlock(&entity->rq_lock); > > > } > > > EXPORT_SYMBOL(drm_sched_entity_modify_sched); > > > >