On Mon, 2024-09-09 at 13:37 +0100, Tvrtko Ursulin wrote:
> 
> On 09/09/2024 13:18, Christian König wrote:
> > Am 09.09.24 um 14:13 schrieb Philipp Stanner:
> > > On Mon, 2024-09-09 at 13:29 +0200, Christian König wrote:
> > > > Am 09.09.24 um 11:44 schrieb Philipp Stanner:
> > > > > On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
> > > > > > From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
> > > > > > 
> > > > > > Without the locking amdgpu currently can race
> > > > > > amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),
> > > > > 
> > > > > I would explicitly say "amdgpu's amdgpu_ctx_set_entity_priority()
> > > > > races through drm_sched_entity_modify_sched() with
> > > > > drm_sched_job_arm()".
> > > > > 
> > > > > The actual issue then seems to be drm_sched_job_arm() calling
> > > > > drm_sched_entity_select_rq(). I would mention that, too.
> > > > > 
> > > > > > leading to the latter accessing a potentially inconsistent
> > > > > > entity->sched_list and entity->num_sched_list pair.
> > > > > > 
> > > > > > The comment on drm_sched_entity_modify_sched() however says:
> > > > > > 
> > > > > > """
> > > > > >  * Note that this must be called under the same common lock for @entity as
> > > > > >  * drm_sched_job_arm() and drm_sched_entity_push_job(), or the driver needs to
> > > > > >  * guarantee through some other means that this is never called while new jobs
> > > > > >  * can be pushed to @entity.
> > > > > > """
> > > > > > 
> > > > > > It is unclear if that is referring to this race or something
> > > > > > else.
> > > > > 
> > > > > That comment is indeed a bit awkward. Both
> > > > > drm_sched_entity_push_job() and drm_sched_job_arm() take rq_lock,
> > > > > but drm_sched_entity_modify_sched() doesn't.
> > > > > 
> > > > > The comment was written in 981b04d968561. Interestingly, in
> > > > > drm_sched_entity_push_job(), this "common lock" is mentioned with
> > > > > the soft requirement word "should" and apparently is more about
> > > > > keeping sequence numbers in order when inserting.
> > > > > 
> > > > > I tend to think that the issue discovered by you is unrelated to
> > > > > that comment. But if no one can make sense of the comment, should
> > > > > it maybe be removed? A confusing comment is arguably worse than
> > > > > no comment.
> > > > 
> > > > Agree, we probably mixed up in 981b04d968561 that submission needs
> > > > a common lock and that rq/priority needs to be protected by the
> > > > rq_lock.
> > > > 
> > > > There is also the big FIXME in the drm_sched_entity documentation
> > > > pointing out that this is most likely not implemented correctly.
> > > > 
> > > > I suggest moving the rq, priority and rq_lock fields together in
> > > > the drm_sched_entity structure and documenting that rq_lock
> > > > protects the two.
> > > 
> > > That could also be a great opportunity for improving the lock naming:
> > 
> > Well, that comment made me laugh, because I pointed out the same thing
> > when the scheduler came out ~8 years ago and nobody has cared about it
> > since then.
> > 
> > But yeah, completely agree :)
> 
> Maybe, but we need to keep in sight the fact that some of these fixes
> may be good to backport. In which case re-naming exercises are best
> left to follow.

My argument, basically.
It's good if fixes and other improvements are kept separate, in general,
unless there is a practical / good reason not to.

> 
> Also..
> 
> > > void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
> > > {
> > > 	/*
> > > 	 * Both locks need to be grabbed, one to protect from entity->rq change
> > > 	 * for entity from within concurrent drm_sched_entity_select_rq and the
> > > 	 * other to update the rb tree structure.
> > > 	 */
> > > 	spin_lock(&entity->rq_lock);
> > > 	spin_lock(&entity->rq->lock);
> 
> .. I agree this is quite unreadable and my initial reaction was a
> similar ugh. However.. what names would you guys suggest, and for which
> lock, to make this better without lessening the logic of naming each
> one individually?

According to the documentation, drm_sched_rq.lock does not protect the
entire runqueue, but is only there "@lock: to modify the entities list."

So I would keep drm_sched_entity.rq_lock as it is, because it indeed
protects the entire runqueue.

And drm_sched_rq.lock could be named "entities_lock" or
"entities_list_lock" or something. That's debatable, but it should be
something that highlights that this lock is not for locking the entire
runqueue, as the one in the entity apparently is.

Cheers,
P.

> 
> Regards,
> 
> Tvrtko
> 
> > [...]
> > 
> > > > > P.
> > > > > 
> > > > Then audit the code to check whether all users of rq and priority
> > > > actually hold the correct locks while reading and writing them.
> > > > 
> > > > Regards,
> > > > Christian.
> > > > 
> > > > > P.
> > > > > 
> > > > > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
> > > > > > Fixes: b37aced31eb0 ("drm/scheduler: implement a function to modify sched list")
> > > > > > Cc: Christian König <christian.koenig@xxxxxxx>
> > > > > > Cc: Alex Deucher <alexander.deucher@xxxxxxx>
> > > > > > Cc: Luben Tuikov <ltuikov89@xxxxxxxxx>
> > > > > > Cc: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > > > Cc: David Airlie <airlied@xxxxxxxxx>
> > > > > > Cc: Daniel Vetter <daniel@xxxxxxxx>
> > > > > > Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx
> > > > > > Cc: <stable@xxxxxxxxxxxxxxx> # v5.7+
> > > > > > ---
> > > > > >  drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
> > > > > >  1 file changed, 2 insertions(+)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > index 58c8161289fe..ae8be30472cd 100644
> > > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > @@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> > > > > >  {
> > > > > >  	WARN_ON(!num_sched_list || !sched_list);
> > > > > > 
> > > > > > +	spin_lock(&entity->rq_lock);
> > > > > >  	entity->sched_list = sched_list;
> > > > > >  	entity->num_sched_list = num_sched_list;
> > > > > > +	spin_unlock(&entity->rq_lock);
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(drm_sched_entity_modify_sched);
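
To make the regrouping / renaming idea a bit more concrete, here is a rough,
untested sketch of what I have in mind (Christian's field grouping plus the
rename; the kernel-doc wording and the exact names are only suggestions, not
a proposed patch):

	struct drm_sched_entity {
		/* ... other members unchanged ... */

		/**
		 * @rq_lock:
		 *
		 * Protects @rq and @priority, and with the fix above also
		 * @sched_list and @num_sched_list, e.g. against
		 * drm_sched_entity_modify_sched() racing with
		 * drm_sched_entity_select_rq().
		 */
		spinlock_t			rq_lock;

		/** @rq: Runqueue the entity is currently queued on. */
		struct drm_sched_rq		*rq;

		/** @priority: Scheduling priority of the entity. */
		enum drm_sched_priority		priority;
	};

	struct drm_sched_rq {
		/* ... other members unchanged ... */

		/**
		 * @entities_lock: Only protects the runqueue's own
		 * bookkeeping, i.e. the @entities list (what is plain
		 * @lock today).
		 */
		spinlock_t			entities_lock;
	};

With that, the two spin_lock() calls quoted further up would read

	spin_lock(&entity->rq_lock);
	spin_lock(&entity->rq->entities_lock);

which hopefully makes it more obvious that these are two different locks
with two different scopes.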