Re: [RFC 1/4] drm/sched: Add locking to drm_sched_entity_modify_sched

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, 2024-09-09 at 13:37 +0100, Tvrtko Ursulin wrote:
> 
> On 09/09/2024 13:18, Christian König wrote:
> > Am 09.09.24 um 14:13 schrieb Philipp Stanner:
> > > On Mon, 2024-09-09 at 13:29 +0200, Christian König wrote:
> > > > Am 09.09.24 um 11:44 schrieb Philipp Stanner:
> > > > > On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
> > > > > > From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
> > > > > > 
> > > > > > Without the locking amdgpu currently can race
> > > > > > amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),
> > > > > I would explicitly say "amdgpu's
> > > > > amdgpu_ctx_set_entity_priority()
> > > > > races
> > > > > through drm_sched_entity_modify_sched() with
> > > > > drm_sched_job_arm()".
> > > > > 
> > > > > The actual issue then seems to be drm_sched_job_arm() calling
> > > > > drm_sched_entity_select_rq(). I would mention that, too.
> > > > > 
> > > > > 
> > > > > > leading to the
> > > > > > latter accesing potentially inconsitent entity->sched_list
> > > > > > and
> > > > > > entity->num_sched_list pair.
> > > > > > 
> > > > > > The comment on drm_sched_entity_modify_sched() however
> > > > > > says:
> > > > > > 
> > > > > > """
> > > > > >    * Note that this must be called under the same common
> > > > > > lock for
> > > > > > @entity as
> > > > > >    * drm_sched_job_arm() and drm_sched_entity_push_job(),
> > > > > > or the
> > > > > > driver
> > > > > > needs to
> > > > > >    * guarantee through some other means that this is never
> > > > > > called
> > > > > > while
> > > > > > new jobs
> > > > > >    * can be pushed to @entity.
> > > > > > """
> > > > > > 
> > > > > > It is unclear if that is referring to this race or
> > > > > > something
> > > > > > else.
> > > > > That comment is indeed a bit awkward. Both
> > > > > drm_sched_entity_push_job()
> > > > > and drm_sched_job_arm() take rq_lock. But
> > > > > drm_sched_entity_modify_sched() doesn't.
> > > > > 
> > > > > The comment was written in 981b04d968561. Interestingly, in
> > > > > drm_sched_entity_push_job(), this "common lock" is mentioned
> > > > > with
> > > > > the
> > > > > soft requirement word "should" and apparently is more about
> > > > > keeping
> > > > > sequence numbers in order when inserting.
> > > > > 
> > > > > I tend to think that the issue discovered by you is unrelated
> > > > > to
> > > > > that
> > > > > comment. But if no one can make sense of the comment, should
> > > > > it
> > > > > maybe
> > > > > be removed? Confusing comment is arguably worse than no
> > > > > comment
> > > > Agree, we probably mixed up in 981b04d968561 that submission
> > > > needs a
> > > > common lock and that rq/priority needs to be protected by the
> > > > rq_lock.
> > > > 
> > > > There is also the big FIXME in the drm_sched_entity
> > > > documentation
> > > > pointing out that this is most likely not implemented
> > > > correctly.
> > > > 
> > > > I suggest to move the rq, priority and rq_lock fields together
> > > > in the
> > > > drm_sched_entity structure and document that rq_lock is
> > > > protecting
> > > > the two.
> > > That could also be a great opportunity for improving the lock
> > > naming:
> > 
> > Well that comment made me laugh because I point out the same when
> > the 
> > scheduler came out ~8years ago and nobody cared about it since
> > then.
> > 
> > But yeah completely agree :)
> 
> Maybe, but we need to keep in sight the fact some of these fixes may
> be 
> good to backport. In which case re-naming exercises are best left to
> follow.

My argument basically. It's good if fixes and other improvements are
separated, in general, unless there is a practical / good reason not
to.

> 
> Also..
> 
> > > void drm_sched_rq_update_fifo(struct drm_sched_entity *entity,
> > > ktime_t 
> > > ts)
> > > {
> > >     /*
> > >      * Both locks need to be grabbed, one to protect from entity-
> > > >rq 
> > > change
> > >      * for entity from within concurrent
> > > drm_sched_entity_select_rq 
> > > and the
> > >      * other to update the rb tree structure.
> > >      */
> > >     spin_lock(&entity->rq_lock);
> > >     spin_lock(&entity->rq->lock);
> 
> .. I agree this is quite unredable and my initial reaction was a
> similar 
> ugh. However.. What names would you guys suggest and for what to make
> this better and not lessen the logic of naming each individually?

According to the documentation, drm_sched_rq.lock does not protect the
entire runque, but "@lock: to modify the entities list."

So I would keep drm_sched_entity.rq_lock as it is, because it indeed
protects the entire runqueue.

And drm_sched_rq.lock could be named "entities_lock" or
"entities_list_lock" or something. That's debatable, but it should be
something that highlights that this lock is not for locking the entire
runque as the one in the entity apparently is.


Cheers,
P.

> 
> Regards,
> 
> Tvrtko
> 
> > > [...]
> > > 
> > > 
> > > P.
> > > 
> > > 
> > > > Then audit the code if all users of rq and priority actually
> > > > hold the
> > > > correct locks while reading and writing them.
> > > > 
> > > > Regards,
> > > > Christian.
> > > > 
> > > > > P.
> > > > > 
> > > > > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
> > > > > > Fixes: b37aced31eb0 ("drm/scheduler: implement a function
> > > > > > to
> > > > > > modify
> > > > > > sched list")
> > > > > > Cc: Christian König <christian.koenig@xxxxxxx>
> > > > > > Cc: Alex Deucher <alexander.deucher@xxxxxxx>
> > > > > > Cc: Luben Tuikov <ltuikov89@xxxxxxxxx>
> > > > > > Cc: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > > > Cc: David Airlie <airlied@xxxxxxxxx>
> > > > > > Cc: Daniel Vetter <daniel@xxxxxxxx>
> > > > > > Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx
> > > > > > Cc: <stable@xxxxxxxxxxxxxxx> # v5.7+
> > > > > > ---
> > > > > >    drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
> > > > > >    1 file changed, 2 insertions(+)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > index 58c8161289fe..ae8be30472cd 100644
> > > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > @@ -133,8 +133,10 @@ void
> > > > > > drm_sched_entity_modify_sched(struct
> > > > > > drm_sched_entity *entity,
> > > > > >    {
> > > > > >        WARN_ON(!num_sched_list || !sched_list);
> > > > > > +    spin_lock(&entity->rq_lock);
> > > > > >        entity->sched_list = sched_list;
> > > > > >        entity->num_sched_list = num_sched_list;
> > > > > > +    spin_unlock(&entity->rq_lock);
> > > > > >    }
> > > > > >    EXPORT_SYMBOL(drm_sched_entity_modify_sched);
> > 
> 






[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux