Re: [RFC 1/4] drm/sched: Add locking to drm_sched_entity_modify_sched

On 09/09/2024 13:46, Philipp Stanner wrote:
On Mon, 2024-09-09 at 13:37 +0100, Tvrtko Ursulin wrote:

On 09/09/2024 13:18, Christian König wrote:
Am 09.09.24 um 14:13 schrieb Philipp Stanner:
On Mon, 2024-09-09 at 13:29 +0200, Christian König wrote:
Am 09.09.24 um 11:44 schrieb Philipp Stanner:
On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>

Without the locking amdgpu currently can race amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),

I would explicitly say "amdgpu's amdgpu_ctx_set_entity_priority() races through drm_sched_entity_modify_sched() with drm_sched_job_arm()".

The actual issue then seems to be drm_sched_job_arm() calling drm_sched_entity_select_rq(). I would mention that, too.

leading to the latter accessing a potentially inconsistent entity->sched_list and entity->num_sched_list pair.

The comment on drm_sched_entity_modify_sched() however says:

"""
 * Note that this must be called under the same common lock for @entity as
 * drm_sched_job_arm() and drm_sched_entity_push_job(), or the driver needs to
 * guarantee through some other means that this is never called while new jobs
 * can be pushed to @entity.
"""

It is unclear if that is referring to this race or something else.
That comment is indeed a bit awkward. Both drm_sched_entity_push_job() and drm_sched_job_arm() take rq_lock. But drm_sched_entity_modify_sched() doesn't.

The comment was written in 981b04d968561. Interestingly, in drm_sched_entity_push_job() this "common lock" is mentioned with the soft requirement word "should" and apparently is more about keeping sequence numbers in order when inserting.

I tend to think that the issue discovered by you is unrelated to that comment. But if no one can make sense of the comment, should it maybe be removed? A confusing comment is arguably worse than no comment.
Agree, we probably mixed up in 981b04d968561 that submission needs a common lock and that rq/priority needs to be protected by the rq_lock.

There is also the big FIXME in the drm_sched_entity documentation pointing out that this is most likely not implemented correctly.

I suggest moving the rq, priority and rq_lock fields together in the drm_sched_entity structure and documenting that rq_lock protects the two.
That could also be a great opportunity for improving the lock naming:

Well, that comment made me laugh because I pointed out the same when the scheduler came out ~8 years ago and nobody has cared about it since then.

But yeah, completely agree :)

Maybe, but we need to keep in sight the fact that some of these fixes may be good to backport, in which case re-naming exercises are best left to follow.

My argument basically. It's good if fixes and other improvements are separated, in general, unless there is a practical / good reason not to.

Ah cool, I am happy to add follow-up patches after the fixes.

Also..

void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
{
	/*
	 * Both locks need to be grabbed, one to protect from entity->rq change
	 * for entity from within concurrent drm_sched_entity_select_rq and the
	 * other to update the rb tree structure.
	 */
	spin_lock(&entity->rq_lock);
	spin_lock(&entity->rq->lock);

.. I agree this is quite unreadable and my initial reaction was a similar ugh. However.. what names would you guys suggest, and for what, to make this better without lessening the logic of naming each individually?

According to the documentation, drm_sched_rq.lock does not protect the entire runqueue, but "@lock: to modify the entities list."

So I would keep drm_sched_entity.rq_lock as it is, because it indeed protects the entire runqueue.

Agreed on entity->rq_lock.

And drm_sched_rq.lock could be named "entities_lock" or "entities_list_lock" or something. That's debatable, but it should be something that highlights that this lock is not for locking the entire runqueue, as the one in the entity apparently is.

AFAICT it also protects rq->current_entity and rq->rb_tree_root, in which case it is a bit more tricky. Only rq->sched is outside its scope. Hm. Maybe just re-arrange the struct to be like:

struct drm_sched_rq {
	struct drm_gpu_scheduler	*sched;

	spinlock_t			lock;
	/* Following members are protected by the @lock: */
	struct list_head		entities;
	struct drm_sched_entity		*current_entity;
	struct rb_root_cached		rb_tree_root;
};

I have no ideas for better naming. But this would be in line with Christian's suggestion for tidying the order in drm_sched_entity.

I am also not sure what is the point of setting rq->current_entity in drm_sched_rq_select_entity_fifo().

Regards,

Tvrtko



Cheers,
P.


Regards,

Tvrtko

[...]


P.


Then audit the code if all users of rq and priority actually
hold the
correct locks while reading and writing them.

Regards,
Christian.

P.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
Fixes: b37aced31eb0 ("drm/scheduler: implement a function to modify sched list")
Cc: Christian König <christian.koenig@xxxxxxx>
Cc: Alex Deucher <alexander.deucher@xxxxxxx>
Cc: Luben Tuikov <ltuikov89@xxxxxxxxx>
Cc: Matthew Brost <matthew.brost@xxxxxxxxx>
Cc: David Airlie <airlied@xxxxxxxxx>
Cc: Daniel Vetter <daniel@xxxxxxxx>
Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx
Cc: <stable@xxxxxxxxxxxxxxx> # v5.7+
---
 drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 58c8161289fe..ae8be30472cd 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 {
 	WARN_ON(!num_sched_list || !sched_list);
+	spin_lock(&entity->rq_lock);
 	entity->sched_list = sched_list;
 	entity->num_sched_list = num_sched_list;
+	spin_unlock(&entity->rq_lock);
 }
 EXPORT_SYMBOL(drm_sched_entity_modify_sched);





