So, I started fixing this, including the bug taking the next element as an entity, but it could be actually the list_head... a la your patch being fixed, and then went down the rabbit whole of also fixing drm_sched_rq_select_entity(), but the problem is that at that point we don't know if we should start from the _next_ entity (as it is currently the case) or from the current entity (a la list_for_each_entry_from()) as it would be the case with this patch (if it were fixed for the list_head bug). But the problem is that elsewhere in the GPU scheduler (sched_entity.c), we do want to start from rq->current_entity->next, and picking "next" in drm_sched_rq_remove_entity(), would then skip an entity, or be biased for an entity twice. This is why this function is called drm_sched_rq_remove_entity() and not drm_sched_rq_next_entity_or_null(). So all this work seemed moot, given that we've already switched to FIFO-based scheduling in drm-misc-next, and so I didn't see a point in developing this further at this point (it's been working alright)--we're going with FIFO-based scheduling. Regards, Luben On 2022-10-27 05:08, Christian König wrote: > Am 27.10.22 um 11:00 schrieb broler Liew: >> It's very nice of you-all to finger it out that it may crash when it is the last entity in the list. Absolutely I made a mistake about that. >> But I still cannot understand why we need to restart the selection from the list head when the current entity is removed from rq. >> In drm_sched_rq_select_entity, starting from head may cause the first entity to be selected more often than others, which breaks the equal rule the scheduler wants to achieve. >> Maybe the previous one is the better choice when current_entity == entity? > > That's a good argument, but we want to get rid of the round robin algorithm anyway and switch over to the fifo. > > So this is some code which is already not used by default any more and improving it doesn't make much sense. > > Regards, > Christian. > >> >> Luben Tuikov <luben.tuikov@xxxxxxx> 于2022年10月27日周四 16:24写道: >> >> On 2022-10-27 04:19, Christian König wrote: >> > Am 27.10.22 um 10:07 schrieb Luben Tuikov: >> >> On 2022-10-27 03:01, Luben Tuikov wrote: >> >>> On 2022-10-25 13:50, Luben Tuikov wrote: >> >>>> Looking... >> >>>> >> >>>> Regards, >> >>>> Luben >> >>>> >> >>>> On 2022-10-25 09:35, Alex Deucher wrote: >> >>>>> + Luben >> >>>>> >> >>>>> On Tue, Oct 25, 2022 at 2:55 AM brolerliew <brolerliew@xxxxxxxxx> wrote: >> >>>>>> When entity move from one rq to another, current_entity will be set to NULL >> >>>>>> if it is the moving entity. This make entities close to rq head got >> >>>>>> selected more frequently, especially when doing load balance between >> >>>>>> multiple drm_gpu_scheduler. >> >>>>>> >> >>>>>> Make current_entity to next when removing from rq. >> >>>>>> >> >>>>>> Signed-off-by: brolerliew <brolerliew@xxxxxxxxx> >> >>>>>> --- >> >>>>>> drivers/gpu/drm/scheduler/sched_main.c | 5 +++-- >> >>>>>> 1 file changed, 3 insertions(+), 2 deletions(-) >> >>>>>> >> >>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >> >>>>>> index 2fab218d7082..00b22cc50f08 100644 >> >>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c >> >>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >> >>>>>> @@ -168,10 +168,11 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq, >> >>>>>> spin_lock(&rq->lock); >> >>>>>> >> >>>>>> atomic_dec(rq->sched->score); >> >>>>>> - list_del_init(&entity->list); >> >>>>>> >> >>>>>> if (rq->current_entity == entity) >> >>>>>> - rq->current_entity = NULL; >> >>>>>> + rq->current_entity = list_next_entry(entity, list); >> >>>>>> + >> >>>>>> + list_del_init(&entity->list); >> >>>>>> >> >>>>>> if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) >> >>>>>> drm_sched_rq_remove_fifo_locked(entity); >> >>>>>> -- >> >>>>>> 2.34.1 >> >>>>>> >> >>> Looks good. I'll pick it up into some other changes I've in tow, and repost >> >>> along with my changes, as they're somewhat related. >> >> Actually, the more I look at it, the more I think that we do want to set >> >> rq->current_entity to NULL in that function, in order to pick the next best entity >> >> (or scheduler for that matter), the next time around. See sched_entity.c, >> >> and drm_sched_rq_select_entity() where we start evaluating from the _next_ >> >> entity. >> >> >> >> So, it is best to leave it to set it to NULL, for now. >> > >> > Apart from that this patch here could cause a crash when the entity is >> > the last one in the list. >> > >> > In this case current current_entity would be set to an incorrect upcast >> > of the head of the list. >> >> Absolutely. I saw that, but in rejecting the patch, I didn't feel the need to mention it. >> >> Thanks for looking into this. >> >> Regards, >> Luben >> >