On 2023-10-31 22:23, Danilo Krummrich wrote:
> Hi Luben,
>
[snip]
>>> @@ -187,12 +251,14 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>  /**
>>>   * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>   *
>>> + * @sched: the gpu scheduler
>>>   * @rq: scheduler run queue to check.
>>>   *
>>>   * Try to find a ready entity, returns NULL if none found.
>>>   */
>>>  static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
>>> +			      struct drm_sched_rq *rq)
>>>  {
>>>  	struct drm_sched_entity *entity;
>>>
>>> @@ -202,6 +268,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>  	if (entity) {
>>>  		list_for_each_entry_continue(entity, &rq->entities, list) {
>>>  			if (drm_sched_entity_is_ready(entity)) {
>>> +				/* If we can't queue yet, preserve the current
>>> +				 * entity in terms of fairness.
>>> +				 */
>>> +				if (!drm_sched_can_queue(sched, entity)) {
>>> +					spin_unlock(&rq->lock);
>>> +					return ERR_PTR(-ENOSPC);
>>> +				}
>>> +
>>>  				rq->current_entity = entity;
>>>  				reinit_completion(&entity->entity_idle);
>>>  				spin_unlock(&rq->lock);
>>> @@ -211,8 +285,15 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>  	}
>>>
>>>  	list_for_each_entry(entity, &rq->entities, list) {
>>> -
>>>  		if (drm_sched_entity_is_ready(entity)) {
>>> +			/* If we can't queue yet, preserve the current entity in
>>> +			 * terms of fairness.
>>> +			 */
>>> +			if (!drm_sched_can_queue(sched, entity)) {
>>> +				spin_unlock(&rq->lock);
>>> +				return ERR_PTR(-ENOSPC);
>>> +			}
>>> +
>>>  			rq->current_entity = entity;
>>>  			reinit_completion(&entity->entity_idle);
>>>  			spin_unlock(&rq->lock);
>>> @@ -231,12 +312,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>  /**
>>>   * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>>>   *
>>> + * @sched: the gpu scheduler
>>>   * @rq: scheduler run queue to check.
>>>   *
>>>   * Find oldest waiting ready entity, returns NULL if none found.
>>>   */
>>>  static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
>>> +				struct drm_sched_rq *rq)
>>>  {
>>>  	struct rb_node *rb;
>>>
>>> @@ -246,6 +329,14 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>
>>>  		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>>  		if (drm_sched_entity_is_ready(entity)) {
>>> +			/* If we can't queue yet, preserve the current entity in
>>> +			 * terms of fairness.
>>> +			 */
>>> +			if (!drm_sched_can_queue(sched, entity)) {
>>> +				spin_unlock(&rq->lock);
>>> +				return ERR_PTR(-ENOSPC);
>>> +			}
>>> +
>>
>> So, this kinda did this abrupt "return" in v2, then it was fixed fine in v3,
>> and now I see we return an error, which doesn't seem to be used
>> by anyone. In fact, in drm_sched_select_entity(), we do this,
>>
>> -	return entity;
>> +	return IS_ERR(entity) ? NULL : entity;
>>
>> So, it's perhaps best to leave this as it was in v3, and if in the future
>> we need to distinguish between the types of error, then that future patch
>> would do that and also show how this is used with new code and logic.
>>
>> There is little value in over-engineering this right now, given that in
>> the future, the logic may be more esoteric than we can think of.
>
> Ok, maybe what I do here is a little bit subtle and requires a comment. Let me
> explain.
>
> The reason I return an ERR_PTR() instead of NULL is to indicate to
> drm_sched_select_entity() to break out of the runqueue loop
> (drm_sched_select_entity() breaks the loop when the returned entity is not
> NULL).
>
> Without that, we'd continue the runqueue loop in drm_sched_select_entity() and
> retry with the next lower priority.
> This could lead to priority inversion of
> the runqueues, since a lower priority runqueue with jobs with fewer credits
> could stall a higher priority runqueue with jobs with more credits.
>
> Hence, in drm_sched_select_entity() we need to be able to distinguish between
> the case where drm_sched_rq_select_entity_fifo()/drm_sched_rq_select_entity_rr()
> can't find an entity and the case where they *can* find an entity, but its job
> doesn't fit on the ring.
>
> I think what makes it subtle in this patch is that in drm_sched_select_entity()
> the condition already fits with
>
>	if (entity)
>		break;
>
> because we want to break this loop when we actually found an entity, or when
> there is no space on the ring buffer, but we want to keep checking the other
> runqueues if entity is NULL.

Okay, that's fine, but please update the head comment of
drm_sched_rq_select_entity_{rr,fifo}() and of drm_sched_select_entity() to
explain what is being returned. This invariably adds to the documentation which
some want added to DRM--let's first all start documenting code which we add
ourselves.

I'd imagine this would look something like this, for each _{rr,fifo},
respectively (remove content within braces {}, x4):

/**
 * drm_sched_rq_select_entity_{LT: rr,fifo} - Select an entity which provides a job to run
 * @sched: the gpu scheduler
 * @rq: scheduler run queue to check.
 *
 * Try to find a ready entity, returns NULL if none found. {LT: RR}
 * Find oldest waiting ready entity, returns NULL if none found. {LT: FIFO}
 *
 * Return an entity if one is found; return an error-pointer (!NULL) if an
 * entity was ready, but the scheduler had insufficient credits to accommodate
 * its job; return NULL if no ready entity was found. {LT: for both RR and FIFO.}
 */

And,

/**
 * drm_sched_select_entity - Select the next entity to process
 * @sched: the scheduler instance
 *
 * Return an entity to process or NULL if none are found.
 *
 * Note that we break out of the for-loop when "entity"
 * is non-NULL, which can also be an error-pointer--this assures
 * we don't process lower priority run-queues. See comments
 * in the respectively called functions.
 */

[snip]
>>> +
>>> +	/**
>>> +	 * @update_job_credits: Called once the scheduler is considering this
>>> +	 * job for execution.
>>
>> "once" --> "when", a close cousin of "once", but clearer in code comments.
>> It is in fact called as many times as the job is considered to be pushed down
>> to hardware, if we couldn't do so previously.
>
> Sure, gonna change that.
>
>>
>>> +	 *
>>> +	 * This callback returns the number of credits this job would take if
>>
>> Too many repetitions of "this". Instead of "this job", say "the job".
>
> Gonna fix.

Great. Could you please rebase your patch on top of drm-misc-next--Matthew's
Xe changes went in this afternoon.

Thanks!
--
Regards,
Luben