On 2022-08-23 14:57, Andrey Grodzovsky wrote:
> On 2022-08-23 14:30, Luben Tuikov wrote:
>
>>
>> On 2022-08-23 14:13, Andrey Grodzovsky wrote:
>>> On 2022-08-23 12:58, Luben Tuikov wrote:
>>>> Inlined:
>>>>
>>>> On 2022-08-22 16:09, Andrey Grodzovsky wrote:
>>>>> Poblem: Given many entities competing for same rq on
>>>> ^Problem
>>>>
>>>>> same scheduler an uncceptabliy long wait time for some
>>>> ^unacceptably
>>>>
>>>>> jobs waiting stuck in rq before being picked up are
>>>>> observed (seen using GPUVis).
>>>>> The issue is due to Round Robin policy used by scheduler
>>>>> to pick up the next entity for execution. Under stress
>>>>> of many entities and long job queus within entity some
>>>> ^queues
>>>>
>>>>> jobs could be stack for very long time in it's entity's
>>>>> queue before being popped from the queue and executed
>>>>> while for other entites with samller job queues a job
>>>> ^entities; smaller
>>>>
>>>>> might execute ealier even though that job arrived later
>>>> ^earlier
>>>>
>>>>> then the job in the long queue.
>>>>>
>>>>> Fix:
>>>>> Add FIFO selection policy to entites in RQ, chose next enitity
>>>>> on rq in such order that if job on one entity arrived
>>>>> ealrier then job on another entity the first job will start
>>>>> executing ealier regardless of the length of the entity's job
>>>>> queue.
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
>>>>> Tested-by: Li Yunxiang (Teddy) <Yunxiang.Li@xxxxxxx>
>>>>> ---
>>>>>   drivers/gpu/drm/scheduler/sched_entity.c |  2 +
>>>>>   drivers/gpu/drm/scheduler/sched_main.c   | 65 ++++++++++++++++++++++--
>>>>>   include/drm/gpu_scheduler.h              |  8 +++
>>>>>   3 files changed, 71 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index 6b25b2f4f5a3..3bb7f69306ef 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -507,6 +507,8 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>>>>   	atomic_inc(entity->rq->sched->score);
>>>>>   	WRITE_ONCE(entity->last_user, current->group_leader);
>>>>>   	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
>>>>> +	sched_job->submit_ts = ktime_get();
>>>>> +
>>>>>
>>>>>   	/* first job wakes up scheduler */
>>>>>   	if (first) {
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 68317d3a7a27..c123aa120d06 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -59,6 +59,19 @@
>>>>>   #define CREATE_TRACE_POINTS
>>>>>   #include "gpu_scheduler_trace.h"
>>>>>
>>>>> +
>>>>> +
>>>>> +int drm_sched_policy = -1;
>>>>> +
>>>>> +/**
>>>>> + * DOC: sched_policy (int)
>>>>> + * Used to override default entites scheduling policy in a run queue.
>>>>> + */
>>>>> +MODULE_PARM_DESC(sched_policy,
>>>>> +	"specify schedule policy for entites on a runqueue (-1 = auto(default) value, 0 = Round Robin,1 = use FIFO");
>>>>> +module_param_named(sched_policy, drm_sched_policy, int, 0444);
>>>> As per Christian's comments, you can drop the "auto" and perhaps leave one as the default,
>>>> say the RR.
>>>>
>>>> I do think it is beneficial to have a module parameter control the scheduling policy, as shown above.
>>>
>>> Christian is not against it, just against adding 'auto' here - like the
>>> default.
>> Exactly what I said.
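
For reference, a minimal sketch of what the module parameter could look like
with the "auto" value dropped and Round Robin kept as the default, as suggested
above. The DRM_SCHED_POLICY_* names are illustrative only and are not part of
the posted patch:

#include <linux/module.h>

/* Illustrative names only -- these constants are not from the patch. */
#define DRM_SCHED_POLICY_RR	0
#define DRM_SCHED_POLICY_FIFO	1

static int drm_sched_policy = DRM_SCHED_POLICY_RR;

/**
 * DOC: sched_policy (int)
 * Override the default scheduling policy for entities on a run queue.
 */
MODULE_PARM_DESC(sched_policy,
		 "Entity scheduling policy on a run queue (0 = Round Robin (default), 1 = FIFO)");
module_param_named(sched_policy, drm_sched_policy, int, 0444);
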
>>
>> Also, I still think an O(1) scheduling (picking next to run) should be
>> what we strive for in such a FIFO patch implementation.
>> A FIFO mechanism is by its nature an O(1) mechanism for picking the next
>> element.
>>
>> Regards,
>> Luben
>
>
> The only solution I see for this now is keeping a global per-rq jobs
> list parallel to the SPSC queue per entity - we use this list when we
> switch to FIFO scheduling; we could even start building it ONLY when we
> switch to FIFO, building it gradually as more jobs come in. Do you have
> another solution in mind?

The idea is to "sort" on insertion, not on picking the next one to run
(a rough sketch of this direction is appended at the end of this message).

cont'd below:

>
> Andrey
>
>>
>>>
>>>>> +
>>>>> +
>>>>>   #define to_drm_sched_job(sched_job) \
>>>>>   		container_of((sched_job), struct drm_sched_job, queue_node)
>>>>>
>>>>> @@ -120,14 +133,16 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>>>   }
>>>>>
>>>>>   /**
>>>>> - * drm_sched_rq_select_entity - Select an entity which could provide a job to run
>>>>> + * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>>>    *
>>>>>    * @rq: scheduler run queue to check.
>>>>>    *
>>>>> - * Try to find a ready entity, returns NULL if none found.
>>>>> + * Try to find a ready entity, in round robin manner.
>>>>> + *
>>>>> + * Returns NULL if none found.
>>>>>    */
>>>>>   static struct drm_sched_entity *
>>>>> -drm_sched_rq_select_entity(struct drm_sched_rq *rq)
>>>>> +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>   {
>>>>>   	struct drm_sched_entity *entity;
>>>>>
>>>>> @@ -163,6 +178,45 @@ drm_sched_rq_select_entity(struct drm_sched_rq *rq)
>>>>>   	return NULL;
>>>>>   }
>>>>>
>>>>> +/**
>>>>> + * drm_sched_rq_select_entity_fifo - Select an entity which could provide a job to run
>>>>> + *
>>>>> + * @rq: scheduler run queue to check.
>>>>> + *
>>>>> + * Try to find a ready entity, based on FIFO order of jobs arrivals.
>>>>> + *
>>>>> + * Returns NULL if none found.
>>>>> + */
>>>>> +static struct drm_sched_entity *
>>>>> +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>>> +{
>>>>> +	struct drm_sched_entity *tmp, *entity = NULL;
>>>>> +	ktime_t oldest_ts = KTIME_MAX;
>>>>> +	struct drm_sched_job *sched_job;
>>>>> +
>>>>> +	spin_lock(&rq->lock);
>>>>> +
>>>>> +	list_for_each_entry(tmp, &rq->entities, list) {
>>>>> +
>>>>> +		if (drm_sched_entity_is_ready(tmp)) {
>>>>> +			sched_job = to_drm_sched_job(spsc_queue_peek(&tmp->job_queue));
>>>>> +
>>>>> +			if (ktime_before(sched_job->submit_ts, oldest_ts)) {
>>>>> +				oldest_ts = sched_job->submit_ts;
>>>>> +				entity = tmp;
>>>>> +			}
>>>>> +		}
>>>>> +	}
>>>> Here I think we need an O(1) lookup of the next job to pick out to run.
>>>> I see a number of optimizations, for instance keeping the current/oldest
>>>> timestamp in the rq struct itself,
>>>
>>> This was my original design: an rb-tree-based min heap per rq, keyed on the
>>> timestamp of the oldest job waiting in each entity's job queue (the head of
>>> the entity's SPSC job queue). But how, in this case, do you record the
>>> timestamps of all the jobs waiting in the entity's SPSC queue behind the
>>> head job? If you record only the oldest job and more jobs come in, you have
>>> no place to store their timestamps, so on the next selection after the
>>> current HEAD, how will you know which of the two entities' job queues came
>>> first?
>>>
>>>
>>>> or better yet keeping the next job
>>>> to pick out to run at a head of list (a la timer wheel implementation).
>>>> For such an optimization to work, you'd prep things up on job insertion,
>>>> rather than on job removal/pick to run.
>>>
>>> I looked at the timer wheel and I don't see how it applies here - it
>>> assumes you know, before job submission, the deadline by which the job's
>>> execution must start - which we don't - so I don't see how we can use it.
>>> This seems more suitable for a real-time scheduler implementation where
>>> you have a hard requirement to execute a job by some specific time T.

In a timer wheel you instantly know the "soonest" job to run--it's naturally
your "next" job, regardless of in what order the timers were added and what
their timeout time is.

>>>
>>> I also mentioned a list; obviously there is the brute-force solution of
>>> just ordering all jobs in one giant list, which naturally gives a FIFO
>>> ordering with O(1) insertions and extractions, but I assume we don't want
>>> one giant job queue for all the entities, to avoid more dependencies
>>> between them (like locking the entire list when one specific entity is
>>> killed and needs to remove its jobs from the SW queue).

You can also have a list of list pointers. It'd be trivial to remove a whole
list from the main list, by simply removing an element--akin to locking out
an rq, or should you need to edit the rq's entity list.

>>>
>>>> I'm also surprised that there is no job transition from one queue to another,
>>>> as it is picked to run next--for the above optimizations to be implemented,
>>>> you'd want a state transition from (state) queue to queue.
>>>
>>> Not sure what you have in mind here? How does this help?

I think I've explained this a few times now--each list represents a state and
a job/entity travels through lists as it travels through states, which states
more or less represent the state of execution in the hardware--it could be as
simple as incoming --> pending --> done (see the toy illustration appended at
the end of this message). It allows finer-grained control when resetting the
hardware (should the hardware allow it).

Note that this isn't directly related to the O(1) mechanism I brought up here.
As I said, I was surprised to find out no such distinction existed--that's all.
Don't fixate on this.

Regards,
Luben

>>>
>>> Andrey
>>>
>>>
>>>> Regards,
>>>> Luben
>>>>
>>> In my original design
>>>
>>>>> +
>>>>> +	if (entity) {
>>>>> +		rq->current_entity = entity;
>>>>> +		reinit_completion(&entity->entity_idle);
>>>>> +	}
>>>>> +
>>>>> +	spin_unlock(&rq->lock);
>>>>> +	return entity;
>>>>> +}
>>>>> +
>>>>>   /**
>>>>>    * drm_sched_job_done - complete a job
>>>>>    * @s_job: pointer to the job which is done
>>>>> @@ -804,7 +858,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>>>
>>>>>   	/* Kernel run queue has higher priority than normal run queue*/
>>>>>   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>>> -		entity = drm_sched_rq_select_entity(&sched->sched_rq[i]);
>>>>> +		entity = drm_sched_policy != 1 ?
>>>>> +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]) :
>>>>> +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]);
>>>>> +
>>>>>   		if (entity)
>>>>>   			break;
>>>>>   	}
>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>>> index addb135eeea6..95865881bfcf 100644
>>>>> --- a/include/drm/gpu_scheduler.h
>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>> @@ -314,6 +314,14 @@ struct drm_sched_job {
>>>>>
>>>>>   	/** @last_dependency: tracks @dependencies as they signal */
>>>>>   	unsigned long last_dependency;
>>>>> +
>>>>> +	/**
>>>>> +	 * @submit_ts:
>>>>> +	 *
>>>>> +	 * Marks job submit time
>>>>> +	 */
>>>>> +	ktime_t submit_ts;
>>>>> +
>>>>>   };
>>>>>
>>>>>   static inline bool drm_sched_invalidate_job(struct drm_sched_job *s_job,
>> Regards,

Regards,
-- 
Luben
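
Appended: a rough sketch of the "sort on insertion" direction discussed above,
assuming a per-rq rb-tree keyed on the submit timestamp of each ready entity's
head job, so that picking the next entity is a peek at the cached leftmost
node. All sketch_* names are hypothetical and none of this is code from the
patch; it only illustrates the idea.

#include <linux/ktime.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>

/*
 * Hypothetical sketch, not from the patch: ready entities live in a per-rq
 * rb-tree ordered by the submit timestamp of the oldest job in their queue,
 * so the next entity to run is simply the cached leftmost node.
 */
struct sketch_entity {
	struct rb_node	rb_node;	/* assumed RB_CLEAR_NODE()-initialized */
	ktime_t		oldest_ts;	/* submit_ts of the entity's head job */
};

struct sketch_rq {
	spinlock_t		lock;
	struct rb_root_cached	entities;	/* ordered by oldest_ts */
};

static bool sketch_entity_less(struct rb_node *a, const struct rb_node *b)
{
	const struct sketch_entity *ea = rb_entry(a, struct sketch_entity, rb_node);
	const struct sketch_entity *eb = rb_entry(b, struct sketch_entity, rb_node);

	return ktime_before(ea->oldest_ts, eb->oldest_ts);
}

/* Re-key an entity whenever a job is pushed or its head job is popped. */
static void sketch_rq_update(struct sketch_rq *rq, struct sketch_entity *entity,
			     ktime_t head_ts)
{
	spin_lock(&rq->lock);
	if (!RB_EMPTY_NODE(&entity->rb_node))
		rb_erase_cached(&entity->rb_node, &rq->entities);
	entity->oldest_ts = head_ts;
	rb_add_cached(&entity->rb_node, &rq->entities, sketch_entity_less);
	spin_unlock(&rq->lock);
}

/* Picking the next entity then becomes an O(1) peek at the leftmost node. */
static struct sketch_entity *sketch_rq_peek(struct sketch_rq *rq)
{
	struct rb_node *first = rb_first_cached(&rq->entities);

	return first ? rb_entry(first, struct sketch_entity, rb_node) : NULL;
}

With something along these lines, the FIFO select path would reduce to a
leftmost peek instead of walking rq->entities, at the cost of a tree update on
each push/pop; the timestamps of jobs queued behind the head stay in the jobs
themselves (the submit_ts field from the patch) and become the key when the
head job is popped.
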
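Also appended: a toy illustration (again hypothetical, not from the patch) of
the per-state lists described above, where a job moves from list to list as its
execution state changes, mirroring the "incoming --> pending --> done" example.

#include <linux/list.h>
#include <linux/spinlock.h>

/* Hypothetical sketch, not from the patch. */
enum sketch_job_state {
	SKETCH_JOB_INCOMING,	/* pushed, not yet picked by the scheduler */
	SKETCH_JOB_PENDING,	/* handed to the hardware, not yet signaled */
	SKETCH_JOB_DONE,	/* fence signaled */
	SKETCH_JOB_NR_STATES,
};

struct sketch_job {
	struct list_head	node;	/* linked into exactly one state list */
	enum sketch_job_state	state;
};

struct sketch_sched {
	spinlock_t		lock;
	struct list_head	state_lists[SKETCH_JOB_NR_STATES];
};

static void sketch_sched_init(struct sketch_sched *sched)
{
	int i;

	spin_lock_init(&sched->lock);
	for (i = 0; i < SKETCH_JOB_NR_STATES; i++)
		INIT_LIST_HEAD(&sched->state_lists[i]);
}

/*
 * Move a job to the list representing its new state. Assumes the job was
 * first put on the INCOMING list with list_add_tail() when it was created.
 */
static void sketch_job_transition(struct sketch_sched *sched,
				  struct sketch_job *job,
				  enum sketch_job_state next)
{
	spin_lock(&sched->lock);
	list_move_tail(&job->node, &sched->state_lists[next]);
	job->state = next;
	spin_unlock(&sched->lock);
}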