On 2022-08-23 14:57, Andrey Grodzovsky wrote:
> On 2022-08-23 14:30, Luben Tuikov wrote:
>
>>
>> On 2022-08-23 14:13, Andrey Grodzovsky wrote:
>>> On 2022-08-23 12:58, Luben Tuikov wrote:
>>>> Inlined:
>>>>
>>>> On 2022-08-22 16:09, Andrey Grodzovsky wrote:
>>>>> Poblem: Given many entities competing for same rq on
>>>> ^Problem
>>>>
>>>>> same scheduler an uncceptabliy long wait time for some
>>>> ^unacceptably
>>>>
>>>>> jobs waiting stuck in rq before being picked up are
>>>>> observed (seen using GPUVis).
>>>>> The issue is due to Round Robin policy used by scheduler
>>>>> to pick up the next entity for execution. Under stress
>>>>> of many entities and long job queus within entity some
>>>> ^queues
>>>>
>>>>> jobs could be stack for very long time in it's entity's
>>>>> queue before being popped from the queue and executed
>>>>> while for other entites with samller job queues a job
>>>> ^entities; smaller
>>>>
>>>>> might execute ealier even though that job arrived later
>>>> ^earlier
>>>>
>>>>> then the job in the long queue.
>>>>>
>>>>> Fix:
>>>>> Add FIFO selection policy to entites in RQ, chose next enitity
>>>>> on rq in such order that if job on one entity arrived
>>>>> ealrier then job on another entity the first job will start
>>>>> executing ealier regardless of the length of the entity's job
>>>>> queue.
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
>>>>> Tested-by: Li Yunxiang (Teddy) <Yunxiang.Li@xxxxxxx>
>>>>> ---
>>>>>   drivers/gpu/drm/scheduler/sched_entity.c |  2 +
>>>>>   drivers/gpu/drm/scheduler/sched_main.c   | 65 ++++++++++++++++++++++--
>>>>>   include/drm/gpu_scheduler.h              |  8 +++
>>>>>   3 files changed, 71 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index 6b25b2f4f5a3..3bb7f69306ef 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -507,6 +507,8 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>>>>   	atomic_inc(entity->rq->sched->score);
>>>>>   	WRITE_ONCE(entity->last_user, current->group_leader);
>>>>>   	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
>>>>> +	sched_job->submit_ts = ktime_get();
>>>>> +
>>>>>
>>>>>   	/* first job wakes up scheduler */
>>>>>   	if (first) {
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 68317d3a7a27..c123aa120d06 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -59,6 +59,19 @@
>>>>>   #define CREATE_TRACE_POINTS
>>>>>   #include "gpu_scheduler_trace.h"
>>>>>
>>>>> +
>>>>> +
>>>>> +int drm_sched_policy = -1;
>>>>> +
>>>>> +/**
>>>>> + * DOC: sched_policy (int)
>>>>> + * Used to override default entites scheduling policy in a run queue.
>>>>> + */
>>>>> +MODULE_PARM_DESC(sched_policy,
>>>>> +	"specify schedule policy for entites on a runqueue (-1 = auto(default) value, 0 = Round Robin,1 = use FIFO");
>>>>> +module_param_named(sched_policy, drm_sched_policy, int, 0444);
>>>> As per Christian's comments, you can drop the "auto" and perhaps leave one as the default,
>>>> say the RR.
>>>>
>>>> I do think it is beneficial to have a module parameter control the scheduling policy, as shown above.
>>>
>>> Christian is not against it, just against adding 'auto' here - like the
>>> default.
>> Exactly what I said.
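
For reference, a minimal sketch of what the module parameter could look like
with the "auto" value dropped and Round Robin kept as the default, as suggested
above. The DRM_SCHED_POLICY_* names are illustrative only and are not part of
the posted patch:

#include <linux/module.h>

/* Illustrative names only -- these constants are not from the patch. */
#define DRM_SCHED_POLICY_RR	0
#define DRM_SCHED_POLICY_FIFO	1

static int drm_sched_policy = DRM_SCHED_POLICY_RR;

/**
 * DOC: sched_policy (int)
 * Override the default scheduling policy for entities on a run queue.
 */
MODULE_PARM_DESC(sched_policy,
		 "Entity scheduling policy on a run queue (0 = Round Robin (default), 1 = FIFO)");
module_param_named(sched_policy, drm_sched_policy, int, 0444);
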
>>
>> Also, I still think an O(1) scheduling (picking next to run) should be
>> what we strive for in such a FIFO patch implementation.
>> A FIFO mechanism is by its nature an O(1) mechanism for picking the next
>> element.
>>
>> Regards,
>> Luben
>
>
> The only solution I see for this now is keeping a global per-rq jobs
> list parallel to the SPSC queue per entity - we use this list when we
> switch to FIFO scheduling; we could even start building it ONLY when we
> switch to FIFO, building it gradually as more jobs come in. Do you have
> another solution in mind?

The idea is to "sort" on insertion, not on picking the next one to run
(a rough sketch of this direction is appended at the end of this message).

cont'd below:

>
> Andrey
>
>>
>>>
>>>>> +
>>>>> +
>>>>>   #define to_drm_sched_job(sched_job) \
>>>>>   		container_of((sched_job), struct drm_sched_job, queue_node)
>>>>>
>>>>> @@ -120,14 +133,16 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>>>   }
>>>>>
>>>>>   /**
>>>>> - * drm_sched_rq_select_entity - Select an entity which could provide a job to run
>>>>> + * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>>>    *
>>>>>    * @rq: scheduler run queue to check.
>>>>>    *
>>>>> - * Try to find a ready entity, returns NULL if none found.
>>>>> + * Try to find a ready entity, in round robin manner.
>>>>> + *
>>>>> + * Returns NULL if none found.
>>>>>    */
>>>>>   static struct drm_sched_entity *
>>>>> -drm_sched_rq_select_entity(struct drm_sched_rq *rq)
>>>>> +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>   {
>>>>>   	struct drm_sched_entity *entity;
>>>>>
>>>>> @@ -163,6 +178,45 @@ drm_sched_rq_select_entity(struct drm_sched_rq *rq)
>>>>>   	return NULL;
>>>>>   }
>>>>>
>>>>> +/**
>>>>> + * drm_sched_rq_select_entity_fifo - Select an entity which could provide a job to run
>>>>> + *
>>>>> + * @rq: scheduler run queue to check.
>>>>> + *
>>>>> + * Try to find a ready entity, based on FIFO order of jobs arrivals.
>>>>> + *
>>>>> + * Returns NULL if none found.
>>>>> + */
>>>>> +static struct drm_sched_entity *
>>>>> +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>>> +{
>>>>> +	struct drm_sched_entity *tmp, *entity = NULL;
>>>>> +	ktime_t oldest_ts = KTIME_MAX;
>>>>> +	struct drm_sched_job *sched_job;
>>>>> +
>>>>> +	spin_lock(&rq->lock);
>>>>> +
>>>>> +	list_for_each_entry(tmp, &rq->entities, list) {
>>>>> +
>>>>> +		if (drm_sched_entity_is_ready(tmp)) {
>>>>> +			sched_job = to_drm_sched_job(spsc_queue_peek(&tmp->job_queue));
>>>>> +
>>>>> +			if (ktime_before(sched_job->submit_ts, oldest_ts)) {
>>>>> +				oldest_ts = sched_job->submit_ts;
>>>>> +				entity = tmp;
>>>>> +			}
>>>>> +		}
>>>>> +	}
>>>> Here I think we need an O(1) lookup of the next job to pick out to run.
>>>> I see a number of optimizations, for instance keeping the current/oldest
>>>> timestamp in the rq struct itself,
>>>
>>> This was my original design: an rb-tree-based min heap per rq, keyed on the
>>> timestamp of the oldest job waiting in each entity's job queue (the head of
>>> the entity's SPSC job queue). But how, in this case, do you record the
>>> timestamps of all the jobs waiting in the entity's SPSC queue behind the
>>> head job? If you record only the oldest job and more jobs come in, you have
>>> no place to store their timestamps, so on the next selection after the
>>> current HEAD, how will you know which of the two entities' job queues came
>>> first?
>>>
>>>
>>>> or better yet keeping the next job
>>>> to pick out to run at a head of list (a la timer wheel implementation).
>>>> For such an optimization to work, you'd prep things up on job insertion,
>>>> rather than on job removal/pick to run.
>>>
>>> I looked at the timer wheel and I don't see how it applies here - it
>>> assumes you know, before job submission, the deadline by which the job's
>>> execution must start - which we don't - so I don't see how we can use it.
>>> This seems more suitable for a real-time scheduler implementation where
>>> you have a hard requirement to execute a job by some specific time T.

In a timer wheel you instantly know the "soonest" job to run--it's naturally
your "next" job, regardless of in what order the timers were added and what
their timeout time is.

>>>
>>> I also mentioned a list; obviously there is the brute-force solution of
>>> just ordering all jobs in one giant list, which naturally gives a FIFO
>>> ordering with O(1) insertions and extractions, but I assume we don't want
>>> one giant job queue for all the entities, to avoid more dependencies
>>> between them (like locking the entire list when one specific entity is
>>> killed and needs to remove its jobs from the SW queue).

You can also have a list of list pointers. It'd be trivial to remove a whole
list from the main list, by simply removing an element--akin to locking out
an rq, or should you need to edit the rq's entity list.

>>>
>>>> I'm also surprised that there is no job transition from one queue to another,
>>>> as it is picked to run next--for the above optimizations to be implemented,
>>>> you'd want a state transition from (state) queue to queue.
>>>
>>> Not sure what you have in mind here? How does this help?

I think I've explained this a few times now--each list represents a state and
a job/entity travels through lists as it travels through states, which states
more or less represent the state of execution in the hardware--it could be as
simple as incoming --> pending --> done (see the toy illustration appended at
the end of this message). It allows finer-grained control when resetting the
hardware (should the hardware allow it).

Note that this isn't directly related to the O(1) mechanism I brought up here.
As I said, I was surprised to find out no such distinction existed--that's all.
Don't fixate on this.

Regards,
Luben

>>>
>>> Andrey
>>>
>>>
>>>> Regards,
>>>> Luben
>>>>
>>> In my original design
>>>
>>>>> +
>>>>> +	if (entity) {
>>>>> +		rq->current_entity = entity;
>>>>> +		reinit_completion(&entity->entity_idle);
>>>>> +	}
>>>>> +
>>>>> +	spin_unlock(&rq->lock);
>>>>> +	return entity;
>>>>> +}
>>>>> +
>>>>>   /**
>>>>>    * drm_sched_job_done - complete a job
>>>>>    * @s_job: pointer to the job which is done
>>>>> @@ -804,7 +858,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>>>
>>>>>   	/* Kernel run queue has higher priority than normal run queue*/
>>>>>   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>>> -		entity = drm_sched_rq_select_entity(&sched->sched_rq[i]);
>>>>> +		entity = drm_sched_policy != 1 ?
>>>>> +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]) :
>>>>> +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]);
>>>>> +
>>>>>   		if (entity)
>>>>>   			break;
>>>>>   	}
>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>>> index addb135eeea6..95865881bfcf 100644
>>>>> --- a/include/drm/gpu_scheduler.h
>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>> @@ -314,6 +314,14 @@ struct drm_sched_job {
>>>>>
>>>>>   	/** @last_dependency: tracks @dependencies as they signal */
>>>>>   	unsigned long last_dependency;
>>>>> +
>>>>> +	/**
>>>>> +	 * @submit_ts:
>>>>> +	 *
>>>>> +	 * Marks job submit time
>>>>> +	 */
>>>>> +	ktime_t submit_ts;
>>>>> +
>>>>>   };
>>>>>
>>>>>   static inline bool drm_sched_invalidate_job(struct drm_sched_job *s_job,
>> Regards,

Regards,
-- 
Luben
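
Appended: a rough sketch of the "sort on insertion" direction discussed above,
assuming a per-rq rb-tree keyed on the submit timestamp of each ready entity's
head job, so that picking the next entity is a peek at the cached leftmost
node. All sketch_* names are hypothetical and none of this is code from the
patch; it only illustrates the idea.

#include <linux/ktime.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>

/*
 * Hypothetical sketch, not from the patch: ready entities live in a per-rq
 * rb-tree ordered by the submit timestamp of the oldest job in their queue,
 * so the next entity to run is simply the cached leftmost node.
 */
struct sketch_entity {
	struct rb_node	rb_node;	/* assumed RB_CLEAR_NODE()-initialized */
	ktime_t		oldest_ts;	/* submit_ts of the entity's head job */
};

struct sketch_rq {
	spinlock_t		lock;
	struct rb_root_cached	entities;	/* ordered by oldest_ts */
};

static bool sketch_entity_less(struct rb_node *a, const struct rb_node *b)
{
	const struct sketch_entity *ea = rb_entry(a, struct sketch_entity, rb_node);
	const struct sketch_entity *eb = rb_entry(b, struct sketch_entity, rb_node);

	return ktime_before(ea->oldest_ts, eb->oldest_ts);
}

/* Re-key an entity whenever a job is pushed or its head job is popped. */
static void sketch_rq_update(struct sketch_rq *rq, struct sketch_entity *entity,
			     ktime_t head_ts)
{
	spin_lock(&rq->lock);
	if (!RB_EMPTY_NODE(&entity->rb_node))
		rb_erase_cached(&entity->rb_node, &rq->entities);
	entity->oldest_ts = head_ts;
	rb_add_cached(&entity->rb_node, &rq->entities, sketch_entity_less);
	spin_unlock(&rq->lock);
}

/* Picking the next entity then becomes an O(1) peek at the leftmost node. */
static struct sketch_entity *sketch_rq_peek(struct sketch_rq *rq)
{
	struct rb_node *first = rb_first_cached(&rq->entities);

	return first ? rb_entry(first, struct sketch_entity, rb_node) : NULL;
}

With something along these lines, the FIFO select path would reduce to a
leftmost peek instead of walking rq->entities, at the cost of a tree update on
each push/pop; the timestamps of jobs queued behind the head stay in the jobs
themselves (the submit_ts field from the patch) and become the key when the
head job is popped.
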
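Also appended: a toy illustration (again hypothetical, not from the patch) of
the per-state lists described above, where a job moves from list to list as its
execution state changes, mirroring the "incoming --> pending --> done" example.

#include <linux/list.h>
#include <linux/spinlock.h>

/* Hypothetical sketch, not from the patch. */
enum sketch_job_state {
	SKETCH_JOB_INCOMING,	/* pushed, not yet picked by the scheduler */
	SKETCH_JOB_PENDING,	/* handed to the hardware, not yet signaled */
	SKETCH_JOB_DONE,	/* fence signaled */
	SKETCH_JOB_NR_STATES,
};

struct sketch_job {
	struct list_head	node;	/* linked into exactly one state list */
	enum sketch_job_state	state;
};

struct sketch_sched {
	spinlock_t		lock;
	struct list_head	state_lists[SKETCH_JOB_NR_STATES];
};

static void sketch_sched_init(struct sketch_sched *sched)
{
	int i;

	spin_lock_init(&sched->lock);
	for (i = 0; i < SKETCH_JOB_NR_STATES; i++)
		INIT_LIST_HEAD(&sched->state_lists[i]);
}

/*
 * Move a job to the list representing its new state. Assumes the job was
 * first put on the INCOMING list with list_add_tail() when it was created.
 */
static void sketch_job_transition(struct sketch_sched *sched,
				  struct sketch_job *job,
				  enum sketch_job_state next)
{
	spin_lock(&sched->lock);
	list_move_tail(&job->node, &sched->state_lists[next]);
	job->state = next;
	spin_unlock(&sched->lock);
}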