Re: [PATCH] drm/sced: Add FIFO policy for scheduler rq

Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> · Thu, 25 Aug 2022 09:49:27 -0400

On 2022-08-24 22:29, Luben Tuikov wrote:
Inlined:

On 2022-08-24 12:21, Andrey Grodzovsky wrote:
On 2022-08-23 17:37, Luben Tuikov wrote:
On 2022-08-23 14:57, Andrey Grodzovsky wrote:
On 2022-08-23 14:30, Luben Tuikov wrote:

On 2022-08-23 14:13, Andrey Grodzovsky wrote:
On 2022-08-23 12:58, Luben Tuikov wrote:
Inlined:

On 2022-08-22 16:09, Andrey Grodzovsky wrote:
Poblem: Given many entities competing for same rq on
^Problem

same scheduler an uncceptabliy long wait time for some
^unacceptably

jobs waiting stuck in rq before being picked up are
observed (seen using  GPUVis).
The issue is due to Round Robin policy used by scheduler
to pick up the next entity for execution. Under stress
of many entities and long job queus within entity some
^queues

jobs could be stack for very long time in it's entity's
queue before being popped from the queue and executed
while for other entites with samller job queues a job
^entities; smaller

might execute ealier even though that job arrived later
^earlier

then the job in the long queue.

Fix:
Add FIFO selection policy to entites in RQ, chose next enitity
on rq in such order that if job on one entity arrived
ealrier then job on another entity the first job will start
executing ealier regardless of the length of the entity's job
queue.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
Tested-by: Li Yunxiang (Teddy) <Yunxiang.Li@xxxxxxx>
---
     drivers/gpu/drm/scheduler/sched_entity.c |  2 +
     drivers/gpu/drm/scheduler/sched_main.c   | 65 ++++++++++++++++++++++--
     include/drm/gpu_scheduler.h              |  8 +++
     3 files changed, 71 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 6b25b2f4f5a3..3bb7f69306ef 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -507,6 +507,8 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
     	atomic_inc(entity->rq->sched->score);
     	WRITE_ONCE(entity->last_user, current->group_leader);
     	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
+	sched_job->submit_ts = ktime_get();
+
     
     	/* first job wakes up scheduler */
     	if (first) {
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 68317d3a7a27..c123aa120d06 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -59,6 +59,19 @@
     #define CREATE_TRACE_POINTS
     #include "gpu_scheduler_trace.h"
     
+
+
+int drm_sched_policy = -1;
+
+/**
+ * DOC: sched_policy (int)
+ * Used to override default entites scheduling policy in a run queue.
+ */
+MODULE_PARM_DESC(sched_policy,
+		"specify schedule policy for entites on a runqueue (-1 = auto(default) value, 0 = Round Robin,1  = use FIFO");
+module_param_named(sched_policy, drm_sched_policy, int, 0444);
As per Christian's comments, you can drop the "auto" and perhaps leave one as the default,
say the RR.

I do think it is beneficial to have a module parameter control the scheduling policy, as shown above.
Christian is not against it, just against adding 'auto' here - like the
default.
Exactly what I said.

Also, I still think an O(1) scheduling (picking next to run) should be
what we strive for in such a FIFO patch implementation.
A FIFO mechanism is by it's nature an O(1) mechanism for picking the next
element.

Regards,
Luben
The only solution i see for this now is keeping a global per rq jobs
list parallel to SPCP queue per entity - we use this list when we switch
to FIFO scheduling, we can even start building  it ONLY when we switch
to FIFO building it gradually as more jobs come. Do you have other solution
in mind ?
The idea is to "sort" on insertion, not on picking the next one to run.

cont'd below:

Andrey

+
+
     #define to_drm_sched_job(sched_job)		\
     		container_of((sched_job), struct drm_sched_job, queue_node)
     
@@ -120,14 +133,16 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
     }
     
     /**
- * drm_sched_rq_select_entity - Select an entity which could provide a job to run
+ * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
      *
      * @rq: scheduler run queue to check.
      *
- * Try to find a ready entity, returns NULL if none found.
+ * Try to find a ready entity, in round robin manner.
+ *
+ * Returns NULL if none found.
      */
     static struct drm_sched_entity *
-drm_sched_rq_select_entity(struct drm_sched_rq *rq)
+drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
     {
     	struct drm_sched_entity *entity;
     
@@ -163,6 +178,45 @@ drm_sched_rq_select_entity(struct drm_sched_rq *rq)
     	return NULL;
     }
     
+/**
+ * drm_sched_rq_select_entity_fifo - Select an entity which could provide a job to run
+ *
+ * @rq: scheduler run queue to check.
+ *
+ * Try to find a ready entity, based on FIFO order of jobs arrivals.
+ *
+ * Returns NULL if none found.
+ */
+static struct drm_sched_entity *
+drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
+{
+	struct drm_sched_entity *tmp, *entity = NULL;
+	ktime_t oldest_ts = KTIME_MAX;
+	struct drm_sched_job *sched_job;
+
+	spin_lock(&rq->lock);
+
+	list_for_each_entry(tmp, &rq->entities, list) {
+
+		if (drm_sched_entity_is_ready(tmp)) {
+			sched_job = to_drm_sched_job(spsc_queue_peek(&tmp->job_queue));
+
+			if (ktime_before(sched_job->submit_ts, oldest_ts)) {
+				oldest_ts = sched_job->submit_ts;
+				entity = tmp;
+			}
+		}
+	}
Here I think we need an O(1) lookup of the next job to pick out to run.
I see a number of optimizations, for instance keeping the current/oldest
timestamp in the rq struct itself,
This was my original design with rb tree based min heap per rq based on
time stamp of
the oldest job waiting in each entity's job queue (head of entity's SPCP
job queue). But how in this
case you record the timestamps of all the jobs waiting in entity's the
SPCP queue behind the head job ?
If you record only the oldest job and more jobs come you have no place
to store their timestamps and so
on next job select after current HEAD how you will know who came before
or after between 2 job queues of
of 2 entities ?


or better yet keeping the next job
to pick out to run at a head of list (a la timer wheel implementation).
For suck an optimization to work, you'd prep things up on job insertion, rather
than on job removal/pick to run.
I looked at timer wheel and I don't see how this applies here - it
assumes you know before
job submission your deadline time for job's execution to start - which
we don't so I don't see
how we can use it. This seems more suitable for real time scheduler
implementation where you
have a hard requirement to execute a job by some specific time T.
In a timer wheel you instantly know the "soonest" job to run--it's naturally
your "next" job, regardless of in what order the timers were added and what
their timeout time is.

I also mentioned a list, obviously there is the brute force solution of
just ordering all jobs in one giant list and get
naturally a FIFO ordering this way with O(1) insertions and extractions
but I assume we don't want one giant jobs queue
for all the entities to avoid more dependeies between them (like locking
the entire list when one specific entity is killed and
needs to remove it's jobs from SW queue).
You can also have a list of list pointers. It'd be trivial to remove a whole
list from the main list, by simply removing an element--akin to locking out a rq,
or should you need to edit the rq's entity list.

So you do mean some kind of FIFO list. I really would want to avoid
maintaining an
extra data structure, we already have jobs stored in entity SPSC queue,
and now we will have to add
to a job struct  a pointer to another,  rq wide FIFO list, seems to me
like a recipe for problems.
Maybe.

Thinking more about it, if I do let each job have it's original
submission timestamp stored I can go back to my original design
of storing sched entities themself in min heap structure based on
timestamp of the next job to run in the entity
(head of job list), this way we get O(1) extraction of next job to run
and it will still be FIFO. The cost will be O(log(# entites in rq))
for updating the min heap on each job extraction based on next head
timestamp. Still better then linear. On the downside we need to maintain
an rb tree structure
O(1) would be ideal, and the "sorting" can be had when the jobs are submitted,
which could take O(log N), as in sorting.

But O(log N) on a balanced tree, such as you're using an R-B tree for a min heap,
is a great metric and way faster than O(N).

to store the entitles in parallel to holding them in linear list for
round robin as it's tpday.
You could choose to store one way or another, exclusively, depending on what
scheduling policy had been chosen.

I am attaching my original patch to give sense of
it with TODO section were I added this new code. There are actually some
corner cases with empty SPCP queue becoming full and sampling not the
head of the
queue but the next one is what we need but i think it's solvable.
Your WIP patch looks good.

In general it seems to me that before doing more complicated design we
actually need to measure and see if there is really a substantial
performance hit compared to current RR
or even  compared to possible O(1) extraction solution. No point to
complicate design if we don't get significant performance improvement
from it.
WLOG, assuming all jobs are ready to be executed, then, for graphics, I'd imagine
we'd want to pick the next job to run in O(1) time complexity. However O(log N) is
still good and preferable. O(N) is slow. FWIW, the RR we have right now is O(1) given
the job readiness assumption above.


You convinced it, going to try to rework my original patch. I asked 
Yunxiang to measure
performance delta anyway, just in case this rework get stuck and then we 
will just use
the primitive linear search for the client needs at least.

Andrey




I'm also surprised that there is no job transition from one queue to another,
as it is picked to run next--for the above optimizations to implemented, you'd
want a state transition from (state) queue to queue.
Not sure what you have in mind here ? How this helps ?
I think I've explained this a few times now--each list represents a state and a job/entity
travels through lists as it travels through states, which states more or less represent
the state of execution in the hardware--it could be as simple as incoming --> pending --> done.

It allows a finer grain when resetting the hardware (should the hardware allow it).

Note that this isn't directly related to the O(1) mechanism I brought up here. As I said, I was surprised
to find out none such distinction existed--that's all. Don't fixate on this.

Regards,
Luben

Right, so this one is a separate  topic regarding the pending job list
refactoring which we need to discuss and fix separately - i have TODO to
come back
to your patches from a year ago or so which were addressing this.
Probably will try to get to this after finishing this work.
Excellent!

Regards,
Luben

Andrey


Andrey


Regards,
Luben

In my origianl design

+
+	if (entity) {
+		rq->current_entity = entity;
+		reinit_completion(&entity->entity_idle);
+	}
+
+	spin_unlock(&rq->lock);
+	return entity;
+}
+
     /**
      * drm_sched_job_done - complete a job
      * @s_job: pointer to the job which is done
@@ -804,7 +858,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
     
     	/* Kernel run queue has higher priority than normal run queue*/
     	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
-		entity = drm_sched_rq_select_entity(&sched->sched_rq[i]);
+		entity = drm_sched_policy != 1 ?
+				drm_sched_rq_select_entity_rr(&sched->sched_rq[i]) :
+				drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]);
+
     		if (entity)
     			break;
     	}
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index addb135eeea6..95865881bfcf 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -314,6 +314,14 @@ struct drm_sched_job {
     
     	/** @last_dependency: tracks @dependencies as they signal */
     	unsigned long			last_dependency;
+
+       /**
+	* @submit_ts:
+	*
+	* Marks job submit time
+	*/
+       ktime_t				submit_ts;
+
     };
     
     static inline bool drm_sched_invalidate_job(struct drm_sched_job *s_job,
Regards,
Regards,