Luben, just a ping, whenever you have time.
Andrey
On 2022-09-05 01:57, Christian König wrote:
Am 03.09.22 um 04:48 schrieb Andrey Grodzovsky:
Problem: Given many entities competing for the same rq on
the same scheduler, unacceptably long wait times are
observed (seen using GPUVis) for some jobs stuck in the rq
before being picked up.
The issue is due to the Round Robin policy used by the scheduler
to pick the next entity for execution. Under stress
of many entities and long job queues within an entity, some
jobs could be stuck for a very long time in their entity's
queue before being popped from the queue and executed,
while for other entities with smaller job queues a job
might execute earlier even though that job arrived later
than the job in the long queue.
Fix:
Add a FIFO selection policy for entities in the RQ: choose the next
entity on the rq in such an order that if a job on one entity arrived
earlier than a job on another entity, the first job will start
executing earlier regardless of the length of the entity's job
queue.
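
To illustrate the difference between the two policies, here is a minimal
user-space sketch. It is not the scheduler code from this patch: the
toy_* names and the struct layout are made up for illustration, and the
FIFO pick uses a plain linear scan for brevity rather than an ordered
structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an entity: submit timestamps of its queued
 * jobs, oldest first. Not the real struct drm_sched_entity. */
struct toy_entity {
	const uint64_t *submit_ts; /* one timestamp per queued job */
	size_t head;               /* index of the next job to pop */
	size_t count;              /* total number of jobs */
};

int toy_entity_has_job(const struct toy_entity *e)
{
	return e->head < e->count;
}

/* Pop the oldest job of an entity and return its submit timestamp. */
uint64_t toy_pop(struct toy_entity *e)
{
	assert(toy_entity_has_job(e));
	return e->submit_ts[e->head++];
}

/* Round robin: starting at *cursor, return the index of the first
 * entity that has a job, then advance the cursor past it.
 * Returns -1 if no entity has a job. */
int toy_pick_rr(const struct toy_entity *ents, size_t n, size_t *cursor)
{
	for (size_t i = 0; i < n; i++) {
		size_t idx = (*cursor + i) % n;

		if (toy_entity_has_job(&ents[idx])) {
			*cursor = (idx + 1) % n;
			return (int)idx;
		}
	}
	return -1;
}

/* FIFO: return the index of the entity whose oldest waiting job has
 * the smallest submit timestamp, or -1 if no entity has a job. */
int toy_pick_fifo(const struct toy_entity *ents, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (!toy_entity_has_job(&ents[i]))
			continue;
		if (best < 0 ||
		    ents[i].submit_ts[ents[i].head] <
		    ents[best].submit_ts[ents[best].head])
			best = (int)i;
	}
	return best;
}
```

With entity A holding jobs submitted at times 1, 2, 3 and entity B
holding one job submitted at time 4, round robin runs B's job second,
ahead of A's jobs from times 2 and 3, while the FIFO pick drains A
first and runs the jobs in arrival order.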
v2:
Switch to an rb tree structure for entities, keyed on the TS of the
oldest job waiting in each entity's job queue. Improves next
entity extraction to O(1). Entity TS update is
O(log(number of entities in rq)).
Drop default option in module control parameter.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
Tested-by: Li Yunxiang (Teddy) <Yunxiang.Li@xxxxxxx>
[SNIP]
/**
@@ -313,6 +330,14 @@ struct drm_sched_job {
/** @last_dependency: tracks @dependencies as they signal */
unsigned long last_dependency;
+
+
+ /**
+ * @submit_ts:
+ *
+ * Marks job submit time
Maybe write something like "When the job was pushed into the entity
queue."
Apart from that I leave it to Luben and you to get this stuff upstream.
Thanks,
Christian.
+ */
+ ktime_t submit_ts;
};
static inline bool drm_sched_invalidate_job(struct drm_sched_job
*s_job,
@@ -501,6 +526,10 @@ void drm_sched_rq_add_entity(struct drm_sched_rq
*rq,
void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
struct drm_sched_entity *entity);
+void drm_sched_rq_update_fifo(struct drm_sched_entity *entity,
ktime_t ts,
+ bool remove_only);
+
+
int drm_sched_entity_init(struct drm_sched_entity *entity,
enum drm_sched_priority priority,
struct drm_gpu_scheduler **sched_list,