On 2022-08-24 04:29, Michel Dänzer wrote:
On 2022-08-22 22:09, Andrey Grodzovsky wrote:
Problem: Given many entities competing for the same rq on the
same scheduler, unacceptably long wait times are observed for
some jobs stuck in the rq before being picked up
(seen using GPUVis).
The issue is due to the Round Robin policy used by the scheduler
to pick the next entity for execution. Under stress
of many entities and long job queues within an entity, some
jobs could be stuck for a very long time in their entity's
queue before being popped from the queue and executed,
while for other entities with smaller job queues a job
might execute earlier even though that job arrived later
than the job in the long queue.
Fix:
Add a FIFO selection policy to entities in the RQ: choose the next entity
on the rq in such an order that if a job on one entity arrived
earlier than a job on another entity, the first job will start
executing earlier regardless of the length of the entity's job
queue.
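To make the difference between the two policies concrete, here is a toy model in Python. It is only a sketch and not the scheduler code: jobs are reduced to their submission timestamps, and the names round_robin_order and fifo_order are hypothetical helpers made up for this illustration.

```python
from collections import deque


def round_robin_order(queues):
    """Visit entities in turn, popping one job from each non-empty queue."""
    queues = {name: deque(jobs) for name, jobs in queues.items()}
    order = []
    while any(queues.values()):
        for name, q in queues.items():
            if q:
                order.append((name, q.popleft()))
    return order


def fifo_order(queues):
    """Always pop from the entity whose head job was submitted earliest."""
    queues = {name: deque(jobs) for name, jobs in queues.items()}
    order = []
    while any(queues.values()):
        candidates = [n for n, q in queues.items() if q]
        name = min(candidates, key=lambda n: queues[n][0])
        order.append((name, queues[name].popleft()))
    return order


# Each job is represented only by its (assumed) submission timestamp.
queues = {"E1": [1, 2, 3, 4], "E2": [5]}
rr = round_robin_order(queues)
fifo = fifo_order(queues)
print(rr)
print(fifo)
```

With round robin, the E2 job submitted at t=5 runs ahead of the E1 jobs submitted at t=2..4. With FIFO selection, jobs start in submission order regardless of how long E1's queue is.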
Instead of ordering based on when jobs are added, might it be possible to order them based on when they become ready to run?
Otherwise it seems possible to e.g. submit a large number of inter-dependent jobs at once, and they would all run before any jobs from another queue get a chance.
While any of them is not ready (i.e. still has an unfulfilled
dependency), that job will not be chosen to run (see
drm_sched_entity_is_ready). In this scenario, if an earlier job
from entity E1 is not ready to run, it will be skipped and a later job
from entity E2 (which is ready) will be chosen to run, so the E1 job is
not blocking the E2 job. The moment the E1 job
does become ready, it seems logical to me to let it run ASAP, as by
now it has spent the most time of anyone waiting for execution, and I
don't think it matters that part of this time
was because it waited for a dependency job to complete its run.
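The scenario above can be sketched with a toy model in Python. This is not the kernel code: pick_entity and run are hypothetical names, each job is an assumed (submitted_at, ready_at) pair, the readiness test only stands in for what drm_sched_entity_is_ready checks, and every job is assumed to take one tick.

```python
from collections import deque


def pick_entity(entities, now):
    """Among entities whose head job is ready, pick the earliest-submitted one."""
    ready = [
        (q[0][0], name)
        for name, q in entities.items()
        if q and q[0][1] <= now  # head job's dependency is fulfilled
    ]
    return min(ready)[1] if ready else None


def run(entities, start):
    """Run one job per tick; return (start_time, entity, submitted_at) tuples."""
    entities = {name: deque(jobs) for name, jobs in entities.items()}
    now, order = start, []
    while any(entities.values()):
        name = pick_entity(entities, now)
        if name is not None:
            submitted_at, _ready_at = entities[name].popleft()
            order.append((now, name, submitted_at))
        now += 1  # assumption: every job takes exactly one tick
    return order


entities = {
    # E1's job was submitted first but its dependency completes only at t=12.
    "E1": [(0, 12)],
    # E2's jobs were submitted later and are ready immediately.
    "E2": [(1, 0), (2, 0), (3, 0), (4, 0)],
}
order = run(entities, start=10)
print(order)
```

In this model E2's jobs run while E1 is not ready, so E1 does not block E2. At t=12 the E1 job becomes ready and is picked next, because it was submitted before the remaining E2 jobs.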
Andrey