Hi Luben,

On Tuesday, 2023-04-04 at 00:31 -0400, Luben Tuikov wrote:
> On 2023-03-28 04:54, Lucas Stach wrote:
> > Hi Danilo,
> > 
> > On Tuesday, 2023-03-28 at 02:57 +0200, Danilo Krummrich wrote:
> > > Hi all,
> > > 
> > > Commit df622729ddbf ("drm/scheduler: track GPU active time per
> > > entity") tries to track the accumulated time that a job was active
> > > on the GPU, writing it to the entity through which the job was
> > > originally deployed to the scheduler. This is done within
> > > drm_sched_get_cleanup_job(), which fetches a job from the
> > > scheduler's pending_list.
> > > 
> > > Doing this can result in a race condition where the entity is
> > > already freed, but the entity's newly added elapsed_ns field is
> > > still accessed once the job is fetched from the pending_list.
> > > 
> > > After drm_sched_entity_destroy() has been called it should be safe
> > > to free the structure that embeds the entity. However, a job
> > > originally handed over to the scheduler by this entity might still
> > > reside in the scheduler's pending_list for cleanup after
> > > drm_sched_entity_destroy() has already been called and the entity
> > > has been freed. Hence, we can run into a UAF.
> > > 
> > Sorry about that, I clearly didn't properly consider this case.
> > 
> > > In my case it happened that a job, as explained above, was picked
> > > from the scheduler's pending_list just after the entity was freed
> > > due to the client application exiting. Meanwhile, this freed-up
> > > memory had already been allocated again for a subsequent client
> > > application's job structure. Hence, the new job's memory got
> > > corrupted. Luckily, I was able to reproduce the same corruption
> > > over and over again by just using deqp-runner to run a specific
> > > set of VK test cases in parallel.
> > > 
> > > Fixing this issue doesn't seem to be very straightforward though
> > > (unless I'm missing something), which is why I'm writing this mail
> > > instead of sending a fix directly.
> > > 
> > > Off the top of my head, I see three options to fix it:
> > > 
> > > 1. Rather than embedding the entity into driver specific
> > > structures (e.g. tied to file_priv) we could allocate the entity
> > > separately and reference count it, such that it's only freed up
> > > once all jobs that were deployed through this entity are fetched
> > > from the scheduler's pending_list.
> > > 
> > My vote is on this or something in a similar vein for the long
> > term. I have some hope to be able to add a GPU scheduling algorithm
> > with a bit more fairness than the current one sometime in the
> > future, which requires execution time tracking on the entities.
> 
> Danilo,
> 
> Using kref is preferable, i.e. option 1 above.
> 
> Lucas, can you shed some light on,
> 
> 1. In what way the current FIFO scheduling is unfair, and
> 2. shed some details on this "scheduling algorithm with a bit
> more fairness than the current one"?

I don't have a specific implementation in mind yet. However, the
current FIFO algorithm can be very unfair if you have a sparse
workload competing with one that generates a lot of jobs without any
throttling aside from the entity queue length. By tracking the actual
GPU time consumed by the entities we could implement something with a
bit more fairness like deficit round robin (don't pin me on the
specific algorithm, as I haven't given it much thought yet).

Regards,
Lucas
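
P.S.: Since option 1 seems to be where this is heading, here is a very
rough, untested sketch of what reference counting the entity could look
like. The refcount member and the get/put/release helpers below are
invented purely for illustration (the current struct drm_sched_entity
has nothing like them), and locking as well as error handling is
omitted:

struct drm_sched_entity {
        /* ... existing members ... */
        struct kref refcount;
};

static void drm_sched_entity_release(struct kref *kref)
{
        struct drm_sched_entity *entity =
                container_of(kref, struct drm_sched_entity, refcount);

        /* Only valid if the entity is allocated separately, as in
         * option 1, rather than embedded in a driver structure. */
        kfree(entity);
}

/* Taken whenever a job is pushed through the entity ... */
static inline struct drm_sched_entity *
drm_sched_entity_get(struct drm_sched_entity *entity)
{
        kref_get(&entity->refcount);
        return entity;
}

/* ... and dropped only after drm_sched_get_cleanup_job() has removed
 * the job from the pending_list and updated elapsed_ns. */
static inline void drm_sched_entity_put(struct drm_sched_entity *entity)
{
        kref_put(&entity->refcount, drm_sched_entity_release);
}

The idea being that each pending job keeps the entity alive until its
cleanup is done, so the elapsed_ns write can no longer hit freed
memory.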
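
And just to make the deficit round robin remark a bit more concrete
(again purely illustrative: deficit_ns and the quantum are made up, and
runqueue locking is omitted), entity selection could roughly look like
this:

#define DRM_SCHED_DRR_QUANTUM_NS (2 * NSEC_PER_MSEC)

static struct drm_sched_entity *
drm_sched_rq_select_entity_drr(struct drm_sched_rq *rq)
{
        struct drm_sched_entity *entity;

        list_for_each_entry(entity, &rq->entities, list) {
                if (!drm_sched_entity_is_ready(entity))
                        continue;

                /* Every selection round grants the entity another
                 * slice of GPU time. */
                entity->deficit_ns += DRM_SCHED_DRR_QUANTUM_NS;

                /* Run it only if it hasn't already consumed more GPU
                 * time than it was granted; entities with heavy jobs
                 * have to wait until their deficit catches up. */
                if (entity->elapsed_ns <= entity->deficit_ns)
                        return entity;
        }

        return NULL;
}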