Re: [Regression] drm/scheduler: track GPU active time per entity

Christian König <christian.koenig@xxxxxxx> · Tue, 28 Mar 2023 12:29:44 +0200

The same strategy was suggested before for amdgpu and reverted as well 
because of running into the same problems.

No idea how that slipped, Andrew reviewed it and IIRC he was also the 
one who reverted the initial approach for amdgpu.

Instead of letting the scheduler sum that stuff up proactively we should 
take care of that re-actively when the fdinfo data is queried like 
amdgpu does as well.

We should probably revert that patch for now and then discuss how we can 
do this in a generalized helper.

Regards,
Christian.

Am 28.03.23 um 02:54 schrieb Danilo Krummrich:
Hi all,

Commit df622729ddbf ("drm/scheduler: track GPU active time per 
entity") tries to track the accumulated time that a job was active on 
the GPU writing it to the entity through which the job was deployed to 
the scheduler originally. This is done within 
drm_sched_get_cleanup_job() which fetches a job from the schedulers 
pending_list.

Doing this can result in a race condition where the entity is already 
freed, but the entity's newly added elapsed_ns field is still accessed 
once the job is fetched from the pending_list.

After drm_sched_entity_destroy() being called it should be safe to 
free the structure that embeds the entity. However, a job originally 
handed over to the scheduler by this entity might still reside in the 
schedulers pending_list for cleanup after drm_sched_entity_destroy() 
already being called and the entity being freed. Hence, we can run 
into a UAF.

In my case it happened that a job, as explained above, was just picked 
from the schedulers pending_list after the entity was freed due to the 
client application exiting. Meanwhile this freed up memory was already 
allocated for a subsequent client applications job structure again. 
Hence, the new jobs memory got corrupted. Luckily, I was able to 
reproduce the same corruption over and over again by just using 
deqp-runner to run a specific set of VK test cases in parallel.

Fixing this issue doesn't seem to be very straightforward though 
(unless I miss something), which is why I'm writing this mail instead 
of sending a fix directly.

Spontaneously, I see three options to fix it:

1. Rather than embedding the entity into driver specific structures 
(e.g. tied to file_priv) we could allocate the entity separately and 
reference count it, such that it's only freed up once all jobs that 
were deployed through this entity are fetched from the schedulers 
pending list.

2. Somehow make sure drm_sched_entity_destroy() does block until all 
jobs deployed through this entity were fetched from the schedulers 
pending list. Though, I'm pretty sure that this is not really desirable.

3. Just revert the change and let drivers implement tracking of GPU 
active times themselves.

In the case of just reverting the change I'd propose to also set a 
jobs entity pointer to NULL  once the job was taken from the entity, 
such that in case of a future issue we fail where the actual issue 
resides and to make it more obvious that the field shouldn't be used 
anymore after the job was taken from the entity.

I'm happy to implement the solution we agree on. However, it might 
also make sense to revert the change until we have a solution in 
place. I'm also happy to send a revert with a proper description of 
the problem. Please let me know what you think.

- Danilo