On 2023-04-05 10:05, Danilo Krummrich wrote: > On 4/4/23 06:31, Luben Tuikov wrote: >> On 2023-03-28 04:54, Lucas Stach wrote: >>> Hi Danilo, >>> >>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich: >>>> Hi all, >>>> >>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity") >>>> tries to track the accumulated time that a job was active on the GPU >>>> writing it to the entity through which the job was deployed to the >>>> scheduler originally. This is done within drm_sched_get_cleanup_job() >>>> which fetches a job from the schedulers pending_list. >>>> >>>> Doing this can result in a race condition where the entity is already >>>> freed, but the entity's newly added elapsed_ns field is still accessed >>>> once the job is fetched from the pending_list. >>>> >>>> After drm_sched_entity_destroy() being called it should be safe to free >>>> the structure that embeds the entity. However, a job originally handed >>>> over to the scheduler by this entity might still reside in the >>>> schedulers pending_list for cleanup after drm_sched_entity_destroy() >>>> already being called and the entity being freed. Hence, we can run into >>>> a UAF. >>>> >>> Sorry about that, I clearly didn't properly consider this case. >>> >>>> In my case it happened that a job, as explained above, was just picked >>>> from the schedulers pending_list after the entity was freed due to the >>>> client application exiting. Meanwhile this freed up memory was already >>>> allocated for a subsequent client applications job structure again. >>>> Hence, the new jobs memory got corrupted. Luckily, I was able to >>>> reproduce the same corruption over and over again by just using >>>> deqp-runner to run a specific set of VK test cases in parallel. >>>> >>>> Fixing this issue doesn't seem to be very straightforward though (unless >>>> I miss something), which is why I'm writing this mail instead of sending >>>> a fix directly. >>>> >>>> Spontaneously, I see three options to fix it: >>>> >>>> 1. Rather than embedding the entity into driver specific structures >>>> (e.g. tied to file_priv) we could allocate the entity separately and >>>> reference count it, such that it's only freed up once all jobs that were >>>> deployed through this entity are fetched from the schedulers pending list. >>>> >>> My vote is on this or something in similar vain for the long term. I >>> have some hope to be able to add a GPU scheduling algorithm with a bit >>> more fairness than the current one sometime in the future, which >>> requires execution time tracking on the entities. >> >> Danilo, >> >> Using kref is preferable, i.e. option 1 above. > > I think the only real motivation for doing that would be for generically > tracking job statistics within the entity a job was deployed through. If > we all agree on tracking job statistics this way I am happy to prepare a > patch for this option and drop this one: > https://lore.kernel.org/all/20230331000622.4156-1-dakr@xxxxxxxxxx/T/#u Hmm, I never thought about "job statistics" when I preferred using kref above. The reason kref is attractive is because one doesn't need to worry about it--when the last user drops the kref, the release is called to do housekeeping. If this never happens, we know that we have a bug to debug. Regarding the patch above--I did look around the code, and it seems safe, as per your analysis, I didn't see any reference to entity after job submission, but I'll comment on that thread as well for the record. Regards, Luben > > Christian mentioned amdgpu tried something similar to what Lucas tried > running into similar trouble, backed off and implemented it in another > way - a driver specific way I guess? > >> >> Lucas, can you shed some light on, >> >> 1. In what way the current FIFO scheduling is unfair, and >> 2. shed some details on this "scheduling algorithm with a bit >> more fairness than the current one"? >> >> Regards, >> Luben >> >>> >>>> 2. Somehow make sure drm_sched_entity_destroy() does block until all >>>> jobs deployed through this entity were fetched from the schedulers >>>> pending list. Though, I'm pretty sure that this is not really desirable. >>>> >>>> 3. Just revert the change and let drivers implement tracking of GPU >>>> active times themselves. >>>> >>> Given that we are already pretty late in the release cycle and etnaviv >>> being the only driver so far making use of the scheduler elapsed time >>> tracking I think the right short term solution is to either move the >>> tracking into etnaviv or just revert the change for now. I'll have a >>> look at this. >>> >>> Regards, >>> Lucas >>> >>>> In the case of just reverting the change I'd propose to also set a jobs >>>> entity pointer to NULL once the job was taken from the entity, such >>>> that in case of a future issue we fail where the actual issue resides >>>> and to make it more obvious that the field shouldn't be used anymore >>>> after the job was taken from the entity. >>>> >>>> I'm happy to implement the solution we agree on. However, it might also >>>> make sense to revert the change until we have a solution in place. I'm >>>> also happy to send a revert with a proper description of the problem. >>>> Please let me know what you think. >>>> >>>> - Danilo >>>> >>> >> >