On Thu, 6 Apr 2023 at 10:22, Christian König <christian.koenig@xxxxxxx> wrote:
>
> Am 05.04.23 um 18:09 schrieb Luben Tuikov:
> > On 2023-04-05 10:05, Danilo Krummrich wrote:
> >> On 4/4/23 06:31, Luben Tuikov wrote:
> >>> On 2023-03-28 04:54, Lucas Stach wrote:
> >>>> Hi Danilo,
> >>>>
> >>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> >>>>> Hi all,
> >>>>>
> >>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> >>>>> tries to track the accumulated time that a job was active on the GPU,
> >>>>> writing it to the entity through which the job was originally deployed
> >>>>> to the scheduler. This is done within drm_sched_get_cleanup_job(),
> >>>>> which fetches a job from the scheduler's pending_list.
> >>>>>
> >>>>> Doing this can result in a race condition where the entity is already
> >>>>> freed, but the entity's newly added elapsed_ns field is still accessed
> >>>>> once the job is fetched from the pending_list.
> >>>>>
> >>>>> After drm_sched_entity_destroy() has been called it should be safe to
> >>>>> free the structure that embeds the entity. However, a job originally
> >>>>> handed over to the scheduler by this entity might still reside in the
> >>>>> scheduler's pending_list for cleanup after drm_sched_entity_destroy()
> >>>>> has already been called and the entity has been freed. Hence, we can
> >>>>> run into a UAF.
> >>>>>
> >>>> Sorry about that, I clearly didn't properly consider this case.
> >>>>
> >>>>> In my case it happened that a job, as explained above, was just picked
> >>>>> from the scheduler's pending_list after the entity was freed due to the
> >>>>> client application exiting. Meanwhile this freed-up memory was already
> >>>>> allocated for a subsequent client application's job structure again.
> >>>>> Hence, the new job's memory got corrupted. Luckily, I was able to
> >>>>> reproduce the same corruption over and over again by just using
> >>>>> deqp-runner to run a specific set of VK test cases in parallel.
> >>>>>
> >>>>> Fixing this issue doesn't seem to be very straightforward though
> >>>>> (unless I'm missing something), which is why I'm writing this mail
> >>>>> instead of sending a fix directly.
> >>>>>
> >>>>> Spontaneously, I see three options to fix it:
> >>>>>
> >>>>> 1. Rather than embedding the entity into driver-specific structures
> >>>>> (e.g. tied to file_priv) we could allocate the entity separately and
> >>>>> reference count it, such that it's only freed up once all jobs that
> >>>>> were deployed through this entity are fetched from the scheduler's
> >>>>> pending list.
> >>>>>
> >>>> My vote is for this, or something in a similar vein, for the long term.
> >>>> I have some hope to be able to add a GPU scheduling algorithm with a
> >>>> bit more fairness than the current one sometime in the future, which
> >>>> requires execution time tracking on the entities.
> >>> Danilo,
> >>>
> >>> Using kref is preferable, i.e. option 1 above.
> >> I think the only real motivation for doing that would be for generically
> >> tracking job statistics within the entity a job was deployed through. If
> >> we all agree on tracking job statistics this way I am happy to prepare a
> >> patch for this option and drop this one:
> >> https://lore.kernel.org/all/20230331000622.4156-1-dakr@xxxxxxxxxx/T/#u
> > Hmm, I never thought about "job statistics" when I preferred using kref above.
> > The reason kref is attractive is because one doesn't need to worry about
> > it--when the last user drops the kref, the release is called to do
> > housekeeping. If this never happens, we know that we have a bug to debug.
>
> Yeah, reference counting unfortunately has some traps as well. For
> example, rarely dropping the last reference from interrupt context, or
> with some unexpected locks held when the cleanup function doesn't expect
> that, is a good recipe for problems as well.
>
> > Regarding the patch above--I did look around the code, and it seems safe
> > as per your analysis; I didn't see any reference to the entity after job
> > submission, but I'll comment on that thread as well for the record.
>
> Reference counting the entities was suggested before. We intentionally
> avoided that so far because the entity might be the tip of the iceberg
> of stuff you need to keep around.
>
> For example, for command submission you also need the VM, and when you
> keep the VM alive you also need to keep the file private alive....

Yeah, refcounting often looks like the easy way out to avoid
use-after-free issues, until you realize you've just made lifetimes
unbounded and have some enormous leaks: the entity keeps the VM alive,
the VM keeps all the BOs alive, and somehow every crash wastes more
memory because vk_device_lost means userspace allocates new stuff for
everything.

If possible, a lifetime design where lifetimes have hard bounds and you
just borrow a reference under a lock (or some other ownership rule) is
generally much more solid. But also much harder to design correctly :-/

> In addition to that we have some ugly interdependencies between tearing
> down an application (potentially with a KILL signal from the OOM killer)
> and backward compatibility for some applications which render something
> and quit before the rendering is completed in the hardware.

Yeah, I think that part would also be good to sort out once and for all
in drm/sched, because i915 has/had the same struggle.
-Daniel

>
> Regards,
> Christian.
>
> >
> > Regards,
> > Luben
> >
> >> Christian mentioned amdgpu tried something similar to what Lucas tried,
> >> ran into similar trouble, backed off and implemented it in another
> >> way - a driver-specific way I guess?
> >>
> >>> Lucas, can you shed some light on,
> >>>
> >>> 1. In what way the current FIFO scheduling is unfair, and
> >>> 2. shed some details on this "scheduling algorithm with a bit
> >>> more fairness than the current one"?
> >>>
> >>> Regards,
> >>> Luben
> >>>
> >>>>> 2. Somehow make sure drm_sched_entity_destroy() does block until all
> >>>>> jobs deployed through this entity were fetched from the scheduler's
> >>>>> pending list. Though, I'm pretty sure that this is not really desirable.
> >>>>>
> >>>>> 3. Just revert the change and let drivers implement tracking of GPU
> >>>>> active times themselves.
> >>>>>
> >>>> Given that we are already pretty late in the release cycle and etnaviv
> >>>> is the only driver so far making use of the scheduler's elapsed-time
> >>>> tracking, I think the right short-term solution is to either move the
> >>>> tracking into etnaviv or just revert the change for now. I'll have a
> >>>> look at this.
> >>>>
> >>>> Regards,
> >>>> Lucas
> >>>>
> >>>>> In the case of just reverting the change I'd propose to also set a job's
> >>>>> entity pointer to NULL once the job was taken from the entity, such
> >>>>> that, in case of a future issue, we fail where the actual issue resides
> >>>>> and to make it more obvious that the field shouldn't be used anymore
> >>>>> after the job was taken from the entity.
> >>>>>
> >>>>> I'm happy to implement the solution we agree on. However, it might also
> >>>>> make sense to revert the change until we have a solution in place. I'm
> >>>>> also happy to send a revert with a proper description of the problem.
> >>>>> Please let me know what you think.
> >>>>>
> >>>>> - Danilo
> >>>>>
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
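For illustration, a minimal sketch of what option 1 (a kref-counted entity)
could look like is below. The my_entity wrapper, its field names and the call
sites noted in the comments are hypothetical, not existing drm/sched or driver
code:

/*
 * Hypothetical sketch of option 1: wrap the entity in a reference-counted
 * structure. The owner (e.g. file_priv teardown) holds one reference and
 * each job handed to the scheduler holds another, so the memory embedding
 * the entity is only freed once the last pending job has been cleaned up.
 */
#include <linux/container_of.h>
#include <linux/kref.h>
#include <linux/slab.h>

#include <drm/gpu_scheduler.h>

struct my_entity {
	struct drm_sched_entity base;
	struct kref refcount;
};

static struct my_entity *my_entity_create(void)
{
	struct my_entity *e = kzalloc(sizeof(*e), GFP_KERNEL);

	if (!e)
		return NULL;

	kref_init(&e->refcount);	/* the owner's reference */
	/* drm_sched_entity_init(&e->base, ...) would follow here as usual. */
	return e;
}

static void my_entity_release(struct kref *kref)
{
	struct my_entity *e = container_of(kref, struct my_entity, refcount);

	/* Only reached once the owner and all pending jobs dropped their ref. */
	kfree(e);
}

/* Taken when a job is pushed to the scheduler. */
static inline struct my_entity *my_entity_get(struct my_entity *e)
{
	kref_get(&e->refcount);
	return e;
}

/* Dropped from the driver's free_job() path, after accounting elapsed_ns. */
static inline void my_entity_put(struct my_entity *e)
{
	kref_put(&e->refcount, my_entity_release);
}

With something along these lines the driver would still call
drm_sched_entity_destroy() on teardown as today, but would then drop its own
reference instead of freeing the embedding structure directly, so the memory
stays valid until drm_sched_get_cleanup_job() has processed the last pending
job.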