On 2023-06-23 04:03, Boris Brezillon wrote: > On Fri, 23 Jun 2023 09:52:04 +0200 > Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote: > >> Drivers that can delegate waits to the firmware/GPU pass the scheduled >> fence to drm_sched_job_add_dependency(), and issue wait commands to >> the firmware/GPU at job submission time. For this to be possible, they >> need all their 'native' dependencies to have a valid parent since this >> is where the actual HW fence information are encoded. >> >> In drm_sched_main(), we currently call drm_sched_fence_set_parent() >> after drm_sched_fence_scheduled(), leaving a short period of time >> during which the job depending on this fence can be submitted. >> >> Since setting parent and signaling the fence are two things that are >> kinda related (you can't have a parent if the job hasn't been scheduled), >> it probably makes sense to pass the parent fence to >> drm_sched_fence_scheduled() and let it call drm_sched_fence_set_parent() >> before it signals the scheduled fence. >> >> Here is a detailed description of the race we are fixing here: >> >> Thread A Thread B >> >> - calls drm_sched_fence_scheduled() >> - signals s_fence->scheduled which >> wakes up thread B >> >> - entity dep signaled, checking >> the next dep >> - no more deps waiting >> - entity is picked for job >> submission by drm_gpu_scheduler >> - run_job() is called >> - run_job() tries to >> collect native fence info from >> s_fence->parent, but it's >> NULL => >> BOOM, we can't do our native >> wait >> >> - calls drm_sched_fence_set_parent() >> >> v2: >> * Fix commit message >> >> v3: >> * Add a detailed description of the race to the commit message >> * Add Luben's R-b >> > > FYI, I didn't put a Fixes tag because the various moves/modifications > that happened on this file will make it hard to backport anyway, and no > one complained about it so far. But if we want to have one, it would > probably be: > > Fixes: 754ce0fa55c4 ("drm/amd: add parent for sched fence") > I agree with your assessment--the race fix doesn't seem to be pointing to or introduced by one particular change. Plus that fixes change is from 2016... So, we're good to go as is. -- Regards, Luben