On Fri, 23 Jun 2023 14:37:57 -0400 Luben Tuikov <luben.tuikov@xxxxxxx> wrote: > On 2023-06-23 04:03, Boris Brezillon wrote: > > On Fri, 23 Jun 2023 09:52:04 +0200 > > Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote: > > > >> Drivers that can delegate waits to the firmware/GPU pass the scheduled > >> fence to drm_sched_job_add_dependency(), and issue wait commands to > >> the firmware/GPU at job submission time. For this to be possible, they > >> need all their 'native' dependencies to have a valid parent since this > >> is where the actual HW fence information are encoded. > >> > >> In drm_sched_main(), we currently call drm_sched_fence_set_parent() > >> after drm_sched_fence_scheduled(), leaving a short period of time > >> during which the job depending on this fence can be submitted. > >> > >> Since setting parent and signaling the fence are two things that are > >> kinda related (you can't have a parent if the job hasn't been scheduled), > >> it probably makes sense to pass the parent fence to > >> drm_sched_fence_scheduled() and let it call drm_sched_fence_set_parent() > >> before it signals the scheduled fence. > >> > >> Here is a detailed description of the race we are fixing here: > >> > >> Thread A Thread B > >> > >> - calls drm_sched_fence_scheduled() > >> - signals s_fence->scheduled which > >> wakes up thread B > >> > >> - entity dep signaled, checking > >> the next dep > >> - no more deps waiting > >> - entity is picked for job > >> submission by drm_gpu_scheduler > >> - run_job() is called > >> - run_job() tries to > >> collect native fence info from > >> s_fence->parent, but it's > >> NULL => > >> BOOM, we can't do our native > >> wait > >> > >> - calls drm_sched_fence_set_parent() > >> > >> v2: > >> * Fix commit message > >> > >> v3: > >> * Add a detailed description of the race to the commit message > >> * Add Luben's R-b > >> > > > > FYI, I didn't put a Fixes tag because the various moves/modifications > > that happened on this file will make it hard to backport anyway, and no > > one complained about it so far. But if we want to have one, it would > > probably be: > > > > Fixes: 754ce0fa55c4 ("drm/amd: add parent for sched fence") > > > > I agree with your assessment--the race fix doesn't seem to be pointing to > or introduced by one particular change. Plus that fixes change is from 2016... > So, we're good to go as is. Queued to drm-misc-fixes.