On Thu, Mar 23, 2023 at 2:30 PM Rob Clark <robdclark@xxxxxxxxx> wrote:
>
> On Thu, Mar 23, 2023 at 7:03 AM Christian König
> <christian.koenig@xxxxxxx> wrote:
> >
> > Am 23.03.23 um 14:54 schrieb Rob Clark:
> > > On Thu, Mar 23, 2023 at 12:35 AM Christian König
> > > <christian.koenig@xxxxxxx> wrote:
> > >> Am 22.03.23 um 23:44 schrieb Rob Clark:
> > >>> From: Rob Clark <robdclark@xxxxxxxxxxxx>
> > >>>
> > >>> Container fences have burner contexts, which makes the trick to store at
> > >>> most one fence per context somewhat useless if we don't unwrap array or
> > >>> chain fences.
> > >> Mhm, we intentionally kept them not unwrapped since this way they only
> > >> occupy one fence slot.
> > >>
> > >> But it might be better to unwrap them if you add many of those dependencies.
> > >>
> > >>> Signed-off-by: Rob Clark <robdclark@xxxxxxxxxxxx>
> > >>> ---
> > >>> tbh, I'm not sure why we weren't doing this already, unless there is
> > >>> something I'm overlooking
> > >>>
> > >>>   drivers/gpu/drm/scheduler/sched_main.c | 43 +++++++++++++++++---------
> > >>>   1 file changed, 28 insertions(+), 15 deletions(-)
> > >>>
> > >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > >>> index c2ee44d6224b..f59e5335afbb 100644
> > >>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> > >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > >>> @@ -41,20 +41,21 @@
> > >>>   * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> > >>>   *    the hardware.
> > >>>   *
> > >>>   * The jobs in a entity are always scheduled in the order that they were pushed.
> > >>>   */
> > >>>
> > >>>  #include <linux/kthread.h>
> > >>>  #include <linux/wait.h>
> > >>>  #include <linux/sched.h>
> > >>>  #include <linux/completion.h>
> > >>> +#include <linux/dma-fence-unwrap.h>
> > >>>  #include <linux/dma-resv.h>
> > >>>  #include <uapi/linux/sched/types.h>
> > >>>
> > >>>  #include <drm/drm_print.h>
> > >>>  #include <drm/drm_gem.h>
> > >>>  #include <drm/gpu_scheduler.h>
> > >>>  #include <drm/spsc_queue.h>
> > >>>
> > >>>  #define CREATE_TRACE_POINTS
> > >>>  #include "gpu_scheduler_trace.h"
> > >>> @@ -665,41 +666,27 @@ void drm_sched_job_arm(struct drm_sched_job *job)
> > >>>  	sched = entity->rq->sched;
> > >>>
> > >>>  	job->sched = sched;
> > >>>  	job->s_priority = entity->rq - sched->sched_rq;
> > >>>  	job->id = atomic64_inc_return(&sched->job_id_count);
> > >>>
> > >>>  	drm_sched_fence_init(job->s_fence, job->entity);
> > >>>  }
> > >>>  EXPORT_SYMBOL(drm_sched_job_arm);
> > >>>
> > >>> -/**
> > >>> - * drm_sched_job_add_dependency - adds the fence as a job dependency
> > >>> - * @job: scheduler job to add the dependencies to
> > >>> - * @fence: the dma_fence to add to the list of dependencies.
> > >>> - *
> > >>> - * Note that @fence is consumed in both the success and error cases.
> > >>> - *
> > >>> - * Returns:
> > >>> - * 0 on success, or an error on failing to expand the array.
> > >>> - */
> > >>> -int drm_sched_job_add_dependency(struct drm_sched_job *job,
> > >>> -				 struct dma_fence *fence)
> > >>> +static int _add_dependency(struct drm_sched_job *job, struct dma_fence *fence)
> > >> Please keep the drm_sched_job_ prefix here even for static functions.
> > >> The symbol _add_dependency just sucks in a backtrace, especially when
> > >> it's tail optimized.
> > >>
> > >>>  {
> > >>>  	struct dma_fence *entry;
> > >>>  	unsigned long index;
> > >>>  	u32 id = 0;
> > >>>  	int ret;
> > >>>
> > >>> -	if (!fence)
> > >>> -		return 0;
> > >>> -
> > >>>  	/* Deduplicate if we already depend on a fence from the same context.
> > >>>  	 * This lets the size of the array of deps scale with the number of
> > >>>  	 * engines involved, rather than the number of BOs.
> > >>>  	 */
> > >>>  	xa_for_each(&job->dependencies, index, entry) {
> > >>>  		if (entry->context != fence->context)
> > >>>  			continue;
> > >>>
> > >>>  		if (dma_fence_is_later(fence, entry)) {
> > >>>  			dma_fence_put(entry);
> > >>> @@ -709,20 +696,46 @@ int drm_sched_job_add_dependency(struct drm_sched_job *job,
> > >>>  		}
> > >>>  		return 0;
> > >>>  	}
> > >>>
> > >>>  	ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b, GFP_KERNEL);
> > >>>  	if (ret != 0)
> > >>>  		dma_fence_put(fence);
> > >>>
> > >>>  	return ret;
> > >>>  }
> > >>> +
> > >>> +/**
> > >>> + * drm_sched_job_add_dependency - adds the fence as a job dependency
> > >>> + * @job: scheduler job to add the dependencies to
> > >>> + * @fence: the dma_fence to add to the list of dependencies.
> > >>> + *
> > >>> + * Note that @fence is consumed in both the success and error cases.
> > >>> + *
> > >>> + * Returns:
> > >>> + * 0 on success, or an error on failing to expand the array.
> > >>> + */
> > >>> +int drm_sched_job_add_dependency(struct drm_sched_job *job,
> > >>> +				 struct dma_fence *fence)
> > >> Maybe name the new function drm_sched_job_unwrap_add_dependency or
> > >> something like this.
> > >>
> > >> I need to double check, but I think for some cases we don't need or
> > >> don't even want this in the driver.
> > > I'd be curious to know the cases where you don't want this.. one thing
> > > I was thinking about, what if you have a container fence with two
> > > contained fences. One is on the same ctx as the job, one is not but
> > > signals sooner. You end up artificially waiting on both, which seems
> > > sub-optimal.
> >
> > Well resv objects don't contain other containers for example.
>
> I suppose I have the explicit sync case more in mind, where the
> dependent fence ends up being a chain or array (if userspace is
> merging fence fd's).
>
> > Then we also have a use case in amdgpu where fences need to be
> > explicitly waited for even when they are from the same ctx as the job
> > because otherwise we wouldn't see everything cache coherent.
>
> This was the kinda weird case I wanted to make sure I wasn't breaking.
> I remember seeing something fly by for this, but can't find it now or
> remember what amdgpu's solution was..
>
> > On the other hand we currently handle that amdgpu use case differently
> > and the extra overhead of unwrapping fences even if they can't be
> > containers is probably negligible.
> >
> > > Anyways, I can make this a new entrypoint which unwraps, and/or rename
> > > the internal static function, if we think this is a good idea.
> >
> > If you think that's unnecessary, keeping your original approach is fine
> > with me as well.
>
> I'm going to assume unnecessary until someone speaks up with their
> weird specific case ;-)

So, this patch turns out to blow up spectacularly with dma_fence refcnt
underflows when I enable DRIVER_SYNCOBJ_TIMELINE .. I think because it
starts unwrapping fence chains, possibly in parallel with fence
signaling on the retire path.  Is it supposed to be permissible to
unwrap a fence chain concurrently?
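
For reference, a minimal sketch of what the unwrap iteration yields for
a container, i.e. why a chain/array from a merged fence fd gets
flattened into its leaf fences before the per-context dedup runs.  This
is illustrative only, not part of the patch; show_unwrap() is a made-up
helper, assuming the linux/dma-fence-unwrap.h API:

	#include <linux/dma-fence.h>
	#include <linux/dma-fence-unwrap.h>
	#include <linux/printk.h>

	static void show_unwrap(struct dma_fence *maybe_container)
	{
		struct dma_fence_unwrap iter;
		struct dma_fence *f;

		/*
		 * dma_fence_unwrap_for_each() walks dma_fence_chain and
		 * dma_fence_array containers recursively and yields only
		 * leaf fences; a plain fence is yielded as-is.
		 */
		dma_fence_unwrap_for_each(f, &iter, maybe_container)
			pr_info("leaf fence ctx=%llu seqno=%llu\n",
				f->context, f->seqno);
	}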
BR,
-R

> BR,
> -R
>
> > Regards,
> > Christian.
> >
> > >
> > > BR,
> > > -R
> > >
> > >> Christian.
> > >>
> > >>> +{
> > >>> +	struct dma_fence_unwrap iter;
> > >>> +	struct dma_fence *f;
> > >>> +	int ret = 0;
> > >>> +
> > >>> +	dma_fence_unwrap_for_each (f, &iter, fence) {
> > >>> +		ret = _add_dependency(job, f);
> > >>> +		if (ret)
> > >>> +			break;
> > >>> +	}
> > >>> +
> > >>> +	return ret;
> > >>> +}
> > >>>  EXPORT_SYMBOL(drm_sched_job_add_dependency);
> > >>>
> > >>>  /**
> > >>>   * drm_sched_job_add_resv_dependencies - add all fences from the resv to the job
> > >>>   * @job: scheduler job to add the dependencies to
> > >>>   * @resv: the dma_resv object to get the fences from
> > >>>   * @usage: the dma_resv_usage to use to filter the fences
> > >>>   *
> > >>>   * This adds all fences matching the given usage from @resv to @job.
> > >>>   * Must be called with the @resv lock held.
> >