Re: [RFC] drm/scheduler: Unwrap job dependencies

On Thu, Mar 23, 2023 at 7:03 AM Christian König
<christian.koenig@xxxxxxx> wrote:
>
> Am 23.03.23 um 14:54 schrieb Rob Clark:
> > On Thu, Mar 23, 2023 at 12:35 AM Christian König
> > <christian.koenig@xxxxxxx> wrote:
> >> Am 22.03.23 um 23:44 schrieb Rob Clark:
> >>> From: Rob Clark <robdclark@xxxxxxxxxxxx>
> >>>
> >>> Container fences have burner contexts, which makes the trick to store at
> >>> most one fence per context somewhat useless if we don't unwrap array or
> >>> chain fences.
> >> Mhm, we intentionally didn't unwrap them, since this way they only
> >> occupy one fence slot.
> >>
> >> But it might be better to unwrap them if you add many of those dependencies.
> >>
> >>> Signed-off-by: Rob Clark <robdclark@xxxxxxxxxxxx>
> >>> ---
> >>> tbh, I'm not sure why we weren't doing this already, unless there is
> >>> something I'm overlooking
> >>>
> >>>    drivers/gpu/drm/scheduler/sched_main.c | 43 +++++++++++++++++---------
> >>>    1 file changed, 28 insertions(+), 15 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >>> index c2ee44d6224b..f59e5335afbb 100644
> >>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >>> @@ -41,20 +41,21 @@
> >>>     * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> >>>     *    the hardware.
> >>>     *
> >>>     * The jobs in a entity are always scheduled in the order that they were pushed.
> >>>     */
> >>>
> >>>    #include <linux/kthread.h>
> >>>    #include <linux/wait.h>
> >>>    #include <linux/sched.h>
> >>>    #include <linux/completion.h>
> >>> +#include <linux/dma-fence-unwrap.h>
> >>>    #include <linux/dma-resv.h>
> >>>    #include <uapi/linux/sched/types.h>
> >>>
> >>>    #include <drm/drm_print.h>
> >>>    #include <drm/drm_gem.h>
> >>>    #include <drm/gpu_scheduler.h>
> >>>    #include <drm/spsc_queue.h>
> >>>
> >>>    #define CREATE_TRACE_POINTS
> >>>    #include "gpu_scheduler_trace.h"
> >>> @@ -665,41 +666,27 @@ void drm_sched_job_arm(struct drm_sched_job *job)
> >>>        sched = entity->rq->sched;
> >>>
> >>>        job->sched = sched;
> >>>        job->s_priority = entity->rq - sched->sched_rq;
> >>>        job->id = atomic64_inc_return(&sched->job_id_count);
> >>>
> >>>        drm_sched_fence_init(job->s_fence, job->entity);
> >>>    }
> >>>    EXPORT_SYMBOL(drm_sched_job_arm);
> >>>
> >>> -/**
> >>> - * drm_sched_job_add_dependency - adds the fence as a job dependency
> >>> - * @job: scheduler job to add the dependencies to
> >>> - * @fence: the dma_fence to add to the list of dependencies.
> >>> - *
> >>> - * Note that @fence is consumed in both the success and error cases.
> >>> - *
> >>> - * Returns:
> >>> - * 0 on success, or an error on failing to expand the array.
> >>> - */
> >>> -int drm_sched_job_add_dependency(struct drm_sched_job *job,
> >>> -                              struct dma_fence *fence)
> >>> +static int _add_dependency(struct drm_sched_job *job, struct dma_fence *fence)
> >> Please keep the drm_sched_job_ prefix here even for static functions.
> >> The symbol _add_dependency just sucks in a backtrace, especially when
> >> it's tail optimized.
> >>
> >>>    {
> >>>        struct dma_fence *entry;
> >>>        unsigned long index;
> >>>        u32 id = 0;
> >>>        int ret;
> >>>
> >>> -     if (!fence)
> >>> -             return 0;
> >>> -
> >>>        /* Deduplicate if we already depend on a fence from the same context.
> >>>         * This lets the size of the array of deps scale with the number of
> >>>         * engines involved, rather than the number of BOs.
> >>>         */
> >>>        xa_for_each(&job->dependencies, index, entry) {
> >>>                if (entry->context != fence->context)
> >>>                        continue;
> >>>
> >>>                if (dma_fence_is_later(fence, entry)) {
> >>>                        dma_fence_put(entry);
> >>> @@ -709,20 +696,46 @@ int drm_sched_job_add_dependency(struct drm_sched_job *job,
> >>>                }
> >>>                return 0;
> >>>        }
> >>>
> >>>        ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b, GFP_KERNEL);
> >>>        if (ret != 0)
> >>>                dma_fence_put(fence);
> >>>
> >>>        return ret;
> >>>    }
> >>> +
> >>> +/**
> >>> + * drm_sched_job_add_dependency - adds the fence as a job dependency
> >>> + * @job: scheduler job to add the dependencies to
> >>> + * @fence: the dma_fence to add to the list of dependencies.
> >>> + *
> >>> + * Note that @fence is consumed in both the success and error cases.
> >>> + *
> >>> + * Returns:
> >>> + * 0 on success, or an error on failing to expand the array.
> >>> + */
> >>> +int drm_sched_job_add_dependency(struct drm_sched_job *job,
> >>> +                              struct dma_fence *fence)
> >> Maybe name the new function drm_sched_job_unwrap_add_dependency or
> >> something like this.
> >>
> >> I need to double check, but I think for some cases we don't need or
> >> don't even want this in the driver.
> > I'd be curious to know the cases where you don't want this... One thing
> > I was thinking about: what if you have a container fence with two
> > contained fences?  One is on the same ctx as the job; the other is not,
> > but signals sooner.  You end up artificially waiting on both, which seems
> > sub-optimal.
>
> Well, resv objects don't contain other containers, for example.

I suppose I have the explicit sync case more in mind, where the
dependency fence ends up being a chain or array (if userspace is
merging fence fds).
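To illustrate (just a sketch to make the point, helper name made up, not
part of the patch): the merged fd comes back as a fence array/chain whose
->context is a one-off burner context, so a straight ->context comparison
never matches, and you have to look at the contained fences instead:

#include <linux/dma-fence.h>
#include <linux/dma-fence-unwrap.h>

/* Does @fence (possibly an array/chain) contain a fence from @ctx?
 * Comparing fence->context directly would only ever see the container's
 * burner context, so look at the contained fences instead.
 */
static bool fence_contains_context(struct dma_fence *fence, u64 ctx)
{
	struct dma_fence_unwrap iter;
	struct dma_fence *f;

	dma_fence_unwrap_for_each(f, &iter, fence) {
		if (f->context == ctx)
			return true;
	}

	return false;
}

Which is basically what the dedup in _add_dependency() gets to do once it
is handed the contained fences one at a time.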

> Then we also have a use case in amdgpu where fences need to be
> explicitly waited for even when they are from the same ctx as the job,
> because otherwise we wouldn't see everything cache coherent.

This was the kinda weird case I wanted to make sure I wasn't breaking.
I remember seeing something fly by for this, but I can't find it now or
remember what amdgpu's solution was...

> On the other hand, we currently handle that amdgpu use case differently,
> and the extra overhead of unwrapping fences, even if they can't be
> containers, is probably negligible.
>
> > Anyways, I can make this a new entrypoint which unwraps, and/or rename
> > the internal static function, if we think this is a good idea.
>
> If you think that's unnecessary, keeping your original approach is fine
> with me as well.

I'm going to assume unnecessary until someone speaks up with their
weird specific case ;-)
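
If we do end up wanting the split, a rough sketch of the separate
entrypoint (using the name you suggested; just how I'd imagine it, not
what this patch does) could look like:

#include <linux/dma-fence-unwrap.h>
#include <drm/gpu_scheduler.h>

int drm_sched_job_unwrap_add_dependency(struct drm_sched_job *job,
					struct dma_fence *fence)
{
	struct dma_fence_unwrap iter;
	struct dma_fence *f;
	int ret = 0;

	dma_fence_unwrap_for_each(f, &iter, fence) {
		/* add_dependency() consumes a reference, but the unwrap
		 * iterator only borrows one, so take an extra ref here.
		 */
		dma_fence_get(f);
		ret = drm_sched_job_add_dependency(job, f);
		if (ret)
			break;
	}

	/* Drop the caller's reference to the container itself. */
	dma_fence_put(fence);
	return ret;
}

That way drm_sched_job_add_dependency() stays non-unwrapping and drivers
opt in to unwrapping per call site.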

BR,
-R

> Regards,
> Christian.
>
> >
> > BR,
> > -R
> >
> >> Christian.
> >>
> >>> +{
> >>> +     struct dma_fence_unwrap iter;
> >>> +     struct dma_fence *f;
> >>> +     int ret = 0;
> >>> +
> >>> +     dma_fence_unwrap_for_each (f, &iter, fence) {
> >>> +             ret = _add_dependency(job, f);
> >>> +             if (ret)
> >>> +                     break;
> >>> +     }
> >>> +
> >>> +     return ret;
> >>> +}
> >>>    EXPORT_SYMBOL(drm_sched_job_add_dependency);
> >>>
> >>>    /**
> >>>     * drm_sched_job_add_resv_dependencies - add all fences from the resv to the job
> >>>     * @job: scheduler job to add the dependencies to
> >>>     * @resv: the dma_resv object to get the fences from
> >>>     * @usage: the dma_resv_usage to use to filter the fences
> >>>     *
> >>>     * This adds all fences matching the given usage from @resv to @job.
> >>>     * Must be called with the @resv lock held.
>



