On Thu, Mar 23, 2023 at 2:30 PM Rob Clark <robdclark@xxxxxxxxx> wrote:
>
> On Thu, Mar 23, 2023 at 7:03 AM Christian König
> <christian.koenig@xxxxxxx> wrote:
> >
> > Am 23.03.23 um 14:54 schrieb Rob Clark:
> > > On Thu, Mar 23, 2023 at 12:35 AM Christian König
> > > <christian.koenig@xxxxxxx> wrote:
> > >> Am 22.03.23 um 23:44 schrieb Rob Clark:
> > >>> From: Rob Clark <robdclark@xxxxxxxxxxxx>
> > >>>
> > >>> Container fences have burner contexts, which makes the trick to store at
> > >>> most one fence per context somewhat useless if we don't unwrap array or
> > >>> chain fences.
> > >> Mhm, we intentionally kept them not unwrapped since this way they only
> > >> occupy one fence slot.
> > >>
> > >> But it might be better to unwrap them if you add many of those dependencies.
> > >>
> > >>> Signed-off-by: Rob Clark <robdclark@xxxxxxxxxxxx>
> > >>> ---
> > >>> tbh, I'm not sure why we weren't doing this already, unless there is
> > >>> something I'm overlooking
> > >>>
> > >>>   drivers/gpu/drm/scheduler/sched_main.c | 43 +++++++++++++++++---------
> > >>>   1 file changed, 28 insertions(+), 15 deletions(-)
> > >>>
> > >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > >>> index c2ee44d6224b..f59e5335afbb 100644
> > >>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> > >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > >>> @@ -41,20 +41,21 @@
> > >>>   * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> > >>>   *    the hardware.
> > >>>   *
> > >>>   * The jobs in a entity are always scheduled in the order that they were pushed.
> > >>>   */
> > >>>
> > >>>  #include <linux/kthread.h>
> > >>>  #include <linux/wait.h>
> > >>>  #include <linux/sched.h>
> > >>>  #include <linux/completion.h>
> > >>> +#include <linux/dma-fence-unwrap.h>
> > >>>  #include <linux/dma-resv.h>
> > >>>  #include <uapi/linux/sched/types.h>
> > >>>
> > >>>  #include <drm/drm_print.h>
> > >>>  #include <drm/drm_gem.h>
> > >>>  #include <drm/gpu_scheduler.h>
> > >>>  #include <drm/spsc_queue.h>
> > >>>
> > >>>  #define CREATE_TRACE_POINTS
> > >>>  #include "gpu_scheduler_trace.h"
> > >>> @@ -665,41 +666,27 @@ void drm_sched_job_arm(struct drm_sched_job *job)
> > >>>  	sched = entity->rq->sched;
> > >>>
> > >>>  	job->sched = sched;
> > >>>  	job->s_priority = entity->rq - sched->sched_rq;
> > >>>  	job->id = atomic64_inc_return(&sched->job_id_count);
> > >>>
> > >>>  	drm_sched_fence_init(job->s_fence, job->entity);
> > >>>  }
> > >>>  EXPORT_SYMBOL(drm_sched_job_arm);
> > >>>
> > >>> -/**
> > >>> - * drm_sched_job_add_dependency - adds the fence as a job dependency
> > >>> - * @job: scheduler job to add the dependencies to
> > >>> - * @fence: the dma_fence to add to the list of dependencies.
> > >>> - *
> > >>> - * Note that @fence is consumed in both the success and error cases.
> > >>> - *
> > >>> - * Returns:
> > >>> - * 0 on success, or an error on failing to expand the array.
> > >>> - */
> > >>> -int drm_sched_job_add_dependency(struct drm_sched_job *job,
> > >>> -				 struct dma_fence *fence)
> > >>> +static int _add_dependency(struct drm_sched_job *job, struct dma_fence *fence)
> > >> Please keep the drm_sched_job_ prefix here even for static functions.
> > >> The symbol _add_dependency just sucks in a backtrace, especially when
> > >> it's tail optimized.
> > >>
> > >>>  {
> > >>>  	struct dma_fence *entry;
> > >>>  	unsigned long index;
> > >>>  	u32 id = 0;
> > >>>  	int ret;
> > >>>
> > >>> -	if (!fence)
> > >>> -		return 0;
> > >>> -
> > >>>  	/* Deduplicate if we already depend on a fence from the same context.
> > >>>  	 * This lets the size of the array of deps scale with the number of
> > >>>  	 * engines involved, rather than the number of BOs.
> > >>>  	 */
> > >>>  	xa_for_each(&job->dependencies, index, entry) {
> > >>>  		if (entry->context != fence->context)
> > >>>  			continue;
> > >>>
> > >>>  		if (dma_fence_is_later(fence, entry)) {
> > >>>  			dma_fence_put(entry);
> > >>> @@ -709,20 +696,46 @@ int drm_sched_job_add_dependency(struct drm_sched_job *job,
> > >>>  		}
> > >>>  		return 0;
> > >>>  	}
> > >>>
> > >>>  	ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b, GFP_KERNEL);
> > >>>  	if (ret != 0)
> > >>>  		dma_fence_put(fence);
> > >>>
> > >>>  	return ret;
> > >>>  }
> > >>> +
> > >>> +/**
> > >>> + * drm_sched_job_add_dependency - adds the fence as a job dependency
> > >>> + * @job: scheduler job to add the dependencies to
> > >>> + * @fence: the dma_fence to add to the list of dependencies.
> > >>> + *
> > >>> + * Note that @fence is consumed in both the success and error cases.
> > >>> + *
> > >>> + * Returns:
> > >>> + * 0 on success, or an error on failing to expand the array.
> > >>> + */
> > >>> +int drm_sched_job_add_dependency(struct drm_sched_job *job,
> > >>> +				 struct dma_fence *fence)
> > >> Maybe name the new function drm_sched_job_unwrap_add_dependency or
> > >> something like this.
> > >>
> > >> I need to double check, but I think for some cases we don't need or
> > >> don't even want this in the driver.
> > > I'd be curious to know the cases where you don't want this.. one thing
> > > I was thinking about, what if you have a container fence with two
> > > contained fences. One is on the same ctx as the job, one is not but
> > > signals sooner. You end up artificially waiting on both, which seems
> > > sub-optimal.
> >
> > Well resv objects don't contain other containers for example.
>
> I suppose I have the explicit sync case more in mind, where the
> dependent fence ends up being a chain or array (if userspace is
> merging fence fd's).
>
> > Then we also have a use case in amdgpu where fences need to be
> > explicitly waited for even when they are from the same ctx as the job
> > because otherwise we wouldn't see everything cache coherent.
>
> This was the kinda weird case I wanted to make sure I wasn't breaking.
> I remember seeing something fly by for this, but can't find it now or
> remember what amdgpu's solution was..
>
> > On the other hand we currently handle that amdgpu use case differently
> > and the extra overhead of unwrapping fences even if they can't be
> > containers is probably negligible.
> >
> > > Anyways, I can make this a new entrypoint which unwraps, and/or rename
> > > the internal static function, if we think this is a good idea.
> >
> > If you think that's unnecessary, keeping your original approach is fine
> > with me as well.
>
> I'm going to assume unnecessary until someone speaks up with their
> weird specific case ;-)

So, this patch turns out to blow up spectacularly with dma_fence refcnt
underflows when I enable DRIVER_SYNCOBJ_TIMELINE .. I think because it
starts unwrapping fence chains, possibly in parallel with fence
signaling on the retire path.  Is it supposed to be permissible to
unwrap a fence chain concurrently?
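
For reference, a minimal sketch of what the unwrap iteration yields for
a container, i.e. why a chain/array from a merged fence fd gets
flattened into its leaf fences before the per-context dedup runs.  This
is illustrative only, not part of the patch; show_unwrap() is a made-up
helper, assuming the linux/dma-fence-unwrap.h API:

	#include <linux/dma-fence.h>
	#include <linux/dma-fence-unwrap.h>
	#include <linux/printk.h>

	static void show_unwrap(struct dma_fence *maybe_container)
	{
		struct dma_fence_unwrap iter;
		struct dma_fence *f;

		/*
		 * dma_fence_unwrap_for_each() walks dma_fence_chain and
		 * dma_fence_array containers recursively and yields only
		 * leaf fences; a plain fence is yielded as-is.
		 */
		dma_fence_unwrap_for_each(f, &iter, maybe_container)
			pr_info("leaf fence ctx=%llu seqno=%llu\n",
				f->context, f->seqno);
	}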
BR,
-R

> BR,
> -R
>
> > Regards,
> > Christian.
> >
> > >
> > > BR,
> > > -R
> > >
> > >> Christian.
> > >>
> > >>> +{
> > >>> +	struct dma_fence_unwrap iter;
> > >>> +	struct dma_fence *f;
> > >>> +	int ret = 0;
> > >>> +
> > >>> +	dma_fence_unwrap_for_each (f, &iter, fence) {
> > >>> +		ret = _add_dependency(job, f);
> > >>> +		if (ret)
> > >>> +			break;
> > >>> +	}
> > >>> +
> > >>> +	return ret;
> > >>> +}
> > >>>  EXPORT_SYMBOL(drm_sched_job_add_dependency);
> > >>>
> > >>>  /**
> > >>>   * drm_sched_job_add_resv_dependencies - add all fences from the resv to the job
> > >>>   * @job: scheduler job to add the dependencies to
> > >>>   * @resv: the dma_resv object to get the fences from
> > >>>   * @usage: the dma_resv_usage to use to filter the fences
> > >>>   *
> > >>>   * This adds all fences matching the given usage from @resv to @job.
> > >>>   * Must be called with the @resv lock held.
> >