Re: [RFC] drm/msm: Add UABI to request perfcntr usage

Rob Clark <robdclark@xxxxxxxxx> · Mon, 16 Dec 2024 09:16:25 -0800

On Mon, Dec 16, 2024 at 8:59 AM Connor Abbott <cwabbott0@xxxxxxxxx> wrote:
>
> On Mon, Dec 16, 2024 at 11:55 AM Akhil P Oommen
> <quic_akhilpo@xxxxxxxxxxx> wrote:
> >
> > On 12/13/2024 10:40 PM, Antonino Maniscalco wrote:
> > > On 12/13/24 5:50 PM, Akhil P Oommen wrote:
> > >> On 12/12/2024 9:44 PM, Antonino Maniscalco wrote:
> > >>> On 12/12/24 4:58 PM, Akhil P Oommen wrote:
> > >>>> On 12/5/2024 10:24 PM, Rob Clark wrote:
> > >>>>> From: Rob Clark <robdclark@xxxxxxxxxxxx>
> > >>>>>
> > >>>>> Performance counter usage falls into two categories:
> > >>>>>
> > >>>>> 1. Local usage, where the counter configuration, start, and end read
> > >>>>>      happen within (locally to) a single SUBMIT.  In this case,
> > >>>>> there is
> > >>>>>      no dependency on counter configuration or values between submits,
> > >>>>> and
> > >>>>>      in fact counters are normally cleared on context switches,
> > >>>>> making it
> > >>>>>      impossible to rely on cross-submit state.
> > >>>>>
> > >>>>> 2. Global usage, where a single privilaged daemon/process is sampling
> > >>>>>      counter values across all processes for profiling.
> > >>>>>
> > >>>>> The two categories are mutually exclusive.  While you can have many
> > >>>>> processes making local counter usage, you cannot combine global and
> > >>>>> local usage without the two stepping on each others feet (by changing
> > >>>>> counter configuration).
> > >>>>>
> > >>>>> For global counter usage, there is already a SYSPROF param (since
> > >>>>> global
> > >>>>> counter usage requires disabling counter clearing on context switch).
> > >>>>> This patch adds a REQ_CNTRS param to request local counter usage.  If
> > >>>>> one or more processes has requested counter usage, then a SYSPROF
> > >>>>> request will fail with -EBUSY.  And if SYSPROF is active, then
> > >>>>> REQ_CNTRS
> > >>>>> will fail with -EBUSY, maintaining the mutual exclusivity.
> > >>>>>
> > >>>>> This is purely an advisory interface to help coordinate userspace.
> > >>>>> There is no real means of enforcement, but the worst that can
> > >>>>> happen if
> > >>>>> userspace ignores a REQ_CNTRS failure is that you'll get nonsense
> > >>>>> profiling data.
> > >>>>>
> > >>>>> Signed-off-by: Rob Clark <robdclark@xxxxxxxxxxxx>
> > >>>>> ---
> > >>>>> kgsl takes a different approach, which involves a lot more UABI for
> > >>>>> assigning counters to different processes.  But I think by taking
> > >>>>> advantage of the fact that mesa (freedreno+turnip) reconfigure the
> > >>>>> counters they need in each SUBMIT, for their respective gl/vk perf-
> > >>>>> counter extensions, we can take this simpler approach.
> > >>>>
> > >>>> KGSL's approach is preemption and ifpc safe (also whatever HW changes
> > >>>> that will come up in future generations). How will we ensure that here?
> > >>>>
> > >>>> I have plans to bring up IFPC support in near future. Also, I
> > >>>> brought up
> > >>>> this point during preemption series. But from the responses, I felt
> > >>>> that
> > >>>> profiling was not considered a serious usecase. Still I wonder how the
> > >>>> perfcounter extensions work accurately with preemption.
> > >>>
> > >>> So back then I implemented the postamble IB to clear perf counters and
> > >>> that gets disabled when sysprof (so global usage) is happening. The
> > >>> kernel is oblivious to "Local isage" of profiling but in that case
> > >>> really what we want to do is disable preemption which in my
> > >>> understanding can be done from userspace with a PKT. In my understanding
> > >>> this had us covered for all usecases.
> > >>
> > >> I think this wasn't mentioned at that time. Which UMD PKT in a6x+ did
> > >> you mean?
> > >
> > > Ah, I thought it wasmentioned, sorry.
> > > The packet I was referring to is:
> > >     <doc> Make next dword 1 to disable preemption, 0 to re-enable it. </
> > > doc>
> > >     <value name="CP_PREEMPT_DISABLE" value="0x6c" variants="A6XX"/>
> >
> > Ah! Okay. I think this packet is not used by the downstream blob. IMO,
> > disabling preemption is still a suboptimal solution.
>
> Downstream doesn't expose userspace perfcounters (i.e.
> VK_KHR_performance_query) at all. My understanding is that Android
> requires you not to expose them for security reasons, but I could be
> wrong there. In any case, Qualcomm clearly hasn't really thought
> through what it would take to make everything work well with userspace
> perfcounters and hasn't implemented the necessary firmware bits for
> that, so we're left muddling through and doing what we can.

That is correct, VK_KHR_performance_query is disallowed on android.
There is an android CTS test for that.

BR,
-R

>
> Connor
>
> >
> > >
> > > BTW you mentioned wanting to look into IFPC. Since I too wanted to look
> > > into implementing it wonder if you could let me know when you planned on
> > > working on it.
> >
> > I have few patches in progress. Nothing final yet and need verification
> > on the hw side. Also, I need to do some housekeeping here to debug gmu
> > issues since IFPC increases the probability of those a lot.
> >
> > I will try to send out the patches very soon.
> >
> > -Akhil.
> >
> > >
> > >>
> > >> -Akhil.
> > >>
> > >>>
> > >>> So what would you expect instead we should do kernel side to make
> > >>> profiling preemption safe?
> > >>>
> > >>>>
> > >>>> -Akhil
> > >>>>
> > >>>>>
> > >>>>>    drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +
> > >>>>>    drivers/gpu/drm/msm/msm_drv.c           |  5 ++-
> > >>>>>    drivers/gpu/drm/msm/msm_gpu.c           |  1 +
> > >>>>>    drivers/gpu/drm/msm/msm_gpu.h           | 29 +++++++++++++-
> > >>>>>    drivers/gpu/drm/msm/msm_submitqueue.c   | 52 +++++++++++++++++++
> > >>>>> +++++-
> > >>>>>    include/uapi/drm/msm_drm.h              |  1 +
> > >>>>>    6 files changed, 85 insertions(+), 5 deletions(-)
> > >>>>>
> > >>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/
> > >>>>> drm/msm/adreno/adreno_gpu.c
> > >>>>> index 31bbf2c83de4..f688e37059b8 100644
> > >>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > >>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > >>>>> @@ -441,6 +441,8 @@ int adreno_set_param(struct msm_gpu *gpu, struct
> > >>>>> msm_file_private *ctx,
> > >>>>>            if (!capable(CAP_SYS_ADMIN))
> > >>>>>                return UERR(EPERM, drm, "invalid permissions");
> > >>>>>            return msm_file_private_set_sysprof(ctx, gpu, value);
> > >>>>> +    case MSM_PARAM_REQ_CNTRS:
> > >>>>> +        return msm_file_private_request_counters(ctx, gpu, value);
> > >>>>>        default:
> > >>>>>            return UERR(EINVAL, drm, "%s: invalid param: %u", gpu-
> > >>>>>> name, param);
> > >>>>>        }
> > >>>>> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/
> > >>>>> msm_drv.c
> > >>>>> index 6416d2cb4efc..bf8314ff4a25 100644
> > >>>>> --- a/drivers/gpu/drm/msm/msm_drv.c
> > >>>>> +++ b/drivers/gpu/drm/msm/msm_drv.c
> > >>>>> @@ -377,9 +377,12 @@ static void msm_postclose(struct drm_device
> > >>>>> *dev, struct drm_file *file)
> > >>>>>         * It is not possible to set sysprof param to non-zero if gpu
> > >>>>>         * is not initialized:
> > >>>>>         */
> > >>>>> -    if (priv->gpu)
> > >>>>> +    if (ctx->sysprof)
> > >>>>>            msm_file_private_set_sysprof(ctx, priv->gpu, 0);
> > >>>>>    +    if (ctx->counters_requested)
> > >>>>> +        msm_file_private_request_counters(ctx, priv->gpu, 0);
> > >>>>> +
> > >>>>>        context_close(ctx);
> > >>>>>    }
> > >>>>>    diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/
> > >>>>> msm_gpu.c
> > >>>>> index 82f204f3bb8f..013b59ca3bb1 100644
> > >>>>> --- a/drivers/gpu/drm/msm/msm_gpu.c
> > >>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> > >>>>> @@ -991,6 +991,7 @@ int msm_gpu_init(struct drm_device *drm, struct
> > >>>>> platform_device *pdev,
> > >>>>>        gpu->nr_rings = nr_rings;
> > >>>>>          refcount_set(&gpu->sysprof_active, 1);
> > >>>>> +    refcount_set(&gpu->local_counters_active, 1);
> > >>>>>          return 0;
> > >>>>>    diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/
> > >>>>> msm_gpu.h
> > >>>>> index e25009150579..83c61e523b1b 100644
> > >>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
> > >>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> > >>>>> @@ -195,12 +195,28 @@ struct msm_gpu {
> > >>>>>        int nr_rings;
> > >>>>>          /**
> > >>>>> -     * sysprof_active:
> > >>>>> +     * @sysprof_active:
> > >>>>>         *
> > >>>>> -     * The count of contexts that have enabled system profiling.
> > >>>>> +     * The count of contexts that have enabled system profiling plus
> > >>>>> one.
> > >>>>> +     *
> > >>>>> +     * Note: refcount_t does not like 0->1 transitions.. we want to
> > >>>>> keep
> > >>>>> +     * the under/overflow checks that refcount_t provides, but allow
> > >>>>> +     * multiple on/off transitions so we track the logical value
> > >>>>> plus one.)
> > >>>>>         */
> > >>>>>        refcount_t sysprof_active;
> > >>>>>    +    /**
> > >>>>> +     * @local_counters_active:
> > >>>>> +     *
> > >>>>> +     * The count of contexts that have requested local (intra-submit)
> > >>>>> +     * performance counter usage plus one.
> > >>>>> +     *
> > >>>>> +     * Note: refcount_t does not like 0->1 transitions.. we want to
> > >>>>> keep
> > >>>>> +     * the under/overflow checks that refcount_t provides, but allow
> > >>>>> +     * multiple on/off transitions so we track the logical value
> > >>>>> plus one.)
> > >>>>> +     */
> > >>>>> +    refcount_t local_counters_active;
> > >>>>> +
> > >>>>>        /**
> > >>>>>         * lock:
> > >>>>>         *
> > >>>>> @@ -383,6 +399,13 @@ struct msm_file_private {
> > >>>>>         */
> > >>>>>        int sysprof;
> > >>>>>    +    /**
> > >>>>> +     * @counters_requested:
> > >>>>> +     *
> > >>>>> +     * Has the context requested local perfcntr usage.
> > >>>>> +     */
> > >>>>> +    bool counters_requested;
> > >>>>> +
> > >>>>>        /**
> > >>>>>         * comm: Overridden task comm, see MSM_PARAM_COMM
> > >>>>>         *
> > >>>>> @@ -626,6 +649,8 @@ void msm_submitqueue_destroy(struct kref *kref);
> > >>>>>      int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> > >>>>>                     struct msm_gpu *gpu, int sysprof);
> > >>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
> > >>>>> +                      struct msm_gpu *gpu, int reqcntrs);
> > >>>>>    void __msm_file_private_destroy(struct kref *kref);
> > >>>>>      static inline void msm_file_private_put(struct msm_file_private
> > >>>>> *ctx)
> > >>>>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/
> > >>>>> msm/msm_submitqueue.c
> > >>>>> index 7fed1de63b5d..1e1e21e6f7ae 100644
> > >>>>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
> > >>>>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
> > >>>>> @@ -10,6 +10,15 @@
> > >>>>>    int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> > >>>>>                     struct msm_gpu *gpu, int sysprof)
> > >>>>>    {
> > >>>>> +    int ret = 0;
> > >>>>> +
> > >>>>> +    mutex_lock(&gpu->lock);
> > >>>>> +
> > >>>>> +    if (sysprof && (refcount_read(&gpu->local_counters_active) >
> > >>>>> 1)) {
> > >>>>> +        ret = UERR(EBUSY, gpu->dev, "Local counter usage active");
> > >>>>> +        goto out_unlock;
> > >>>>> +    }
> > >>>>> +
> > >>>>>        /*
> > >>>>>         * Since pm_runtime and sysprof_active are both refcounts, we
> > >>>>>         * call apply the new value first, and then unwind the previous
> > >>>>> @@ -18,7 +27,8 @@ int msm_file_private_set_sysprof(struct
> > >>>>> msm_file_private *ctx,
> > >>>>>          switch (sysprof) {
> > >>>>>        default:
> > >>>>> -        return UERR(EINVAL, gpu->dev, "Invalid sysprof: %d",
> > >>>>> sysprof);
> > >>>>> +        ret = UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
> > >>>>> +        goto out_unlock;
> > >>>>>        case 2:
> > >>>>>            pm_runtime_get_sync(&gpu->pdev->dev);
> > >>>>>            fallthrough;
> > >>>>> @@ -43,7 +53,45 @@ int msm_file_private_set_sysprof(struct
> > >>>>> msm_file_private *ctx,
> > >>>>>          ctx->sysprof = sysprof;
> > >>>>>    -    return 0;
> > >>>>> +out_unlock:
> > >>>>> +    mutex_unlock(&gpu->lock);
> > >>>>> +
> > >>>>> +    return ret;
> > >>>>> +}
> > >>>>> +
> > >>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
> > >>>>> +                      struct msm_gpu *gpu, int reqctrs)
> > >>>>> +{
> > >>>>> +    int ret = 0;
> > >>>>> +
> > >>>>> +    mutex_lock(&gpu->lock);
> > >>>>> +
> > >>>>> +    if (reqctrs && (refcount_read(&gpu->sysprof_active) > 1)) {
> > >>>>> +        ret = UERR(EBUSY, gpu->dev, "System profiling active");
> > >>>>> +        goto out_unlock;
> > >>>>> +    }
> > >>>>> +
> > >>>>> +    if (reqctrs) {
> > >>>>> +        if (ctx->counters_requested) {
> > >>>>> +            ret = UERR(EINVAL, gpu->dev, "Already requested");
> > >>>>> +            goto out_unlock;
> > >>>>> +        }
> > >>>>> +
> > >>>>> +        ctx->counters_requested = true;
> > >>>>> +        refcount_inc(&gpu->local_counters_active);
> > >>>>> +    } else {
> > >>>>> +        if (!ctx->counters_requested) {
> > >>>>> +            ret = UERR(EINVAL, gpu->dev, "Not requested");
> > >>>>> +            goto out_unlock;
> > >>>>> +        }
> > >>>>> +        refcount_dec(&gpu->local_counters_active);
> > >>>>> +        ctx->counters_requested = false;
> > >>>>> +    }
> > >>>>> +
> > >>>>> +out_unlock:
> > >>>>> +    mutex_unlock(&gpu->lock);
> > >>>>> +
> > >>>>> +    return ret;
> > >>>>>    }
> > >>>>>      void __msm_file_private_destroy(struct kref *kref)
> > >>>>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
> > >>>>> index 2342cb90857e..ae7fb355e4a1 100644
> > >>>>> --- a/include/uapi/drm/msm_drm.h
> > >>>>> +++ b/include/uapi/drm/msm_drm.h
> > >>>>> @@ -91,6 +91,7 @@ struct drm_msm_timespec {
> > >>>>>    #define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */
> > >>>>>    #define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */
> > >>>>>    #define MSM_PARAM_UCHE_TRAP_BASE 0x14 /* RO */
> > >>>>> +#define MSM_PARAM_REQ_CNTRS  0x15 /* WO: request "local" (intra-
> > >>>>> submit) perfcntr usage  */
> > >>>>>      /* For backwards compat.  The original support for preemption was
> > >>>>> based on
> > >>>>>     * a single ring per priority level so # of priority levels equals
> > >>>>> the #
> > >>>>
> > >>>
> > >>>
> > >>> Best regards,
> > >>
> > >
> > >
> > > Best regards,
> >