Re: [PATCH] drm/amdgpu: add the command AMDGPU_INFO_QUEUE_RESET to query queue reset

Alex Deucher <alexdeucher@xxxxxxxxx> · Fri, 18 Oct 2024 09:23:00 -0400

On Fri, Oct 18, 2024 at 7:19 AM Christian König
<christian.koenig@xxxxxxx> wrote:
>
> Am 18.10.24 um 11:33 schrieb Zhang, Jesse(Jie):
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Hi Christian,
> >
> > -----Original Message-----
> > From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> > Sent: Friday, October 18, 2024 4:47 PM
> > To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> > Subject: Re: [PATCH] drm/amdgpu: add the command AMDGPU_INFO_QUEUE_RESET to query queue reset
> >
> > Am 18.10.24 um 10:19 schrieb Jesse.zhang@xxxxxxx:
> >> Not all ASICs support the queue reset feature.
> >> Therefore, userspace can query this feature via
> >> AMDGPU_INFO_QUEUE_RESET before validating a queue reset.
> > Why would UMDs need that information?
> >
> > To verify queue reset.
> > Now the igt uses many asic filters to hard code if queue reset is ready.
> > Alex suggested that we can get the information directly from the driver.
>
> Ah, ok got it. Mhm in general sounds like the approach is cleaner, but
> the IOCTL interface is supposed to be used by the UMD and tested by the
> IGT tests.
>
> The problem is now that it's documented that the justification for
> having the IOCTLs is the UMD and not the IGT tests.
>
> Could we also do this as debugfs interface?
>
> > Another problem is that our driver has added code about queue reset.
> > The reset function is complete (adev->gfx.gfx_ring[0].funcs->reset),
> > but fw partially supports it.
> > For example navi31,   KCQ invalid opcode case can be recovered by queue reset,
> > but KCQ invalid packet length cannot be recovered now.
> >
> > So for this case, I am not sure whether we can return true for this function.
> >
> > More information can be obtained from the link:
> > https://confluence.amd.com/display/AMDGPU/Phase+1+-+Validation+of+Per+Queue+Reset+for+Kernel+Queue
>
> Oh, yeah good question. Alex should probably answer that.

Maybe a better option would be a reset_mask sysfs file for each IP
which we could have a bit for each reset type.  E.g.,

gc_reset_mask
sdma_reset_mask
vcn_reset_mask
jpeg_reset_mask
vpe_reset_mask

AMDGPU_RESET_TYPE_FULL (1 << 0) /* full adapter reset, mode1/mode2/BACO/etc. */
AMDGPU_RESET_TYPE_SOFT_RESET (1 << 1) /* IP level soft reset */
AMDGPU_RESET_TYPE_PER_QUEUE (1 << 2) /* per queue */
AMDGPU_RESET_TYPE_PER_PIPE (1 << 3) /* per pipe */

Then the IGT tests could query the mask and determine which tests to run.

Alex

>
> Regards,
> Christian.
>
> >
> > Thanks
> > Jesse
> >> Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
> >> ---
> >>    drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 27 +++++++++++++++++++++++++
> >>    include/uapi/drm/amdgpu_drm.h           |  2 ++
> >>    2 files changed, 29 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> >> index b53c35992152..87dee858fb4c 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> >> @@ -577,6 +577,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
> >>        uint64_t ui64 = 0;
> >>        int i, found, ret;
> >>        int ui32_size = sizeof(ui32);
> >> +     bool queue_reset;
> >>
> >>        if (!info->return_size || !info->return_pointer)
> >>                return -EINVAL;
> >> @@ -1282,6 +1283,32 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
> >>                return copy_to_user(out, &gpuvm_fault,
> >>                                    min((size_t)size, sizeof(gpuvm_fault))) ? -EFAULT : 0;
> >>        }
> >> +     case AMDGPU_INFO_QUEUE_RESET: {
> >> +             fpriv = (struct amdgpu_fpriv *)filp->driver_priv;
> >> +             type = amdgpu_ip_get_block_type(adev, info->query_hw_ip.type);
> >> +             ip_block = amdgpu_device_ip_get_ip_block(adev, type);
> >> +
> >> +             if (!ip_block || !ip_block->status.valid)
> >> +                     return -EINVAL;
> >> +
> >> +             switch (info->query_hw_ip.type) {
> >> +             case AMDGPU_HW_IP_GFX:
> >> +                     queue_reset = adev->gfx.gfx_ring[0].funcs->reset ? true : false;
> > Please never ever use this coding style.
> >
> > If you want to convert anything into a boolean use the !! operator.
> >
> > Regards,
> > Christian.
> >
> >> +                     break;
> >> +             case AMDGPU_HW_IP_COMPUTE:
> >> +                     queue_reset = adev->gfx.compute_ring[0].funcs->reset ? true : false;
> >> +                     break;
> >> +             case AMDGPU_HW_IP_DMA:
> >> +                     queue_reset = adev->sdma.instance[0].ring.funcs->reset ? true : false;
> >> +                     break;
> >> +             case AMDGPU_HW_IP_UVD_ENC:
> >> +             case AMDGPU_HW_IP_VCN_DEC:
> >> +             default:
> >> +                     queue_reset = false;
> >> +             }
> >> +
> >> +             return copy_to_user(out, &queue_reset, min(size, 4u)) ? -EFAULT : 0;
> >> +     }
> >>        default:
> >>                DRM_DEBUG_KMS("Invalid request %d\n", info->query);
> >>                return -EINVAL;
> >> diff --git a/include/uapi/drm/amdgpu_drm.h
> >> b/include/uapi/drm/amdgpu_drm.h index d9bff1c3b326..3b17d82fd1ee
> >> 100644
> >> --- a/include/uapi/drm/amdgpu_drm.h
> >> +++ b/include/uapi/drm/amdgpu_drm.h
> >> @@ -1052,6 +1052,8 @@ struct drm_amdgpu_cs_chunk_cp_gfx_shadow {
> >>    #define AMDGPU_INFO_MAX_IBS                 0x22
> >>    /* query last page fault info */
> >>    #define AMDGPU_INFO_GPUVM_FAULT                     0x23
> >> +/* query queue reset */
> >> +#define AMDGPU_INFO_QUEUE_RESET                      0x24
> >>
> >>    #define AMDGPU_INFO_MMR_SE_INDEX_SHIFT      0
> >>    #define AMDGPU_INFO_MMR_SE_INDEX_MASK       0xff
>