Re: [PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

Alex Deucher <alexdeucher@xxxxxxxxx> · Thu, 10 Feb 2022 10:59:48 -0500

On Thu, Feb 10, 2022 at 9:11 AM Sharma, Shashank
<shashank.sharma@xxxxxxx> wrote:
>
>
>
> On 2/10/2022 3:05 PM, Christian König wrote:
> > Am 10.02.22 um 14:18 schrieb Sharma, Shashank:
> >>
> >>
> >> On 2/10/2022 8:38 AM, Christian König wrote:
> >>> Am 10.02.22 um 08:34 schrieb Somalapuram, Amaranath:
> >>>>
> >>>> On 2/10/2022 12:39 PM, Christian König wrote:
> >>>>> Am 10.02.22 um 06:29 schrieb Somalapuram, Amaranath:
> >>>>>>
> >>>>>> On 2/9/2022 1:17 PM, Christian König wrote:
> >>>>>>> Am 08.02.22 um 16:28 schrieb Alex Deucher:
> >>>>>>>> On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
> >>>>>>>> <Amaranath.Somalapuram@xxxxxxx> wrote:
> >>>>>>>>> Dump the list of register values to trace event on GPU reset.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Somalapuram Amaranath
> >>>>>>>>> <Amaranath.Somalapuram@xxxxxxx>
> >>>>>>>>> ---
> >>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21
> >>>>>>>>> ++++++++++++++++++++-
> >>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19
> >>>>>>>>> +++++++++++++++++++
> >>>>>>>>>   2 files changed, 39 insertions(+), 1 deletion(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>>>> index 1e651b959141..057922fb7e37 100644
> >>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>>>> @@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct
> >>>>>>>>> amdgpu_device *adev,
> >>>>>>>>>          return r;
> >>>>>>>>>   }
> >>>>>>>>>
> >>>>>>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
> >>>>>>>>> +{
> >>>>>>>>> +       int i;
> >>>>>>>>> +       uint32_t reg_value[128];
> >>>>>>>>> +
> >>>>>>>>> +       for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
> >>>>>>>>> +               if (adev->asic_type >= CHIP_NAVI10)
> >>>>>>>> This check should be against CHIP_VEGA10.  Also, this only
> >>>>>>>> allows for
> >>>>>>>> GC registers.  If we wanted to dump other registers, we'd need a
> >>>>>>>> different macro.  Might be better to just use RREG32 here for
> >>>>>>>> everything and then encode the full offset using
> >>>>>>>> SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to
> >>>>>>>> think
> >>>>>>>> about how to handle gfxoff in this case.  gfxoff needs to be
> >>>>>>>> disabled
> >>>>>>>> or we'll hang the chip if we try and read GC or SDMA registers via
> >>>>>>>> MMIO which will adversely affect the hang signature.
> >>>>>>>
> >>>>>>> Well this should execute right before a GPU reset, so I think it
> >>>>>>> shouldn't matter if we hang the chip or not as long as the read
> >>>>>>> comes back correctly (I remember a very long UVD debug session
> >>>>>>> because of this).
> >>>>>>>
> >>>>>>> But in general I agree, we should just use RREG32() here and
> >>>>>>> always encode the full register offset.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Christian.
> >>>>>>>
> >>>>>> Can I use something like this:
> >>>>>>
> >>>>>> +                       reg_value[i] =
> >>>>>> RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]
> >>>>>> + [adev->reset_dump_reg_list[i][1]]
> >>>>>> + [adev->reset_dump_reg_list[i][2]])
> >>>>>> +                                 + adev->reset_dump_reg_list[i][3]);
> >>>>>>
> >>>>>> ip --> adev->reset_dump_reg_list[i][0]
> >>>>>>
> >>>>>> inst --> adev->reset_dump_reg_list[i][1]
> >>>>>>
> >>>>>> BASE_IDX--> adev->reset_dump_reg_list[i][2]
> >>>>>>
> >>>>>> reg --> adev->reset_dump_reg_list[i][3]
> >>>>>
> >>>>> No, that won't work.
> >>>>>
> >>>>> What you need to do is to use the full 32bit address of the
> >>>>> register. Userspace can worry about figuring out which ip,
> >>>>> instance, base and reg to resolve into that address.
> >>>>>
> >>>>> Regards,
> >>>>> Christian.
> >>>>>
> >>>> Thanks Christian.
> >>>>
> >>>> should I consider using gfxoff like below code or not required:
> >>>> amdgpu_gfx_off_ctrl(adev, false);
> >>>> amdgpu_gfx_off_ctrl(adev, true);
> >>>
> >>> That's a really good question I can't fully answer.
> >>>
> >>> I think we don't want that because the GPU is stuck when the dump is
> >>> made, but better let Alex comment as well.
> >>>
> >>> Regards,
> >>> Christian.
> >>
> >>
> >> I had a quick look at the function amdgpu_gfx_off_ctrl, and it locks
> >> this mutex internally:
> >> mutex_lock(&adev->gfx.gfx_off_mutex);
> >>
> >> and the reference state is tracked in:
> >> adev->gfx.gfx_off_state
> >>
> >> We can do something like this maybe:
> >> - If (adev->gfx_off_state == 0) {
> >>   trylock(gfx_off_mutex)
> >>   read_regs_now;
> >>   unlock_mutex();
> >> }
> >>
> >> How does it sounds ?
> >
> > As far as I know that won't work. GFX_off is only disabled intentionally
> > in very few places.
> >
> > So we will probably never get a register trace with that.
> >
>
> Ok, I don't know much about this feature, but due to the name I was
> udner the impression that gfx_off will be mostly disabled. But if it is
> hardly ever disabled, we need more infomrmation about it first, like
> when is this disabled and why ?
>
> Alex ?

gfxoff is enabled most of the time.  The only times we disable it are
for certain features (reading gfx or sdma MMIO registers, using SQTT,
etc.).  The two main cases where it would get disabled are calling the
READ_MMR query to the INFO ioctl and if you change the power state to
a profiling mode (either via sysfs or the context ioctl).  If you
don't disable gfx off and you access a register via MMIO, the GFX
block will hang.  So depending on the the nature of the hang, gfxoff
may or may not be a problem.  E.g., if the hang is caused by VCN or
some power management thing gfxoff may still kick in and power down
the gfx block.  If the gfx block itself is hung, then most likely
gfxoff would not be active since the gfx block is active (and hung).
So if we didn't disable gfxoff and it is active, and then we try and
read the registers, we'll hang the gfx block.  That's probably fine
since we'll be reseting the GPU anyway, but it will mean any registers
in the dump will be worthless.

Alex

>
> - Shashank
>
> > Regards,
> > Christian.
> >
> >>
> >> - Shashank
> >>>
> >>>>
> >>>>
> >>>> Regards,
> >>>>
> >>>> S.Amarnath
> >>>>>>
> >>>>>> which requires 4 values in user space for each register.
> >>>>>>
> >>>>>> using any existing macro like RREG32_SOC15** will not be able to
> >>>>>> pass proper argument from user space (like ip##_HWIP or
> >>>>>> reg##_BASE_IDX)
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Alex
> >>>>>>>>
> >>>>>>>>> + reg_value[i] = RREG32_SOC15_IP(GC,
> >>>>>>>>> adev->reset_dump_reg_list[i]);
> >>>>>>>>> +               else
> >>>>>>>>> +                       reg_value[i] =
> >>>>>>>>> RREG32(adev->reset_dump_reg_list[i]);
> >>>>>>>>> +       }
> >>>>>>>>> +
> >>>>>>>>> + trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list,
> >>>>>>>>> reg_value, i);
> >>>>>>>>> +
> >>>>>>>>> +       return 0;
> >>>>>>>>> +}
> >>>>>>>>> +
> >>>>>>>>>   int amdgpu_do_asic_reset(struct list_head *device_list_handle,
> >>>>>>>>>                           struct amdgpu_reset_context
> >>>>>>>>> *reset_context)
> >>>>>>>>>   {
> >>>>>>>>> @@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct
> >>>>>>>>> list_head *device_list_handle,
> >>>>>>>>> tmp_adev->gmc.xgmi.pending_reset = false;
> >>>>>>>>>                                  if
> >>>>>>>>> (!queue_work(system_unbound_wq, &tmp_adev->xgmi_reset_work))
> >>>>>>>>>                                          r = -EALREADY;
> >>>>>>>>> -                       } else
> >>>>>>>>> +                       } else {
> >>>>>>>>> + amdgpu_reset_reg_dumps(tmp_adev);
> >>>>>>>>>                                  r = amdgpu_asic_reset(tmp_adev);
> >>>>>>>>> +                       }
> >>>>>>>>>
> >>>>>>>>>                          if (r) {
> >>>>>>>>> dev_err(tmp_adev->dev, "ASIC reset failed with error, %d for
> >>>>>>>>> drm dev, %s",
> >>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> >>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> >>>>>>>>> index d855cb53c7e0..3fe33de3564a 100644
> >>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> >>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> >>>>>>>>> @@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
> >>>>>>>>>                        __entry->seqno)
> >>>>>>>>>   );
> >>>>>>>>>
> >>>>>>>>> +TRACE_EVENT(amdgpu_reset_reg_dumps,
> >>>>>>>>> +           TP_PROTO(long *address, uint32_t *value, int length),
> >>>>>>>>> +           TP_ARGS(address, value, length),
> >>>>>>>>> +           TP_STRUCT__entry(
> >>>>>>>>> +                            __array(long, address, 128)
> >>>>>>>>> +                            __array(uint32_t, value, 128)
> >>>>>>>>> +                            __field(int, len)
> >>>>>>>>> +                            ),
> >>>>>>>>> +           TP_fast_assign(
> >>>>>>>>> + memcpy(__entry->address, address, 128);
> >>>>>>>>> +                          memcpy(__entry->value, value, 128);
> >>>>>>>>> +                          __entry->len = length;
> >>>>>>>>> +                          ),
> >>>>>>>>> +           TP_printk("amdgpu register dump offset: %s value:
> >>>>>>>>> %s ",
> >>>>>>>>> + __print_array(__entry->address, __entry->len, 8),
> >>>>>>>>> + __print_array(__entry->value, __entry->len, 8)
> >>>>>>>>> +                    )
> >>>>>>>>> +);
> >>>>>>>>> +
> >>>>>>>>>   #undef AMDGPU_JOB_GET_TIMELINE_NAME
> >>>>>>>>>   #endif
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> 2.25.1
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>
> >