Re: [PATCH 2/2] drm/amdgpu: trigger ip dump before suspend of IP's

"Lazar, Lijo" <lijo.lazar@xxxxxxx> · Mon, 29 Jul 2024 10:08:32 +0530

On 7/27/2024 12:51 AM, Khatri, Sunil wrote:
> 
> On 7/27/2024 12:13 AM, Alex Deucher wrote:
>> On Fri, Jul 26, 2024 at 1:16 PM Khatri, Sunil <sukhatri@xxxxxxx> wrote:
>>>
>>> On 7/26/2024 8:36 PM, Lazar, Lijo wrote:
>>>> On 7/26/2024 8:11 PM, Khatri, Sunil wrote:
>>>>> On 7/26/2024 7:53 PM, Khatri, Sunil wrote:
>>>>>> On 7/26/2024 7:18 PM, Lazar, Lijo wrote:
>>>>>>> On 7/26/2024 6:42 PM, Alex Deucher wrote:
>>>>>>>> On Fri, Jul 26, 2024 at 8:48 AM Sunil Khatri <sunil.khatri@xxxxxxx>
>>>>>>>> wrote:
>>>>>>>>> Problem:
>>>>>>>>> IP dump right now is done post suspend of
>>>>>>>>> all IP's which for some IP's could change power
>>>>>>>>> state and software state too which we do not want
>>>>>>>>> to reflect in the dump as it might not be same at
>>>>>>>>> the time of hang.
>>>>>>>>>
>>>>>>>>> Solution:
>>>>>>>>> IP should be dumped as close to the HW state when
>>>>>>>>> the GPU was in hung state without trying to reinitialize
>>>>>>>>> any resource.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Sunil Khatri <sunil.khatri@xxxxxxx>
>>>>>>>> Acked-by: Alex Deucher <alexander.deucher@xxxxxxx>
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60
>>>>>>>>> +++++++++++-----------
>>>>>>>>>     1 file changed, 30 insertions(+), 30 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> index 730dae77570c..74f6f15e73b5 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> @@ -5277,11 +5277,29 @@ int amdgpu_device_mode1_reset(struct
>>>>>>>>> amdgpu_device *adev)
>>>>>>>>>            return ret;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>>>>>> +{
>>>>>>>>> +       int i;
>>>>>>>>> +
>>>>>>>>> +       lockdep_assert_held(&adev->reset_domain->sem);
>>>>>>>>> +
>>>>>>>>> +       for (i = 0; i < adev->reset_info.num_regs; i++) {
>>>>>>>>> +               adev->reset_info.reset_dump_reg_value[i] =
>>>>>>>>> +
>>>>>>>>> RREG32(adev->reset_info.reset_dump_reg_list[i]);
>>>>>>> A suspend also involves power/clock ungate. When reg dump is moved
>>>>>>> earlier, I'm not sure if this read works for all. If it's left to
>>>>>>> individual IP call backs, they could just do the same or better
>>>>>>> to move
>>>>>>> these up before a dump.
>>>>>> Suspend also put the status.hw = false and each IP in their
>>>>>> respective
>>>>>> suspend state which i feel does change the state of the HW.
>>>>>> To get the correct snapshot of the GPU register we should not be
>>>>>> fiddling with the HW IP at least till we capture the dump and that is
>>>>>> the intention behind the change.
>>>>>>
>>>>>> Do you think there is a problem in this approach?
>>>>>>>            amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>>>>>>>            amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>>>>>> Adding this does sounds better to enable just before we dump the
>>>>>> registers but i am not very sure if ungating would be clean here or
>>>>>> not. i Could try quickly adding these two calls just before dump.
>>>>>>
>>>>>> All i am worried if it does change some register reflecting the
>>>>>> original state of registers at dump.
>>>>>>
>>>> I was thinking that if it includes some GFX regs and the hang happened
>>>> because of some SDMA/VCN jobs which somehow keeps GFXOFF state intact.
>>> For GFX and SDMA hangs we make sure to disable GFXOFF before so for
>>> those IP's
>>> I am not worried and also based on my testing on NAVI22 for GPU hang
>>> their registers
>>> seems to be read cleanly.
>>> Another point that i was thinking is after a GPU hang no where till the
>>> point of dump
>>> any registers are touched and that is what we need considering we are
>>> able to read the registers.
>>>
>>> For VCN there is dynamic gating which is controlled by the FW if i am
>>> not wrong which makes the VCN
>>> off and registers cant be read. Only in case of VCN hang i am assuming
>>> due to a Job pending VCN should be in power ON
>>> state and out intent of reading the registers should work fine. Sadly i
>>> am unable to validate it as there arent any test readily
>>> available to hang VCN.
>> We want to take the register dump as early as possible in the reset
>> sequence, ideally before any of the hw gets touched as part of the
>> reset sequence.  Otherwise, the dump is not going to be useful.
>>
>> Alex
> 
> Sure Alex. I am also of the same view that we dont want to change any
> power status of any IP as it could change the values.

There is a debugfs interface 'amdgpu_reset_dump_register_list_write' tp
add registers to reset_info.reset_dump_reg_list. Presently there is no
check about which registers are added to that list. For ex: if user has
added some GFX related registers, this is going to hang while in GFXOFF
as ip dump state comes later.

Also, all IPs don't handle dynamic wakeup; therefore, regardless of a
reset scenario, direct access to powergated IPs could result in a hang
and that will make things worse.

Thanks,
Lijo

> 
> Regards
> Sunil Khatri
> 
>>
>>
>>>> BTW, since there is already dump_ip state which could capture IP regs
>>>> separately and handle their power/clock gate situations, do you think
>>>> this generic one is still needed?
>>>>
>>>> For whatever we have implemented till now seems to work fine in case
>>>> of GPU hang withotu fidling the
>>>> power state explicitly. But Calling suspend of each IP surely seems
>>>> to change some state in IPs and SW state too.
>>>> So i think until we experience a real problem we should go without
>>>> the generic UNGATE call for all IP's
>>>>
>>>> But i am not an expert of power, so whatever you suggest would make
>>>> more sense for the driver.
>>> Regards
>>> Sunil Khatri
>>>> Thanks,
>>>> Lijo
>>>>
>>>>>> What u suggest ?
>>>>>> Regards
>>>>>> Sunil
>>>>> I quickly validated on Navi22 by adding below call just before we dump
>>>>> registers
>>>>> if(!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
>>>>>
>>>>>       amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>>>>>       amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>>>>>
>>>>>       amdgpu_reset_reg_dumps(tmp_adev);
>>>>>       dev_info(tmp_adev->dev, "Dumping IP State\n");
>>>>>       /* Trigger ip dump before we reset the asic */
>>>>>       for(i=0; i<tmp_adev->num_ip_blocks; i++)
>>>>>           if(tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
>>>>>               tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
>>>>>                                       (void*)tmp_adev);
>>>>>       dev_info(tmp_adev->dev, "Dumping IP State Completed\n");
>>>>> }
>>>>> If this sounds fine with you i am update that. Regards Sunil Khatri
>>>>>>> Thanks,
>>>>>>> Lijo
>>>>>>>
>>>>>>>>> +
>>>>>>>>> +
>>>>>>>>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i],
>>>>>>>>> +
>>>>>>>>> adev->reset_info.reset_dump_reg_value[i]);
>>>>>>>>> +       }
>>>>>>>>> +
>>>>>>>>> +       return 0;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>>     int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>>>>>>>>                                     struct amdgpu_reset_context
>>>>>>>>> *reset_context)
>>>>>>>>>     {
>>>>>>>>>            int i, r = 0;
>>>>>>>>>            struct amdgpu_job *job = NULL;
>>>>>>>>> +       struct amdgpu_device *tmp_adev =
>>>>>>>>> reset_context->reset_req_dev;
>>>>>>>>>            bool need_full_reset =
>>>>>>>>>                    test_bit(AMDGPU_NEED_FULL_RESET,
>>>>>>>>> &reset_context->flags);
>>>>>>>>>
>>>>>>>>> @@ -5340,6 +5358,18 @@ int amdgpu_device_pre_asic_reset(struct
>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>                            }
>>>>>>>>>                    }
>>>>>>>>>
>>>>>>>>> +               if (!test_bit(AMDGPU_SKIP_COREDUMP,
>>>>>>>>> &reset_context->flags)) {
>>>>>>>>> +                       amdgpu_reset_reg_dumps(tmp_adev);
>>>>>>>>> +
>>>>>>>>> +                       dev_info(tmp_adev->dev, "Dumping IP
>>>>>>>>> State\n");
>>>>>>>>> +                       /* Trigger ip dump before we reset the
>>>>>>>>> asic */
>>>>>>>>> +                       for (i = 0; i <
>>>>>>>>> tmp_adev->num_ip_blocks; i++)
>>>>>>>>> +                               if
>>>>>>>>> (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
>>>>>>>>> +
>>>>>>>>> tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
>>>>>>>>> +                                                       (void
>>>>>>>>> *)tmp_adev);
>>>>>>>>> +                       dev_info(tmp_adev->dev, "Dumping IP State
>>>>>>>>> Completed\n");
>>>>>>>>> +               }
>>>>>>>>> +
>>>>>>>>>                    if (need_full_reset)
>>>>>>>>>                            r = amdgpu_device_ip_suspend(adev);
>>>>>>>>>                    if (need_full_reset)
>>>>>>>>> @@ -5352,47 +5382,17 @@ int amdgpu_device_pre_asic_reset(struct
>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>            return r;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> -static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>>>>>> -{
>>>>>>>>> -       int i;
>>>>>>>>> -
>>>>>>>>> -       lockdep_assert_held(&adev->reset_domain->sem);
>>>>>>>>> -
>>>>>>>>> -       for (i = 0; i < adev->reset_info.num_regs; i++) {
>>>>>>>>> -               adev->reset_info.reset_dump_reg_value[i] =
>>>>>>>>> -
>>>>>>>>> RREG32(adev->reset_info.reset_dump_reg_list[i]);
>>>>>>>>> -
>>>>>>>>> -
>>>>>>>>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i],
>>>>>>>>> -
>>>>>>>>> adev->reset_info.reset_dump_reg_value[i]);
>>>>>>>>> -       }
>>>>>>>>> -
>>>>>>>>> -       return 0;
>>>>>>>>> -}
>>>>>>>>> -
>>>>>>>>>     int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>>>>>>                             struct amdgpu_reset_context
>>>>>>>>> *reset_context)
>>>>>>>>>     {
>>>>>>>>>            struct amdgpu_device *tmp_adev = NULL;
>>>>>>>>>            bool need_full_reset, skip_hw_reset, vram_lost = false;
>>>>>>>>>            int r = 0;
>>>>>>>>> -       uint32_t i;
>>>>>>>>>
>>>>>>>>>            /* Try reset handler method first */
>>>>>>>>>            tmp_adev = list_first_entry(device_list_handle, struct
>>>>>>>>> amdgpu_device,
>>>>>>>>>                                        reset_list);
>>>>>>>>>
>>>>>>>>> -       if (!test_bit(AMDGPU_SKIP_COREDUMP,
>>>>>>>>> &reset_context->flags)) {
>>>>>>>>> -               amdgpu_reset_reg_dumps(tmp_adev);
>>>>>>>>> -
>>>>>>>>> -               dev_info(tmp_adev->dev, "Dumping IP State\n");
>>>>>>>>> -               /* Trigger ip dump before we reset the asic */
>>>>>>>>> -               for (i = 0; i < tmp_adev->num_ip_blocks; i++)
>>>>>>>>> -                       if
>>>>>>>>> (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
>>>>>>>>> -                              
>>>>>>>>> tmp_adev->ip_blocks[i].version->funcs
>>>>>>>>> -                               ->dump_ip_state((void *)tmp_adev);
>>>>>>>>> -               dev_info(tmp_adev->dev, "Dumping IP State
>>>>>>>>> Completed\n");
>>>>>>>>> -       }
>>>>>>>>> -
>>>>>>>>>            reset_context->reset_device_list = device_list_handle;
>>>>>>>>>            r = amdgpu_reset_perform_reset(tmp_adev,
>>>>>>>>> reset_context);
>>>>>>>>>            /* If reset handler not implemented, continue;
>>>>>>>>> otherwise
>>>>>>>>> return */
>>>>>>>>> -- 
>>>>>>>>> 2.34.1
>>>>>>>>>