On 7/26/2024 8:11 PM, Khatri, Sunil wrote: > > On 7/26/2024 7:53 PM, Khatri, Sunil wrote: >> >> On 7/26/2024 7:18 PM, Lazar, Lijo wrote: >>> >>> On 7/26/2024 6:42 PM, Alex Deucher wrote: >>>> On Fri, Jul 26, 2024 at 8:48 AM Sunil Khatri <sunil.khatri@xxxxxxx> >>>> wrote: >>>>> Problem: >>>>> IP dump right now is done post suspend of >>>>> all IP's which for some IP's could change power >>>>> state and software state too which we do not want >>>>> to reflect in the dump as it might not be same at >>>>> the time of hang. >>>>> >>>>> Solution: >>>>> IP should be dumped as close to the HW state when >>>>> the GPU was in hung state without trying to reinitialize >>>>> any resource. >>>>> >>>>> Signed-off-by: Sunil Khatri <sunil.khatri@xxxxxxx> >>>> Acked-by: Alex Deucher <alexander.deucher@xxxxxxx> >>>> >>>>> --- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60 >>>>> +++++++++++----------- >>>>> 1 file changed, 30 insertions(+), 30 deletions(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> index 730dae77570c..74f6f15e73b5 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> @@ -5277,11 +5277,29 @@ int amdgpu_device_mode1_reset(struct >>>>> amdgpu_device *adev) >>>>> return ret; >>>>> } >>>>> >>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev) >>>>> +{ >>>>> + int i; >>>>> + >>>>> + lockdep_assert_held(&adev->reset_domain->sem); >>>>> + >>>>> + for (i = 0; i < adev->reset_info.num_regs; i++) { >>>>> + adev->reset_info.reset_dump_reg_value[i] = >>>>> + >>>>> RREG32(adev->reset_info.reset_dump_reg_list[i]); >>> A suspend also involves power/clock ungate. When reg dump is moved >>> earlier, I'm not sure if this read works for all. If it's left to >>> individual IP call backs, they could just do the same or better to move >>> these up before a dump. >> Suspend also put the status.hw = false and each IP in their respective >> suspend state which i feel does change the state of the HW. >> To get the correct snapshot of the GPU register we should not be >> fiddling with the HW IP at least till we capture the dump and that is >> the intention behind the change. >> >> Do you think there is a problem in this approach? >>> >>> amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE); >>> amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE); >> Adding this does sounds better to enable just before we dump the >> registers but i am not very sure if ungating would be clean here or >> not. i Could try quickly adding these two calls just before dump. >> >> All i am worried if it does change some register reflecting the >> original state of registers at dump. >> I was thinking that if it includes some GFX regs and the hang happened because of some SDMA/VCN jobs which somehow keeps GFXOFF state intact. BTW, since there is already dump_ip state which could capture IP regs separately and handle their power/clock gate situations, do you think this generic one is still needed? Thanks, Lijo >> What u suggest ? >> Regards >> Sunil > I quickly validated on Navi22 by adding below call just before we dump > registers > if(!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) { > > amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE); > amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE); > > amdgpu_reset_reg_dumps(tmp_adev); > dev_info(tmp_adev->dev, "Dumping IP State\n"); > /* Trigger ip dump before we reset the asic */ > for(i=0; i<tmp_adev->num_ip_blocks; i++) > if(tmp_adev->ip_blocks[i].version->funcs->dump_ip_state) > tmp_adev->ip_blocks[i].version->funcs->dump_ip_state( > (void*)tmp_adev); > dev_info(tmp_adev->dev, "Dumping IP State Completed\n"); > } > If this sounds fine with you i am update that. Regards Sunil Khatri >> >>> >>> Thanks, >>> Lijo >>> >>>>> + >>>>> + >>>>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i], >>>>> + >>>>> adev->reset_info.reset_dump_reg_value[i]); >>>>> + } >>>>> + >>>>> + return 0; >>>>> +} >>>>> + >>>>> int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev, >>>>> struct amdgpu_reset_context >>>>> *reset_context) >>>>> { >>>>> int i, r = 0; >>>>> struct amdgpu_job *job = NULL; >>>>> + struct amdgpu_device *tmp_adev = reset_context->reset_req_dev; >>>>> bool need_full_reset = >>>>> test_bit(AMDGPU_NEED_FULL_RESET, >>>>> &reset_context->flags); >>>>> >>>>> @@ -5340,6 +5358,18 @@ int amdgpu_device_pre_asic_reset(struct >>>>> amdgpu_device *adev, >>>>> } >>>>> } >>>>> >>>>> + if (!test_bit(AMDGPU_SKIP_COREDUMP, >>>>> &reset_context->flags)) { >>>>> + amdgpu_reset_reg_dumps(tmp_adev); >>>>> + >>>>> + dev_info(tmp_adev->dev, "Dumping IP State\n"); >>>>> + /* Trigger ip dump before we reset the asic */ >>>>> + for (i = 0; i < tmp_adev->num_ip_blocks; i++) >>>>> + if >>>>> (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state) >>>>> + >>>>> tmp_adev->ip_blocks[i].version->funcs->dump_ip_state( >>>>> + (void >>>>> *)tmp_adev); >>>>> + dev_info(tmp_adev->dev, "Dumping IP State >>>>> Completed\n"); >>>>> + } >>>>> + >>>>> if (need_full_reset) >>>>> r = amdgpu_device_ip_suspend(adev); >>>>> if (need_full_reset) >>>>> @@ -5352,47 +5382,17 @@ int amdgpu_device_pre_asic_reset(struct >>>>> amdgpu_device *adev, >>>>> return r; >>>>> } >>>>> >>>>> -static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev) >>>>> -{ >>>>> - int i; >>>>> - >>>>> - lockdep_assert_held(&adev->reset_domain->sem); >>>>> - >>>>> - for (i = 0; i < adev->reset_info.num_regs; i++) { >>>>> - adev->reset_info.reset_dump_reg_value[i] = >>>>> - >>>>> RREG32(adev->reset_info.reset_dump_reg_list[i]); >>>>> - >>>>> - >>>>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i], >>>>> - >>>>> adev->reset_info.reset_dump_reg_value[i]); >>>>> - } >>>>> - >>>>> - return 0; >>>>> -} >>>>> - >>>>> int amdgpu_do_asic_reset(struct list_head *device_list_handle, >>>>> struct amdgpu_reset_context *reset_context) >>>>> { >>>>> struct amdgpu_device *tmp_adev = NULL; >>>>> bool need_full_reset, skip_hw_reset, vram_lost = false; >>>>> int r = 0; >>>>> - uint32_t i; >>>>> >>>>> /* Try reset handler method first */ >>>>> tmp_adev = list_first_entry(device_list_handle, struct >>>>> amdgpu_device, >>>>> reset_list); >>>>> >>>>> - if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) { >>>>> - amdgpu_reset_reg_dumps(tmp_adev); >>>>> - >>>>> - dev_info(tmp_adev->dev, "Dumping IP State\n"); >>>>> - /* Trigger ip dump before we reset the asic */ >>>>> - for (i = 0; i < tmp_adev->num_ip_blocks; i++) >>>>> - if >>>>> (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state) >>>>> - tmp_adev->ip_blocks[i].version->funcs >>>>> - ->dump_ip_state((void *)tmp_adev); >>>>> - dev_info(tmp_adev->dev, "Dumping IP State >>>>> Completed\n"); >>>>> - } >>>>> - >>>>> reset_context->reset_device_list = device_list_handle; >>>>> r = amdgpu_reset_perform_reset(tmp_adev, reset_context); >>>>> /* If reset handler not implemented, continue; otherwise >>>>> return */ >>>>> -- >>>>> 2.34.1 >>>>>