On 7/27/2024 12:51 AM, Khatri, Sunil wrote: > > On 7/27/2024 12:13 AM, Alex Deucher wrote: >> On Fri, Jul 26, 2024 at 1:16 PM Khatri, Sunil <sukhatri@xxxxxxx> wrote: >>> >>> On 7/26/2024 8:36 PM, Lazar, Lijo wrote: >>>> On 7/26/2024 8:11 PM, Khatri, Sunil wrote: >>>>> On 7/26/2024 7:53 PM, Khatri, Sunil wrote: >>>>>> On 7/26/2024 7:18 PM, Lazar, Lijo wrote: >>>>>>> On 7/26/2024 6:42 PM, Alex Deucher wrote: >>>>>>>> On Fri, Jul 26, 2024 at 8:48 AM Sunil Khatri <sunil.khatri@xxxxxxx> >>>>>>>> wrote: >>>>>>>>> Problem: >>>>>>>>> IP dump right now is done post suspend of >>>>>>>>> all IP's which for some IP's could change power >>>>>>>>> state and software state too which we do not want >>>>>>>>> to reflect in the dump as it might not be same at >>>>>>>>> the time of hang. >>>>>>>>> >>>>>>>>> Solution: >>>>>>>>> IP should be dumped as close to the HW state when >>>>>>>>> the GPU was in hung state without trying to reinitialize >>>>>>>>> any resource. >>>>>>>>> >>>>>>>>> Signed-off-by: Sunil Khatri <sunil.khatri@xxxxxxx> >>>>>>>> Acked-by: Alex Deucher <alexander.deucher@xxxxxxx> >>>>>>>> >>>>>>>>> --- >>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60 >>>>>>>>> +++++++++++----------- >>>>>>>>> 1 file changed, 30 insertions(+), 30 deletions(-) >>>>>>>>> >>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>>> index 730dae77570c..74f6f15e73b5 100644 >>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>>> @@ -5277,11 +5277,29 @@ int amdgpu_device_mode1_reset(struct >>>>>>>>> amdgpu_device *adev) >>>>>>>>> return ret; >>>>>>>>> } >>>>>>>>> >>>>>>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev) >>>>>>>>> +{ >>>>>>>>> + int i; >>>>>>>>> + >>>>>>>>> + lockdep_assert_held(&adev->reset_domain->sem); >>>>>>>>> + >>>>>>>>> + for (i = 0; i < adev->reset_info.num_regs; i++) { >>>>>>>>> + adev->reset_info.reset_dump_reg_value[i] = >>>>>>>>> + >>>>>>>>> RREG32(adev->reset_info.reset_dump_reg_list[i]); >>>>>>> A suspend also involves power/clock ungate. When reg dump is moved >>>>>>> earlier, I'm not sure if this read works for all. If it's left to >>>>>>> individual IP call backs, they could just do the same or better >>>>>>> to move >>>>>>> these up before a dump. >>>>>> Suspend also put the status.hw = false and each IP in their >>>>>> respective >>>>>> suspend state which i feel does change the state of the HW. >>>>>> To get the correct snapshot of the GPU register we should not be >>>>>> fiddling with the HW IP at least till we capture the dump and that is >>>>>> the intention behind the change. >>>>>> >>>>>> Do you think there is a problem in this approach? >>>>>>> amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE); >>>>>>> amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE); >>>>>> Adding this does sounds better to enable just before we dump the >>>>>> registers but i am not very sure if ungating would be clean here or >>>>>> not. i Could try quickly adding these two calls just before dump. >>>>>> >>>>>> All i am worried if it does change some register reflecting the >>>>>> original state of registers at dump. >>>>>> >>>> I was thinking that if it includes some GFX regs and the hang happened >>>> because of some SDMA/VCN jobs which somehow keeps GFXOFF state intact. >>> For GFX and SDMA hangs we make sure to disable GFXOFF before so for >>> those IP's >>> I am not worried and also based on my testing on NAVI22 for GPU hang >>> their registers >>> seems to be read cleanly. >>> Another point that i was thinking is after a GPU hang no where till the >>> point of dump >>> any registers are touched and that is what we need considering we are >>> able to read the registers. >>> >>> For VCN there is dynamic gating which is controlled by the FW if i am >>> not wrong which makes the VCN >>> off and registers cant be read. Only in case of VCN hang i am assuming >>> due to a Job pending VCN should be in power ON >>> state and out intent of reading the registers should work fine. Sadly i >>> am unable to validate it as there arent any test readily >>> available to hang VCN. >> We want to take the register dump as early as possible in the reset >> sequence, ideally before any of the hw gets touched as part of the >> reset sequence. Otherwise, the dump is not going to be useful. >> >> Alex > > Sure Alex. I am also of the same view that we dont want to change any > power status of any IP as it could change the values. There is a debugfs interface 'amdgpu_reset_dump_register_list_write' tp add registers to reset_info.reset_dump_reg_list. Presently there is no check about which registers are added to that list. For ex: if user has added some GFX related registers, this is going to hang while in GFXOFF as ip dump state comes later. Also, all IPs don't handle dynamic wakeup; therefore, regardless of a reset scenario, direct access to powergated IPs could result in a hang and that will make things worse. Thanks, Lijo > > Regards > Sunil Khatri > >> >> >>>> BTW, since there is already dump_ip state which could capture IP regs >>>> separately and handle their power/clock gate situations, do you think >>>> this generic one is still needed? >>>> >>>> For whatever we have implemented till now seems to work fine in case >>>> of GPU hang withotu fidling the >>>> power state explicitly. But Calling suspend of each IP surely seems >>>> to change some state in IPs and SW state too. >>>> So i think until we experience a real problem we should go without >>>> the generic UNGATE call for all IP's >>>> >>>> But i am not an expert of power, so whatever you suggest would make >>>> more sense for the driver. >>> Regards >>> Sunil Khatri >>>> Thanks, >>>> Lijo >>>> >>>>>> What u suggest ? >>>>>> Regards >>>>>> Sunil >>>>> I quickly validated on Navi22 by adding below call just before we dump >>>>> registers >>>>> if(!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) { >>>>> >>>>> amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE); >>>>> amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE); >>>>> >>>>> amdgpu_reset_reg_dumps(tmp_adev); >>>>> dev_info(tmp_adev->dev, "Dumping IP State\n"); >>>>> /* Trigger ip dump before we reset the asic */ >>>>> for(i=0; i<tmp_adev->num_ip_blocks; i++) >>>>> if(tmp_adev->ip_blocks[i].version->funcs->dump_ip_state) >>>>> tmp_adev->ip_blocks[i].version->funcs->dump_ip_state( >>>>> (void*)tmp_adev); >>>>> dev_info(tmp_adev->dev, "Dumping IP State Completed\n"); >>>>> } >>>>> If this sounds fine with you i am update that. Regards Sunil Khatri >>>>>>> Thanks, >>>>>>> Lijo >>>>>>> >>>>>>>>> + >>>>>>>>> + >>>>>>>>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i], >>>>>>>>> + >>>>>>>>> adev->reset_info.reset_dump_reg_value[i]); >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> + return 0; >>>>>>>>> +} >>>>>>>>> + >>>>>>>>> int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev, >>>>>>>>> struct amdgpu_reset_context >>>>>>>>> *reset_context) >>>>>>>>> { >>>>>>>>> int i, r = 0; >>>>>>>>> struct amdgpu_job *job = NULL; >>>>>>>>> + struct amdgpu_device *tmp_adev = >>>>>>>>> reset_context->reset_req_dev; >>>>>>>>> bool need_full_reset = >>>>>>>>> test_bit(AMDGPU_NEED_FULL_RESET, >>>>>>>>> &reset_context->flags); >>>>>>>>> >>>>>>>>> @@ -5340,6 +5358,18 @@ int amdgpu_device_pre_asic_reset(struct >>>>>>>>> amdgpu_device *adev, >>>>>>>>> } >>>>>>>>> } >>>>>>>>> >>>>>>>>> + if (!test_bit(AMDGPU_SKIP_COREDUMP, >>>>>>>>> &reset_context->flags)) { >>>>>>>>> + amdgpu_reset_reg_dumps(tmp_adev); >>>>>>>>> + >>>>>>>>> + dev_info(tmp_adev->dev, "Dumping IP >>>>>>>>> State\n"); >>>>>>>>> + /* Trigger ip dump before we reset the >>>>>>>>> asic */ >>>>>>>>> + for (i = 0; i < >>>>>>>>> tmp_adev->num_ip_blocks; i++) >>>>>>>>> + if >>>>>>>>> (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state) >>>>>>>>> + >>>>>>>>> tmp_adev->ip_blocks[i].version->funcs->dump_ip_state( >>>>>>>>> + (void >>>>>>>>> *)tmp_adev); >>>>>>>>> + dev_info(tmp_adev->dev, "Dumping IP State >>>>>>>>> Completed\n"); >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> if (need_full_reset) >>>>>>>>> r = amdgpu_device_ip_suspend(adev); >>>>>>>>> if (need_full_reset) >>>>>>>>> @@ -5352,47 +5382,17 @@ int amdgpu_device_pre_asic_reset(struct >>>>>>>>> amdgpu_device *adev, >>>>>>>>> return r; >>>>>>>>> } >>>>>>>>> >>>>>>>>> -static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev) >>>>>>>>> -{ >>>>>>>>> - int i; >>>>>>>>> - >>>>>>>>> - lockdep_assert_held(&adev->reset_domain->sem); >>>>>>>>> - >>>>>>>>> - for (i = 0; i < adev->reset_info.num_regs; i++) { >>>>>>>>> - adev->reset_info.reset_dump_reg_value[i] = >>>>>>>>> - >>>>>>>>> RREG32(adev->reset_info.reset_dump_reg_list[i]); >>>>>>>>> - >>>>>>>>> - >>>>>>>>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i], >>>>>>>>> - >>>>>>>>> adev->reset_info.reset_dump_reg_value[i]); >>>>>>>>> - } >>>>>>>>> - >>>>>>>>> - return 0; >>>>>>>>> -} >>>>>>>>> - >>>>>>>>> int amdgpu_do_asic_reset(struct list_head *device_list_handle, >>>>>>>>> struct amdgpu_reset_context >>>>>>>>> *reset_context) >>>>>>>>> { >>>>>>>>> struct amdgpu_device *tmp_adev = NULL; >>>>>>>>> bool need_full_reset, skip_hw_reset, vram_lost = false; >>>>>>>>> int r = 0; >>>>>>>>> - uint32_t i; >>>>>>>>> >>>>>>>>> /* Try reset handler method first */ >>>>>>>>> tmp_adev = list_first_entry(device_list_handle, struct >>>>>>>>> amdgpu_device, >>>>>>>>> reset_list); >>>>>>>>> >>>>>>>>> - if (!test_bit(AMDGPU_SKIP_COREDUMP, >>>>>>>>> &reset_context->flags)) { >>>>>>>>> - amdgpu_reset_reg_dumps(tmp_adev); >>>>>>>>> - >>>>>>>>> - dev_info(tmp_adev->dev, "Dumping IP State\n"); >>>>>>>>> - /* Trigger ip dump before we reset the asic */ >>>>>>>>> - for (i = 0; i < tmp_adev->num_ip_blocks; i++) >>>>>>>>> - if >>>>>>>>> (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state) >>>>>>>>> - >>>>>>>>> tmp_adev->ip_blocks[i].version->funcs >>>>>>>>> - ->dump_ip_state((void *)tmp_adev); >>>>>>>>> - dev_info(tmp_adev->dev, "Dumping IP State >>>>>>>>> Completed\n"); >>>>>>>>> - } >>>>>>>>> - >>>>>>>>> reset_context->reset_device_list = device_list_handle; >>>>>>>>> r = amdgpu_reset_perform_reset(tmp_adev, >>>>>>>>> reset_context); >>>>>>>>> /* If reset handler not implemented, continue; >>>>>>>>> otherwise >>>>>>>>> return */ >>>>>>>>> -- >>>>>>>>> 2.34.1 >>>>>>>>>