On Tue, Aug 20, 2024 at 11:07 AM Khatri, Sunil <sunil.khatri@xxxxxxx> wrote: > > > On 8/20/2024 7:36 PM, Alex Deucher wrote: > > On Tue, Aug 20, 2024 at 3:30 AM Huang, Trigger <Trigger.Huang@xxxxxxx> wrote: > >> [AMD Official Use Only - AMD Internal Distribution Only] > >> > >>> -----Original Message----- > >>> From: Khatri, Sunil <Sunil.Khatri@xxxxxxx> > >>> Sent: Monday, August 19, 2024 6:31 PM > >>> To: Huang, Trigger <Trigger.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > >>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx> > >>> Subject: Re: [PATCH 2/2] drm/amdgpu: Do core dump immediately when job > >>> tmo > >>> > >>> > >>> On 8/19/2024 3:23 PM, Trigger.Huang@xxxxxxx wrote: > >>>> From: Trigger Huang <Trigger.Huang@xxxxxxx> > >>>> > >>>> Do the coredump immediately after a job timeout to get a closer > >>>> representation of GPU's error status. > >>>> > >>>> V2: This will skip printing vram_lost as the GPU reset is not happened > >>>> yet (Alex) > >>>> > >>>> V3: Unconditionally call the core dump as we care about all the reset > >>>> functions(soft-recovery and queue reset and full adapter reset, Alex) > >>>> > >>>> Signed-off-by: Trigger Huang <Trigger.Huang@xxxxxxx> > >>>> --- > >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 62 > >>> +++++++++++++++++++++++++ > >>>> 1 file changed, 62 insertions(+) > >>>> > >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > >>>> index c6a1783fc9ef..ebbb1434073e 100644 > >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > >>>> @@ -30,6 +30,61 @@ > >>>> #include "amdgpu.h" > >>>> #include "amdgpu_trace.h" > >>>> #include "amdgpu_reset.h" > >>>> +#include "amdgpu_dev_coredump.h" > >>>> +#include "amdgpu_xgmi.h" > >>>> + > >>>> +static void amdgpu_job_do_core_dump(struct amdgpu_device *adev, > >>>> + struct amdgpu_job *job) > >>>> +{ > >>>> + int i; > >>>> + > >>>> + dev_info(adev->dev, "Dumping IP State\n"); > >>>> + for (i = 0; i < adev->num_ip_blocks; i++) { > >>>> + if (adev->ip_blocks[i].version->funcs->dump_ip_state) > >>>> + adev->ip_blocks[i].version->funcs > >>>> + ->dump_ip_state((void *)adev); > >>>> + dev_info(adev->dev, "Dumping IP State Completed\n"); > >>>> + } > >>>> + > >>>> + amdgpu_coredump(adev, true, false, job); } > >>>> + > >>>> +static void amdgpu_job_core_dump(struct amdgpu_device *adev, > >>>> + struct amdgpu_job *job) > >>>> +{ > >>>> + struct list_head device_list, *device_list_handle = NULL; > >>>> + struct amdgpu_device *tmp_adev = NULL; > >>>> + struct amdgpu_hive_info *hive = NULL; > >>>> + > >>>> + if (!amdgpu_sriov_vf(adev)) > >>>> + hive = amdgpu_get_xgmi_hive(adev); > >>>> + if (hive) > >>>> + mutex_lock(&hive->hive_lock); > >>>> + /* > >>>> + * Reuse the logic in amdgpu_device_gpu_recover() to build list of > >>>> + * devices for code dump > >>>> + */ > >>>> + INIT_LIST_HEAD(&device_list); > >>>> + if (!amdgpu_sriov_vf(adev) && (adev- > >>>> gmc.xgmi.num_physical_nodes > 1) && hive) { > >>>> + list_for_each_entry(tmp_adev, &hive->device_list, > >>> gmc.xgmi.head) > >>>> + list_add_tail(&tmp_adev->reset_list, &device_list); > >>>> + if (!list_is_first(&adev->reset_list, &device_list)) > >>>> + list_rotate_to_front(&adev->reset_list, &device_list); > >>>> + device_list_handle = &device_list; > >>>> + } else { > >>>> + list_add_tail(&adev->reset_list, &device_list); > >>>> + device_list_handle = &device_list; > >>>> + } > >>>> + > >>>> + /* Do the coredump for each device */ > >>>> + list_for_each_entry(tmp_adev, device_list_handle, reset_list) > >>>> + amdgpu_job_do_core_dump(tmp_adev, job); > >>>> + > >>>> + if (hive) { > >>>> + mutex_unlock(&hive->hive_lock); > >>>> + amdgpu_put_xgmi_hive(hive); > >>>> + } > >>>> +} > >>>> > >>>> static enum drm_gpu_sched_stat amdgpu_job_timedout(struct > >>> drm_sched_job *s_job) > >>>> { > >>>> @@ -48,6 +103,7 @@ static enum drm_gpu_sched_stat > >>> amdgpu_job_timedout(struct drm_sched_job *s_job) > >>>> return DRM_GPU_SCHED_STAT_ENODEV; > >>>> } > >>>> > >>>> + amdgpu_job_core_dump(adev, job); > >>> The philosophy is hang and recovery is to let the HW and software try to > >>> recover. Here we try to do a soft recovery first and i think we should wait for > >>> seft recovery and if fails then we do dump and thats exactly we are doing here. > >> Hi Sunil , > >> thanks for the suggestion, and that's reasonable. But my concern is that after soft recovery happened, the GPU's status may change(take gfx 9 for example, it will try to kill the current hang wave) > >> Actually, in most cases, a real shader hang cannot be resolved through soft recovery, and at that moment, we need to get a very close dump/snapshot/representation of GPU's current error status. > >> Just like the scandump, when we trying to do a scandump for a shader hang, we will disable gpu_recovery, and no soft recovery/per-queue reset/HW reset will happen before the scandump, right? > >> On most products, there are no scandump interfaces, so core dump is even more important for debugging GPU hang issue. > > I agree that it makes sense to take the dump as early as possible. It > > makes sense to move it up now that we are starting to have finer > > grained resets. We may want to wire devcoredump into the KFD side as > > well at some point. > In the current implementation we do stop the Scheduler and wait for > existing fences to complete or signal them. But the new place to dump > will not have anything like that and even though we have timeout some of > the threads or waves might be still progressing the IP dump might be > changing during that time. Sure, but that is all sw state. It shouldn't affect the hw state. Or if it does, it would be indirect. E.g., signalling a fence might unblock a dependent job, but the root issue would still be valid. > > But i am not sure if we want to stop the scheduling of new tasks and end > the existing one. How about we have multiple level of dumps i.e just > capture and not dump as first dump is what is captured not last. > a. Capture after soft reset and let it be overwritten by next level > b. After soft reset fails capture again before hard reset is triggered. > > So eventually we would have the ip dump file generated of what we are > interested in. It might make more sense to slice and dice the dumps more finely, but I think we'll need to see how things work out as is before we worry about those details. Alex > > Regards > Sunil K > > > > > > > Alex > > > >> Regards, > >> Trigger > >> > >>> Also we need to make sure that the tasks which are already in queue are put > >>> on hold and the their sync points are signalled before we dump. > >>> check once what all steps are taken before we dump in the current > >>> implementation. > >> Do you mean sometimes like: > >> drm_sched_wqueue_stop(&ring->sched); > >> amdgpu_fence_driver_force_completion(ring); // Since there is no GPU reset happened, is it reasonable to call it here? > >> amdgpu_job_core_dump(adev, job); > >> > >> > >> Regards, > >> Trigger > >> > >>> Regards > >>> > >>> Sunil khatri > >>> > >>>> adev->job_hang = true; > >>>> > >>>> @@ -101,6 +157,12 @@ static enum drm_gpu_sched_stat > >>> amdgpu_job_timedout(struct drm_sched_job *s_job) > >>>> reset_context.src = AMDGPU_RESET_SRC_JOB; > >>>> clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags); > >>>> > >>>> + /* > >>>> + * To avoid an unnecessary extra coredump, as we have > >>> already > >>>> + * got the very close representation of GPU's error status > >>>> + */ > >>>> + set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags); > >>>> + > >>>> r = amdgpu_device_gpu_recover(ring->adev, job, > >>> &reset_context); > >>>> if (r) > >>>> dev_err(adev->dev, "GPU Recovery Failed: %d\n", r);