On Thu, Mar 7, 2024 at 3:42 PM Khatri, Sunil <sukhatri@xxxxxxx> wrote:
>
>
> On 3/8/2024 12:44 AM, Alex Deucher wrote:
> > On Thu, Mar 7, 2024 at 12:00 PM Sunil Khatri <sunil.khatri@xxxxxxx> wrote:
> >> Add page fault information to the devcoredump.
> >>
> >> Output of devcoredump:
> >> **** AMDGPU Device Coredump ****
> >> version: 1
> >> kernel: 6.7.0-amd-staging-drm-next
> >> module: amdgpu
> >> time: 29.725011811
> >> process_name: soft_recovery_p PID: 1720
> >>
> >> Ring timed out details
> >> IP Type: 0 Ring Name: gfx_0.0.0
> >>
> >> [gfxhub] Page fault observed
> >> Faulty page starting at address 0x0000000000000000
> > Do you want a : before the address for consistency?
> Sure.
> >
> >> Protection fault status register:0x301031
> > How about a space after the : for consistency?
> >
> > For parsability, it may make more sense to just have a list of key value pairs:
> > [GPU page fault]
> > hub:
> > addr:
> > status:
> > [Ring timeout details]
> > IP:
> > ring:
> > name:
> >
> > etc.
>
> Sure, I agree, but until now I was capturing the information the way we
> print it in dmesg, which is user readable. Once we have enough data I could
> arrange it all in key: value pairs like you suggest, in a later patch, if
> that works?

Sure.

Alex

>
> >
> >> VRAM is lost due to GPU reset!
> >>
> >> Signed-off-by: Sunil Khatri <sunil.khatri@xxxxxxx>
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
> >>  1 file changed, 13 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >> index 147100c27c2d..dd39e614d907 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
> >>                            coredump->ring->name);
> >>         }
> >>
> >> +       if (coredump->adev) {
> >> +               struct amdgpu_vm_fault_info *fault_info =
> >> +                       &coredump->adev->vm_manager.fault_info;
> >> +
> >> +               drm_printf(&p, "\n[%s] Page fault observed\n",
> >> +                          fault_info->vmhub ? "mmhub" : "gfxhub");
> >> +               drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
> >> +                          fault_info->addr);
> >> +               drm_printf(&p, "Protection fault status register:0x%x\n",
> >> +                          fault_info->status);
> >> +       }
> >> +
> >>         if (coredump->reset_vram_lost)
> >> -               drm_printf(&p, "VRAM is lost due to GPU reset!\n");
> >> +               drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
> >>         if (coredump->adev->reset_info.num_regs) {
> >>                 drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");
> >>
> >> --
> >> 2.34.1
> >>
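
For reference, the key: value layout suggested above could look roughly like the userspace sketch below. It is only an illustration, not the follow-up patch: struct fault_info_sketch and format_page_fault() are hypothetical names, the struct merely mirrors the three fields the patch reads (addr, status, vmhub), and snprintf() stands in for drm_printf() so the snippet builds outside the kernel.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the fields the patch reads from
 * struct amdgpu_vm_fault_info. This is not the kernel definition.
 */
struct fault_info_sketch {
	uint64_t addr;
	uint32_t status;
	uint32_t vmhub;	/* nonzero selects "mmhub", as in the patch */
};

/*
 * Format the fault as "key: value" lines under a section header,
 * with a space after every colon. Returns snprintf()'s result: the
 * number of characters that would be written, excluding the NUL.
 */
static int format_page_fault(char *buf, size_t len,
			     const struct fault_info_sketch *fi)
{
	return snprintf(buf, len,
			"[GPU page fault]\n"
			"hub: %s\n"
			"addr: 0x%016llx\n"
			"status: 0x%x\n",
			fi->vmhub ? "mmhub" : "gfxhub",
			(unsigned long long)fi->addr,
			(unsigned int)fi->status);
}
```

With the values from the sample dump (vmhub 0, address 0, status 0x301031) this would give the lines "hub: gfxhub", "addr: 0x0000000000000000" and "status: 0x301031" under the [GPU page fault] header, which a script can split on the first ": " of each line.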