[Public]

> -----Original Message-----
> From: Sunil Khatri <sunil.khatri@xxxxxxx>
> Sent: Wednesday, March 6, 2024 1:20 PM
> To: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian
> <Christian.Koenig@xxxxxxx>; Sharma, Shashank
> <Shashank.Sharma@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; Joshi, Mukul <Mukul.Joshi@xxxxxxx>;
> Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam@xxxxxxx>;
> Khatri, Sunil <Sunil.Khatri@xxxxxxx>
> Subject: [PATCH] drm/amdgpu: add vm fault information to devcoredump
>
> Add page fault information to the devcoredump.
>
> Output of devcoredump:
> **** AMDGPU Device Coredump ****
> version: 1
> kernel: 6.7.0-amd-staging-drm-next
> module: amdgpu
> time: 29.725011811
> process_name: soft_recovery_p PID: 1720
>
> Ring timed out details
> IP Type: 0 Ring Name: gfx_0.0.0
>
> [gfxhub] Page fault observed for GPU family:143 Faulty page starting at

I think we should add a separate section for the GPU identification
information (family, PCI IDs, IP versions, etc.). For this patch, I think
it's fine to just print the fault address and status.

Alex

> address 0x0000000000000000 Protection fault status register:0x301031
>
> VRAM is lost due to GPU reset!
>
> Signed-off-by: Sunil Khatri <sunil.khatri@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 ++++++++++++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  1 +
>  2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> index 147100c27c2d..d7fea6cdf2f9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
>  			   coredump->ring->name);
>  	}
>
> +	if (coredump->fault_info.status) {
> +		struct amdgpu_vm_fault_info *fault_info = &coredump->fault_info;
> +
> +		drm_printf(&p, "\n[%s] Page fault observed for GPU family:%d\n",
> +			   fault_info->vmhub ? "mmhub" : "gfxhub",
> +			   coredump->adev->family);
> +		drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
> +			   fault_info->addr);
> +		drm_printf(&p, "Protection fault status register:0x%x\n",
> +			   fault_info->status);
> +	}
> +
>  	if (coredump->reset_vram_lost)
> -		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
> +		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
>  	if (coredump->adev->reset_info.num_regs) {
>  		drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");
>
> @@ -253,6 +265,7 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
>  	if (job) {
>  		s_job = &job->base;
>  		coredump->ring = to_amdgpu_ring(s_job->sched);
> +		coredump->fault_info = job->vm->fault_info;
>  	}
>
>  	coredump->adev = adev;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> index 60522963aaca..3197955264f9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> @@ -98,6 +98,7 @@ struct amdgpu_coredump_info {
>  	struct timespec64 reset_time;
>  	bool reset_vram_lost;
>  	struct amdgpu_ring *ring;
> +	struct amdgpu_vm_fault_info fault_info;
>  };
>  #endif
>
> --
> 2.34.1
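
To make the suggestion above concrete, here is a rough userspace sketch of the fault section with the family dropped, so that only the hub, fault address and status are printed. It is not the kernel code: `struct fault_info_sketch` and `format_fault_section()` are hypothetical names standing in for `struct amdgpu_vm_fault_info` and the `drm_printf()` calls in the patch, and the exact wording of the output lines is an assumption.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for struct amdgpu_vm_fault_info; only the three
 * fields the patch reads (addr, status, vmhub) are modelled here.
 */
struct fault_info_sketch {
	uint64_t addr;   /* faulting page address */
	uint32_t status; /* protection fault status register value */
	uint32_t vmhub;  /* 0 is treated as gfxhub, non-zero as mmhub */
};

/*
 * Format the page fault section without any GPU identification.
 * Returns the snprintf() result, or 0 (and an empty string) when no
 * fault was recorded, mirroring the "if (status)" check in the patch.
 */
static int format_fault_section(char *buf, size_t len,
				const struct fault_info_sketch *fi)
{
	if (!fi->status) {
		if (len)
			buf[0] = '\0';
		return 0;
	}

	return snprintf(buf, len,
			"\n[%s] Page fault observed\n"
			"Faulty page starting at address 0x%016" PRIx64 "\n"
			"Protection fault status register:0x%" PRIx32 "\n",
			fi->vmhub ? "mmhub" : "gfxhub",
			fi->addr, fi->status);
}
```

The family, PCI IDs and IP versions would then go into their own section in a follow-up patch rather than being repeated in each fault message.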