On 07/09/2018 03:04 PM, Christian König wrote: > Am 09.07.2018 um 07:13 schrieb Zhang, Jerry (Junwei): >> On 07/06/2018 03:27 AM, Andrey Grodzovsky wrote: >>> Extract and present the reposnsible process and thread when >>> VM_FAULT happens. >>> >>> v2: Use getter and setter functions. >>> >>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++++ >>> drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 10 +++++++--- >>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 9 +++++++-- >>> 3 files changed, 18 insertions(+), 5 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>> index 7a625f3..609c8f5 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>> @@ -187,6 +187,10 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data) >>> if (p->uf_entry.robj) >>> p->job->uf_addr = uf_offset; >>> kfree(chunk_array); >>> + >>> + /* Use this opportunity to fill in task info for the vm */ >>> + amdgpu_vm_set_task_info(vm); >>> + >> >> Shall we set the task info when vm init? > > No, vm_init() is called from a completely different process which is later on user of the VM. Originally I thought UMD opened DRI node and create a VM by vm_init(), then every command submission would be passed in the same VM initialized by vm_init(). So that's different process? > >> >> >>> return 0; >>> >>> free_all_kdata: >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c >>> index 08753e7..75f3ffb 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c >>> @@ -46,7 +46,6 @@ >>> >>> #include "ivsrcid/ivsrcid_vislands30.h" >>> >>> - >>> static void gmc_v8_0_set_gmc_funcs(struct amdgpu_device *adev); >>> static void gmc_v8_0_set_irq_funcs(struct amdgpu_device *adev); >>> static int gmc_v8_0_wait_for_idle(void *handle); >>> @@ -1449,8 +1448,13 @@ static int gmc_v8_0_process_interrupt(struct amdgpu_device *adev, >>> gmc_v8_0_set_fault_enable_default(adev, false); >>> >>> if (printk_ratelimit()) { >>> - dev_err(adev->dev, "GPU fault detected: %d 0x%08x\n", >>> - entry->src_id, entry->src_data[0]); >>> + struct amdgpu_task_info task_info = { 0 }; >>> + >>> + amdgpu_vm_get_task_info(adev, entry->pasid, &task_info); >> >> Shall we find vm and get the vm->task_info directly instead of filling local variable task_info? >> (current style is also OK for me, just for more info) > > No, we can't guarantee that the task_info pointer in the VM won't be freed after dropping the spinlock. So returning a copy here is the better approach. > >> >> And we may also check the return value for "NULL" case, otherwise it may cause access errorin dev_err(), >> if failed to find vm (although, it's most unlikely to happen). > > Since we use a copy of the task info we should never get a NULL pointer. The string should already be zero terminated with the "{ 0 }" initialization above. Thanks to explain that more. Got it, fine for me. Jerry > > Christian. > >> >> Jerry >> >>> + >>> + dev_err(adev->dev, "GPU fault detected: %d 0x%08x for process %s pid %d thread %s pid %d\n", >>> + entry->src_id, entry->src_data[0], task_info.process_name, >>> + task_info.tgid, task_info.task_name, task_info.pid); >>> dev_err(adev->dev, " VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x%08X\n", >>> addr); >>> dev_err(adev->dev, " VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x%08X\n", >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>> index 691a659..9df94b4 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>> @@ -259,11 +259,16 @@ static int gmc_v9_0_process_interrupt(struct amdgpu_device *adev, >>> } >>> >>> if (printk_ratelimit()) { >>> + struct amdgpu_task_info task_info = { 0 }; >>> + >>> + amdgpu_vm_get_task_info(adev, entry->pasid, &task_info); >>> + >>> dev_err(adev->dev, >>> - "[%s] VMC page fault (src_id:%u ring:%u vmid:%u pasid:%u)\n", >>> + "[%s] VMC page fault (src_id:%u ring:%u vmid:%u pasid:%u, for process %s pid %d thread %s pid %d\n)\n", >>> entry->vmid_src ? "mmhub" : "gfxhub", >>> entry->src_id, entry->ring_id, entry->vmid, >>> - entry->pasid); >>> + entry->pasid, task_info.process_name, task_info.tgid, >>> + task_info.task_name, task_info.pid); >>> dev_err(adev->dev, " at page 0x%016llx from %d\n", >>> addr, entry->client_id); >>> if (!amdgpu_sriov_vf(adev)) >>> >