Am 09.07.2018 um 09:48 schrieb Zhang, Jerry (Junwei): > On 07/09/2018 03:04 PM, Christian König wrote: >> Am 09.07.2018 um 07:13 schrieb Zhang, Jerry (Junwei): >>> On 07/06/2018 03:27 AM, Andrey Grodzovsky wrote: >>>> Extract and present the reposnsible process and thread when >>>> VM_FAULT happens. >>>> >>>> v2: Use getter and setter functions. >>>> >>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com> >>>> --- >>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++++ >>>>  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 10 +++++++--- >>>>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 9 +++++++-- >>>>  3 files changed, 18 insertions(+), 5 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>>> index 7a625f3..609c8f5 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>>> @@ -187,6 +187,10 @@ static int amdgpu_cs_parser_init(struct >>>> amdgpu_cs_parser *p, void *data) >>>>      if (p->uf_entry.robj) >>>>          p->job->uf_addr = uf_offset; >>>>      kfree(chunk_array); >>>> + >>>> +   /* Use this opportunity to fill in task info for the vm */ >>>> +   amdgpu_vm_set_task_info(vm); >>>> + >>> >>> Shall we set the task info when vm init? >> >> No, vm_init() is called from a completely different process which is >> later on user of the VM. > > Originally I thought UMD opened DRI node and create a VM by vm_init(), > then every command submission > would be passed in the same VM initialized by vm_init(). > > So that's different process? The display server, e.g. X or Wayland. See with DRI3 the process of opening a connection to the hardware is that the display server open the file descriptor and with that calls vm_init. And then this file descriptor is passed to the client processes through IPC. Regards, Christian. > > >> >>> >>> >>>>      return 0; >>>> >>>>  free_all_kdata: >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c >>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c >>>> index 08753e7..75f3ffb 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c >>>> @@ -46,7 +46,6 @@ >>>> >>>>  #include "ivsrcid/ivsrcid_vislands30.h" >>>> >>>> - >>>>  static void gmc_v8_0_set_gmc_funcs(struct amdgpu_device *adev); >>>>  static void gmc_v8_0_set_irq_funcs(struct amdgpu_device *adev); >>>>  static int gmc_v8_0_wait_for_idle(void *handle); >>>> @@ -1449,8 +1448,13 @@ static int gmc_v8_0_process_interrupt(struct >>>> amdgpu_device *adev, >>>>          gmc_v8_0_set_fault_enable_default(adev, false); >>>> >>>>      if (printk_ratelimit()) { >>>> -       dev_err(adev->dev, "GPU fault detected: %d 0x%08x\n", >>>> -           entry->src_id, entry->src_data[0]); >>>> +       struct amdgpu_task_info task_info = { 0 }; >>>> + >>>> +       amdgpu_vm_get_task_info(adev, entry->pasid, &task_info); >>> >>> Shall we find vm and get the vm->task_info directly instead of >>> filling local variable task_info? >>> (current style is also OK for me, just for more info) >> >> No, we can't guarantee that the task_info pointer in the VM won't be >> freed after dropping the spinlock. So returning a copy here is the >> better approach. >> >>> >>> And we may also check the return value for "NULL" case, otherwise it >>> may cause access errorin dev_err(), >>> if failed to find vm (although, it's most unlikely to happen). >> >> Since we use a copy of the task info we should never get a NULL >> pointer. The string should already be zero terminated with the "{ 0 >> }" initialization above. > > Thanks to explain that more. > Got it, fine for me. > > Jerry > >> >> Christian. >> >>> >>> Jerry >>> >>>> + >>>> +       dev_err(adev->dev, "GPU fault detected: %d 0x%08x for >>>> process %s pid %d thread %s pid %d\n", >>>> +           entry->src_id, entry->src_data[0], >>>> task_info.process_name, >>>> +           task_info.tgid, task_info.task_name, task_info.pid); >>>>          dev_err(adev->dev, " VM_CONTEXT1_PROTECTION_FAULT_ADDR >>>> 0x%08X\n", >>>>              addr); >>>>          dev_err(adev->dev, " VM_CONTEXT1_PROTECTION_FAULT_STATUS >>>> 0x%08X\n", >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>>> index 691a659..9df94b4 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>>> @@ -259,11 +259,16 @@ static int gmc_v9_0_process_interrupt(struct >>>> amdgpu_device *adev, >>>>      } >>>> >>>>      if (printk_ratelimit()) { >>>> +       struct amdgpu_task_info task_info = { 0 }; >>>> + >>>> +       amdgpu_vm_get_task_info(adev, entry->pasid, &task_info); >>>> + >>>>          dev_err(adev->dev, >>>> -           "[%s] VMC page fault (src_id:%u ring:%u vmid:%u >>>> pasid:%u)\n", >>>> +           "[%s] VMC page fault (src_id:%u ring:%u vmid:%u >>>> pasid:%u, for process %s pid %d thread %s pid %d\n)\n", >>>>              entry->vmid_src ? "mmhub" : "gfxhub", >>>>              entry->src_id, entry->ring_id, entry->vmid, >>>> -           entry->pasid); >>>> +           entry->pasid, task_info.process_name, task_info.tgid, >>>> +           task_info.task_name, task_info.pid); >>>>          dev_err(adev->dev, " at page 0x%016llx from %d\n", >>>>              addr, entry->client_id); >>>>          if (!amdgpu_sriov_vf(adev)) >>>> >>