[PATCH v3 2/2] drm/admgpu: Present amdgpu_task_info in VM_FAULTS.

christian.koenig@xxxxxxx (Christian König) · Mon, 9 Jul 2018 10:55:43 +0200

Am 09.07.2018 um 09:48 schrieb Zhang, Jerry (Junwei):
> On 07/09/2018 03:04 PM, Christian KÃ¶nig wrote:
>> Am 09.07.2018 um 07:13 schrieb Zhang, Jerry (Junwei):
>>> On 07/06/2018 03:27 AM, Andrey Grodzovsky wrote:
>>>> Extract and present the reposnsible process and thread when
>>>> VM_FAULT happens.
>>>>
>>>> v2: Use getter and setter functions.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>> ---
>>>> Â  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c |Â  4 ++++
>>>> Â  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.cÂ  | 10 +++++++---
>>>> Â  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.cÂ  |Â  9 +++++++--
>>>> Â  3 files changed, 18 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> index 7a625f3..609c8f5 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> @@ -187,6 +187,10 @@ static int amdgpu_cs_parser_init(struct 
>>>> amdgpu_cs_parser *p, void *data)
>>>> Â Â Â Â Â  if (p->uf_entry.robj)
>>>> Â Â Â Â Â Â Â Â Â  p->job->uf_addr = uf_offset;
>>>> Â Â Â Â Â  kfree(chunk_array);
>>>> +
>>>> +Â Â Â  /* Use this opportunity to fill in task info for the vm */
>>>> +Â Â Â  amdgpu_vm_set_task_info(vm);
>>>> +
>>>
>>> Shall we set the task info when vm init?
>>
>> No, vm_init() is called from a completely different process which is 
>> later on user of the VM.
>
> Originally I thought UMD opened DRI node and create a VM by vm_init(), 
> then every command submission
> would be passed in the same VM initialized by vm_init().
>
> So that's different process?

The display server, e.g. X or Wayland.

See with DRI3 the process of opening a connection to the hardware is 
that the display server open the file descriptor and with that calls 
vm_init.

And then this file descriptor is passed to the client processes through IPC.

Regards,
Christian.

>
>
>>
>>>
>>>
>>>> Â Â Â Â Â  return 0;
>>>>
>>>> Â  free_all_kdata:
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>>>> index 08753e7..75f3ffb 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>>>> @@ -46,7 +46,6 @@
>>>>
>>>> Â  #include "ivsrcid/ivsrcid_vislands30.h"
>>>>
>>>> -
>>>> Â  static void gmc_v8_0_set_gmc_funcs(struct amdgpu_device *adev);
>>>> Â  static void gmc_v8_0_set_irq_funcs(struct amdgpu_device *adev);
>>>> Â  static int gmc_v8_0_wait_for_idle(void *handle);
>>>> @@ -1449,8 +1448,13 @@ static int gmc_v8_0_process_interrupt(struct 
>>>> amdgpu_device *adev,
>>>> Â Â Â Â Â Â Â Â Â  gmc_v8_0_set_fault_enable_default(adev, false);
>>>>
>>>> Â Â Â Â Â  if (printk_ratelimit()) {
>>>> -Â Â Â Â Â Â Â  dev_err(adev->dev, "GPU fault detected: %d 0x%08x\n",
>>>> -Â Â Â Â Â Â Â Â Â Â Â  entry->src_id, entry->src_data[0]);
>>>> +Â Â Â Â Â Â Â  struct amdgpu_task_info task_info = { 0 };
>>>> +
>>>> +Â Â Â Â Â Â Â  amdgpu_vm_get_task_info(adev, entry->pasid, &task_info);
>>>
>>> Shall we find vm and get the vm->task_info directly instead of 
>>> filling local variable task_info?
>>> (current style is also OK for me, just for more info)
>>
>> No, we can't guarantee that the task_info pointer in the VM won't be 
>> freed after dropping the spinlock. So returning a copy here is the 
>> better approach.
>>
>>>
>>> And we may also check the return value for "NULL" case, otherwise it 
>>> may cause access errorin dev_err(),
>>> if failed to find vm (although, it's most unlikely to happen).
>>
>> Since we use a copy of the task info we should never get a NULL 
>> pointer. The string should already be zero terminated with the "{ 0 
>> }" initialization above.
>
> Thanks to explain that more.
> Got it, fine for me.
>
> Jerry
>
>>
>> Christian.
>>
>>>
>>> Jerry
>>>
>>>> +
>>>> +Â Â Â Â Â Â Â  dev_err(adev->dev, "GPU fault detected: %d 0x%08x for 
>>>> process %s pid %d thread %s pid %d\n",
>>>> +Â Â Â Â Â Â Â Â Â Â Â  entry->src_id, entry->src_data[0], 
>>>> task_info.process_name,
>>>> +Â Â Â Â Â Â Â Â Â Â Â  task_info.tgid, task_info.task_name, task_info.pid);
>>>> Â Â Â Â Â Â Â Â Â  dev_err(adev->dev, " VM_CONTEXT1_PROTECTION_FAULT_ADDR 
>>>> 0x%08X\n",
>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â  addr);
>>>> Â Â Â Â Â Â Â Â Â  dev_err(adev->dev, " VM_CONTEXT1_PROTECTION_FAULT_STATUS 
>>>> 0x%08X\n",
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>> index 691a659..9df94b4 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>> @@ -259,11 +259,16 @@ static int gmc_v9_0_process_interrupt(struct 
>>>> amdgpu_device *adev,
>>>> Â Â Â Â Â  }
>>>>
>>>> Â Â Â Â Â  if (printk_ratelimit()) {
>>>> +Â Â Â Â Â Â Â  struct amdgpu_task_info task_info = { 0 };
>>>> +
>>>> +Â Â Â Â Â Â Â  amdgpu_vm_get_task_info(adev, entry->pasid, &task_info);
>>>> +
>>>> Â Â Â Â Â Â Â Â Â  dev_err(adev->dev,
>>>> -Â Â Â Â Â Â Â Â Â Â Â  "[%s] VMC page fault (src_id:%u ring:%u vmid:%u 
>>>> pasid:%u)\n",
>>>> +Â Â Â Â Â Â Â Â Â Â Â  "[%s] VMC page fault (src_id:%u ring:%u vmid:%u 
>>>> pasid:%u, for process %s pid %d thread %s pid %d\n)\n",
>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â  entry->vmid_src ? "mmhub" : "gfxhub",
>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â  entry->src_id, entry->ring_id, entry->vmid,
>>>> -Â Â Â Â Â Â Â Â Â Â Â  entry->pasid);
>>>> +Â Â Â Â Â Â Â Â Â Â Â  entry->pasid, task_info.process_name, task_info.tgid,
>>>> +Â Â Â Â Â Â Â Â Â Â Â  task_info.task_name, task_info.pid);
>>>> Â Â Â Â Â Â Â Â Â  dev_err(adev->dev, "Â  at page 0x%016llx from %d\n",
>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â  addr, entry->client_id);
>>>> Â Â Â Â Â Â Â Â Â  if (!amdgpu_sriov_vf(adev))
>>>>
>>