[PATCH 4/4] drm/amdgpu: reset fpriv vram_lost_counter

david1.zhou@xxxxxxx (zhoucm1) · Wed, 17 May 2017 15:13:41 +0800



On 2017-05-17 14:57, Michel Dänzer wrote:
> On 17/05/17 01:28 PM, zhoucm1 wrote:
>> On 2017-05-17 11:15, Michel Dänzer wrote:
>>> On 17/05/17 12:04 PM, zhoucm1 wrote:
>>>> On 2017-05-17 09:18, Michel Dänzer wrote:
>>>>> On 16/05/17 06:25 PM, Chunming Zhou wrote:
>>>>>> Change-Id: I8eb6d7f558da05510e429d3bf1d48c8cec6c1977
>>>>>> Signed-off-by: Chunming Zhou <David1.Zhou at amd.com>
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>> index bca1fb5..f3e7525 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>> @@ -2547,6 +2547,9 @@ int amdgpu_vm_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>>>>>>         case AMDGPU_VM_OP_UNRESERVE_VMID:
>>>>>>             amdgpu_vm_free_reserved_vmid(adev, &fpriv->vm, AMDGPU_GFXHUB);
>>>>>>             break;
>>>>>> +    case AMDGPU_VM_OP_RESET:
>>>>>> +        fpriv->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>>>>>> +        break;
>>>>> How do you envision the UMDs using this? I can mostly think of them
>>>>> calling this ioctl when a context is created or destroyed. But that
>>>>> would also allow any other remaining contexts using the same DRM file
>>>>> descriptor to use all ioctls again. So, I think there needs to be a
>>>>> vram_lost_counter in struct amdgpu_ctx instead of in struct
>>>>> amdgpu_fpriv.
>>>> struct amdgpu_fpriv is the proper place for vram_lost_counter, especially
>>>> for the ioctl return value.
>>>> If you need to reset ctx one by one, we can mark all contexts of that
>>>> VM, and then let userspace reset them.
>>> I'm not following. With vram_lost_counter in amdgpu_fpriv, if any
>>> context calls this ioctl, all other contexts using the same file
>>> descriptor will also be considered safe again, right?
>> Yes, but it really depends on the userspace requirements. If you need to
>> reset ctx one by one, we can mark all contexts of that VM as guilty, and
>> then have userspace reset one context.
> Still not sure what you mean by that.
>
> E.g. what do you mean by "guilty"? I thought that refers to the context
> which caused a hang. But it seems like you're using it to refer to any
> context which hasn't reacted yet to VRAM contents being lost.
When VRAM is lost, we treat all contexts as needing a reset.
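
To make the two placements discussed above concrete, here is a minimal userspace sketch. It is not kernel code: the model_* structs and helpers are hypothetical stand-ins for the real amdgpu structures, and the per-context variant is only an assumption about how the alternative could look. It compares a snapshot counter kept per file descriptor with one kept per context.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are NOT the real amdgpu structures. */
struct model_dev   { unsigned int vram_lost_counter; }; /* bumped on each VRAM loss */
struct model_fpriv { unsigned int vram_lost_counter; }; /* one snapshot per DRM fd */
struct model_ctx   { unsigned int vram_lost_counter; }; /* one snapshot per context */

/* A snapshot is stale when it no longer matches the device counter. */
static bool model_stale(const struct model_dev *dev, unsigned int snapshot)
{
	return snapshot != dev->vram_lost_counter;
}

/* Models the AMDGPU_VM_OP_RESET case from the patch: file-wide resync. */
static void model_fpriv_reset(const struct model_dev *dev,
			      struct model_fpriv *fpriv)
{
	fpriv->vram_lost_counter = dev->vram_lost_counter;
}

/* Hypothetical per-context resync (assumed alternative, not in the patch). */
static void model_ctx_reset(const struct model_dev *dev,
			    struct model_ctx *ctx)
{
	ctx->vram_lost_counter = dev->vram_lost_counter;
}

static void model_demo(void)
{
	struct model_dev dev = { 0 };
	struct model_fpriv fpriv = { 0 };
	struct model_ctx a = { 0 }, b = { 0 };

	dev.vram_lost_counter++; /* VRAM contents are lost once */
	assert(model_stale(&dev, fpriv.vram_lost_counter));
	assert(model_stale(&dev, a.vram_lost_counter));
	assert(model_stale(&dev, b.vram_lost_counter));

	/* Per-file snapshot: one reset clears the stale state for the whole fd. */
	model_fpriv_reset(&dev, &fpriv);
	assert(!model_stale(&dev, fpriv.vram_lost_counter));

	/* Per-context snapshot: only the context that reset itself recovers. */
	model_ctx_reset(&dev, &a);
	assert(!model_stale(&dev, a.vram_lost_counter));
	assert(model_stale(&dev, b.vram_lost_counter));
}
```

With the per-file snapshot, a single reset makes every context on that file descriptor look safe again; with the per-context snapshot, context b stays stale until it resets itself.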

Regards,
David Zhou
>
> Also not sure what you mean by "if you need to reset ctx one by one".
>
>