[PATCH 4/4] drm/amdgpu: reset fpriv vram_lost_counter

michel@xxxxxxxxxxx (Michel Dänzer) · Wed, 17 May 2017 17:55:01 +0900

On 17/05/17 05:46 PM, zhoucm1 wrote:
> 
> 
> On 2017å¹´05æ??17æ?¥ 16:40, Christian KÃ¶nig wrote:
>> Am 17.05.2017 um 10:01 schrieb Michel DÃ¤nzer:
>>> On 17/05/17 04:13 PM, zhoucm1 wrote:
>>>> On 2017å¹´05æ??17æ?¥ 14:57, Michel DÃ¤nzer wrote:
>>>>> On 17/05/17 01:28 PM, zhoucm1 wrote:
>>>>>> On 2017å¹´05æ??17æ?¥ 11:15, Michel DÃ¤nzer wrote:
>>>>>>> On 17/05/17 12:04 PM, zhoucm1 wrote:
>>>>>>>> On 2017å¹´05æ??17æ?¥ 09:18, Michel DÃ¤nzer wrote:
>>>>>>>>> On 16/05/17 06:25 PM, Chunming Zhou wrote:
>>>>>>>>>> Change-Id: I8eb6d7f558da05510e429d3bf1d48c8cec6c1977
>>>>>>>>>> Signed-off-by: Chunming Zhou <David1.Zhou at amd.com>
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>>>>> index bca1fb5..f3e7525 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>>>>> @@ -2547,6 +2547,9 @@ int amdgpu_vm_ioctl(struct drm_device *dev,
>>>>>>>>>> void *data, struct drm_file *filp)
>>>>>>>>>>          case AMDGPU_VM_OP_UNRESERVE_VMID:
>>>>>>>>>>              amdgpu_vm_free_reserved_vmid(adev, &fpriv->vm,
>>>>>>>>>> AMDGPU_GFXHUB);
>>>>>>>>>>              break;
>>>>>>>>>> +    case AMDGPU_VM_OP_RESET:
>>>>>>>>>> +        fpriv->vram_lost_counter =
>>>>>>>>>> atomic_read(&adev->vram_lost_counter);
>>>>>>>>>> +        break;
>>>>>>>>> How do you envision the UMDs using this? I can mostly think of
>>>>>>>>> them
>>>>>>>>> calling this ioctl when a context is created or destroyed. But
>>>>>>>>> that
>>>>>>>>> would also allow any other remaining contexts using the same
>>>>>>>>> DRM file
>>>>>>>>> descriptor to use all ioctls again. So, I think there needs to
>>>>>>>>> be a
>>>>>>>>> vram_lost_counter in struct amdgpu_ctx instead of in struct
>>>>>>>>> amdgpu_fpriv.
>>>>>>>> struct amdgpu_fpriv for vram_lost_counter is proper place,
>>>>>>>> especially
>>>>>>>> for ioctl return value.
>>>>>>>> if you need to reset ctx one by one, we can mark all contexts of
>>>>>>>> that
>>>>>>>> vm, and then reset by userspace.
>>>>>>> I'm not following. With vram_lost_counter in amdgpu_fpriv, if any
>>>>>>> context calls this ioctl, all other contexts using the same file
>>>>>>> descriptor will also be considered safe again, right?
>>>>>> Yes, but it really depends on userspace requirement, if you need to
>>>>>> reset ctx one by one, we can mark all contexts of that vm to
>>>>>> guilty, and
>>>>>> then reset one context by userspace.
>>>>> Still not sure what you mean by that.
>>>>>
>>>>> E.g. what do you mean by "guilty"? I thought that refers to the
>>>>> context
>>>>> which caused a hang. But it seems like you're using it to refer to any
>>>>> context which hasn't reacted yet to VRAM contents being lost.
>>>> When vram is lost, we treat all contexts need to reset.
>>> Essentially, your patches only track VRAM contents being lost per file
>>> descriptor, not per context. I'm not sure (rather skeptical) that this
>>> is suitable for OpenGL UMDs, since state is usually tracked per context.
>>> Marek / Nicolai?
>>
>> Oh, yeah that's a good point.
>>
>> The problem with tracking it per context is that Vulkan also wants the
>> ENODEV on the amdgpu_gem_va_ioct() and amdgpu_info_ioctl() which are
>> context less.
>>
>> But thinking more about this blocking those two doesn't make much
>> sense. The VM content can be restored and why should be disallow
>> reading GPU info?
> I can re-paste the Vulkan APIs requiring ENODEV:
> "
> 
> The Vulkan APIs listed below could return VK_ERROR_DEVICE_LOST according
> to the spec.
> 
> I tries to provide a list of u/k interfaces that could be called for
> each vk API.
> 
>  
> 
> vkCreateDevice
> 
> -          amdgpu_device_initialize.
> -          amdgpu_query_gpu_info

[...]

> vkQueueBindSparse**
> 
>                 amdgpu_bo_va_op
>                 amdgpu_bo_va_op_raw

So these *can* return VK_ERROR_DEVICE_LOST, but do they *have to* do
that after a reset? :)

-- 
Earthling Michel DÃ¤nzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer