[PATCH 4/4] drm/amdgpu: reset fpriv vram_lost_counter

On 2017-05-17 16:40, Christian König wrote:
> On 17.05.2017 at 10:01, Michel Dänzer wrote:
>> On 17/05/17 04:13 PM, zhoucm1 wrote:
>>> On 2017-05-17 14:57, Michel Dänzer wrote:
>>>> On 17/05/17 01:28 PM, zhoucm1 wrote:
>>>>> On 2017-05-17 11:15, Michel Dänzer wrote:
>>>>>> On 17/05/17 12:04 PM, zhoucm1 wrote:
>>>>>>> On 2017-05-17 09:18, Michel Dänzer wrote:
>>>>>>>> On 16/05/17 06:25 PM, Chunming Zhou wrote:
>>>>>>>>> Change-Id: I8eb6d7f558da05510e429d3bf1d48c8cec6c1977
>>>>>>>>> Signed-off-by: Chunming Zhou <David1.Zhou at amd.com>
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>>>> index bca1fb5..f3e7525 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>>>> @@ -2547,6 +2547,9 @@ int amdgpu_vm_ioctl(struct drm_device *dev,
>>>>>>>>> void *data, struct drm_file *filp)
>>>>>>>>>          case AMDGPU_VM_OP_UNRESERVE_VMID:
>>>>>>>>>              amdgpu_vm_free_reserved_vmid(adev, &fpriv->vm,
>>>>>>>>> AMDGPU_GFXHUB);
>>>>>>>>>              break;
>>>>>>>>> +    case AMDGPU_VM_OP_RESET:
>>>>>>>>> +        fpriv->vram_lost_counter =
>>>>>>>>> atomic_read(&adev->vram_lost_counter);
>>>>>>>>> +        break;
>>>>>>>> How do you envision the UMDs using this? I can mostly think of them
>>>>>>>> calling this ioctl when a context is created or destroyed. But that
>>>>>>>> would also allow any other remaining contexts using the same DRM file
>>>>>>>> descriptor to use all ioctls again. So, I think there needs to be a
>>>>>>>> vram_lost_counter in struct amdgpu_ctx instead of in struct
>>>>>>>> amdgpu_fpriv.
>>>>>>> struct amdgpu_fpriv is the proper place for vram_lost_counter,
>>>>>>> especially for the ioctl return value.
>>>>>>> If you need to reset contexts one by one, we can mark all contexts
>>>>>>> of that vm and then reset them from userspace.
>>>>>> I'm not following. With vram_lost_counter in amdgpu_fpriv, if any
>>>>>> context calls this ioctl, all other contexts using the same file
>>>>>> descriptor will also be considered safe again, right?
>>>>> Yes, but it really depends on the userspace requirements: if you need
>>>>> to reset contexts one by one, we can mark all contexts of that vm as
>>>>> guilty, and then let userspace reset one context at a time.
>>>> Still not sure what you mean by that.
>>>>
>>>> E.g. what do you mean by "guilty"? I thought that refers to the context
>>>> which caused a hang. But it seems like you're using it to refer to any
>>>> context which hasn't reacted yet to VRAM contents being lost.
>>> When VRAM is lost, we treat all contexts as needing a reset.
>> Essentially, your patches only track VRAM contents being lost per file
>> descriptor, not per context. I'm not sure (rather skeptical) that this
>> is suitable for OpenGL UMDs, since state is usually tracked per context.
>> Marek / Nicolai?
>
> Oh, yeah that's a good point.
>
> The problem with tracking it per context is that Vulkan also wants the
> ENODEV on amdgpu_gem_va_ioctl() and amdgpu_info_ioctl(), which are
> context-less.
>
> But thinking more about this, blocking those two doesn't make much
> sense. The VM content can be restored, and why should we disallow
> reading GPU info?
I can re-paste the Vulkan APIs requiring ENODEV:

"The Vulkan APIs listed below could return VK_ERROR_DEVICE_LOST according
to the spec. I tried to provide a list of u/k interfaces that could be
called for each Vulkan API.

vkCreateDevice
        - amdgpu_device_initialize
        - amdgpu_query_gpu_info

vkQueueSubmit
        - amdgpu_cs_submit

vkWaitForFences
        - amdgpu_cs_wait_fences

vkGetEventStatus
vkQueueWaitIdle
vkDeviceWaitIdle
vkGetQueryPoolResults
        - amdgpu_cs_query_fence_status

vkQueueBindSparse
        - amdgpu_bo_va_op
        - amdgpu_bo_va_op_raw

vkCreateSwapchainKHR
vkAcquireNextImageKHR
vkQueuePresentKHR
        - not related to a u/k interface

Besides those listed above, I think amdgpu_cs_signal_sem/amdgpu_cs_wait_sem
should respond to gpu reset as well."
>
> Christian.
>
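
And for completeness, a sketch of how userspace might invoke the proposed
op once it has re-created its contexts and restored its buffers. Note that
the posted hunk only adds the kernel-side case, so the AMDGPU_VM_OP_RESET
define and its value below are assumptions:

#include <xf86drm.h>
#include <amdgpu_drm.h>

#ifndef AMDGPU_VM_OP_RESET
#define AMDGPU_VM_OP_RESET 3    /* assumed; not yet in amdgpu_drm.h */
#endif

/* Re-arm fpriv->vram_lost_counter for this DRM file descriptor so its
 * ioctls stop returning -ENODEV after a recovery. */
static int amdgpu_reset_vram_lost_counter(int drm_fd)
{
        union drm_amdgpu_vm vm = {};

        vm.in.op = AMDGPU_VM_OP_RESET;
        return drmCommandWriteRead(drm_fd, DRM_AMDGPU_VM, &vm, sizeof(vm));
}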
