David,

We already have a query that returns whether a device is lost. It's called
amdgpu_cs_query_reset_state. It should return whether a hang was caused by a
given context, or whether a hang happened but the context is innocent. You can
extend it to accept no context, in which case it will return either NO_RESET
(everything is OK) or UNKNOWN_RESET (a hang happened, but the caller didn't
specify a context).

Marek

On Wed, May 17, 2017 at 10:56 AM, Christian König <deathsimple at vodafone.de> wrote:
> On 17.05.2017 at 10:46, zhoucm1 wrote:
> > On 2017-05-17 16:40, Christian König wrote:
> > On 17.05.2017 at 10:01, Michel Dänzer wrote:
> > On 17/05/17 04:13 PM, zhoucm1 wrote:
> > On 2017-05-17 14:57, Michel Dänzer wrote:
> > On 17/05/17 01:28 PM, zhoucm1 wrote:
> > On 2017-05-17 11:15, Michel Dänzer wrote:
> > On 17/05/17 12:04 PM, zhoucm1 wrote:
> > On 2017-05-17 09:18, Michel Dänzer wrote:
> > On 16/05/17 06:25 PM, Chunming Zhou wrote:
> >
> > Change-Id: I8eb6d7f558da05510e429d3bf1d48c8cec6c1977
> > Signed-off-by: Chunming Zhou <David1.Zhou at amd.com>
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > index bca1fb5..f3e7525 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > @@ -2547,6 +2547,9 @@ int amdgpu_vm_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
> >          case AMDGPU_VM_OP_UNRESERVE_VMID:
> >                  amdgpu_vm_free_reserved_vmid(adev, &fpriv->vm, AMDGPU_GFXHUB);
> >                  break;
> > +        case AMDGPU_VM_OP_RESET:
> > +                fpriv->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
> > +                break;
> >
> > How do you envision the UMDs using this? I can mostly think of them
> > calling this ioctl when a context is created or destroyed. But that
> > would also allow any other remaining contexts using the same DRM file
> > descriptor to use all ioctls again. So, I think there needs to be a
> > vram_lost_counter in struct amdgpu_ctx instead of in struct amdgpu_fpriv.
> >
> > struct amdgpu_fpriv is the proper place for vram_lost_counter, especially
> > for the ioctl return value.
> > If you need to reset contexts one by one, we can mark all contexts of that
> > VM, and then userspace can reset them.
> >
> > I'm not following. With vram_lost_counter in amdgpu_fpriv, if any
> > context calls this ioctl, all other contexts using the same file
> > descriptor will also be considered safe again, right?
> >
> > Yes, but it really depends on the userspace requirement: if you need to
> > reset contexts one by one, we can mark all contexts of that VM as guilty,
> > and then userspace can reset one context at a time.
> >
> > Still not sure what you mean by that.
> >
> > E.g. what do you mean by "guilty"? I thought that refers to the context
> > which caused a hang. But it seems like you're using it to refer to any
> > context which hasn't reacted yet to VRAM contents being lost.
> >
> > When VRAM is lost, we treat all contexts as needing a reset.
> >
> > Essentially, your patches only track VRAM contents being lost per file
> > descriptor, not per context. I'm not sure (rather skeptical) that this
> > is suitable for OpenGL UMDs, since state is usually tracked per context.
> > Marek / Nicolai?
> >
> > Oh, yeah, that's a good point.
> >
> > The problem with tracking it per context is that Vulkan also wants the
> > ENODEV on amdgpu_gem_va_ioctl() and amdgpu_info_ioctl(), which are
> > context-less.
>
> But thinking more about this, blocking those two doesn't make much sense.
> The VM content can be restored, and why should we disallow reading GPU info?
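
As an aside on the per-context tracking question above: below is a rough,
purely illustrative sketch of how a UMD could already check a single context
with the existing query. It assumes the libdrm prototype
amdgpu_cs_query_reset_state(context, &state, &hangs) and the
AMDGPU_CTX_*_RESET values from amdgpu_drm.h; check_context_reset() is just a
made-up helper name, not an existing function.

  #include <stdint.h>
  #include <amdgpu.h>      /* libdrm_amdgpu */
  #include <amdgpu_drm.h>  /* AMDGPU_CTX_*_RESET */

  /* Returns non-zero if the UMD should treat this context as lost and
   * recreate it (and re-upload whatever lived in VRAM). */
  static int check_context_reset(amdgpu_context_handle ctx)
  {
          uint32_t state = AMDGPU_CTX_NO_RESET;
          uint32_t hangs = 0;

          if (amdgpu_cs_query_reset_state(ctx, &state, &hangs))
                  return 0;  /* query failed, assume nothing happened */

          switch (state) {
          case AMDGPU_CTX_GUILTY_RESET:   /* this context caused a hang */
          case AMDGPU_CTX_INNOCENT_RESET: /* another context caused it */
          case AMDGPU_CTX_UNKNOWN_RESET:  /* a hang happened, cause unknown */
                  return 1;
          case AMDGPU_CTX_NO_RESET:
          default:
                  return 0;
          }
  }

Whether the UMD then recreates just that one context or everything behind the
file descriptor is exactly the per-context vs. per-fpriv question raised above.
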
> > I can re-paste the Vulkan APIs requiring ENODEV:
> > "
> > The Vulkan APIs listed below could return VK_ERROR_DEVICE_LOST according
> > to the spec.
> >
> > I tried to provide a list of u/k interfaces that could be called for each
> > Vulkan API.
>
> Well, those are the Vulkan requirements, but that doesn't necessarily mean
> we must follow them on the kernel side. Keep in mind that Vulkan can't
> impose any requirements on the kernel driver.
>
> IIRC we already have a query Vulkan can use to figure out whether a GPU
> reset happened or not. So they could use that instead.
>
> Regards,
> Christian.
>
> > vkCreateDevice
> >     - amdgpu_device_initialize
> >     - amdgpu_query_gpu_info
> >
> > vkQueueSubmit
> >     - amdgpu_cs_submit
> >
> > vkWaitForFences
> >     - amdgpu_cs_wait_fences
> >
> > vkGetEventStatus
> > vkQueueWaitIdle
> > vkDeviceWaitIdle
> > vkGetQueryPoolResults
> >     - amdgpu_cs_query_fence_status
> >
> > vkQueueBindSparse
> >     - amdgpu_bo_va_op
> >     - amdgpu_bo_va_op_raw
> >
> > vkCreateSwapchainKHR
> > vkAcquireNextImageKHR
> > vkQueuePresentKHR
> >     - not related to a u/k interface
> >
> > Besides those listed above, I think amdgpu_cs_signal_sem/amdgpu_cs_wait_sem
> > should respond to a GPU reset as well.
> > "
>
> Christian.
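
For completeness, here is an equally rough sketch of how the Vulkan UMD could
map that existing reset query to VK_ERROR_DEVICE_LOST itself, instead of
depending on ENODEV from the ioctls listed above. vk_check_device_lost() is a
made-up name, and the same amdgpu_cs_query_reset_state() prototype as in the
earlier sketch is assumed.

  #include <stdint.h>
  #include <amdgpu.h>          /* libdrm_amdgpu */
  #include <amdgpu_drm.h>      /* AMDGPU_CTX_*_RESET */
  #include <vulkan/vulkan.h>   /* VkResult */

  /* Derive device loss from the per-context reset query rather than from
   * ENODEV returned by context-less ioctls. */
  static VkResult vk_check_device_lost(amdgpu_context_handle ctx)
  {
          uint32_t state = AMDGPU_CTX_NO_RESET;
          uint32_t hangs = 0;

          if (!amdgpu_cs_query_reset_state(ctx, &state, &hangs) &&
              state != AMDGPU_CTX_NO_RESET)
                  return VK_ERROR_DEVICE_LOST;

          return VK_SUCCESS;
  }

The APIs in the list above (vkQueueSubmit, vkWaitForFences, and so on) could
then return this result without the kernel having to block the amdgpu_gem_va
or amdgpu_info ioctls at all.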