[PATCH 4/4] drm/amdgpu: reset fpriv vram_lost_counter

David1.Zhou@xxxxxxx (Zhou, David(ChunMing)) · Wed, 17 May 2017 10:15:34 +0000

Yes, agree, it's easy to extend it as your like.
For innocent context, which would need us to find out the context of guilty job, which is TODO.

Regards,
David Zhou

-----Original Message-----
From: Marek OlÅ¡Ã¡k [mailto:maraeo@xxxxxxxxx] 
Sent: Wednesday, May 17, 2017 5:50 PM
To: Christian KÃ¶nig <deathsimple at vodafone.de>
Cc: Zhou, David(ChunMing) <David1.Zhou at amd.com>; Michel DÃ¤nzer <michel at daenzer.net>; amd-gfx mailing list <amd-gfx at lists.freedesktop.org>
Subject: Re: [PATCH 4/4] drm/amdgpu: reset fpriv vram_lost_counter

David,

We already have a query that returns whether a device is lost. It's called amdgpu_cs_query_reset_state. That should return whether a hang was caused by a certain context or whether the hang happened but the context is innocent. You can extend it to accept no context, in which case it will return either NO_RESET (everything is OK) or UNKNOWN_RESET (= when a hang happened but the caller didn't specify the context).

Marek

On Wed, May 17, 2017 at 10:56 AM, Christian KÃ¶nig <deathsimple at vodafone.de> wrote:
> Am 17.05.2017 um 10:46 schrieb zhoucm1:
>
>
>
> On 2017å¹´05æ??17æ?¥ 16:40, Christian KÃ¶nig wrote:
>
> Am 17.05.2017 um 10:01 schrieb Michel DÃ¤nzer:
>
> On 17/05/17 04:13 PM, zhoucm1 wrote:
>
> On 2017å¹´05æ??17æ?¥ 14:57, Michel DÃ¤nzer wrote:
>
> On 17/05/17 01:28 PM, zhoucm1 wrote:
>
> On 2017å¹´05æ??17æ?¥ 11:15, Michel DÃ¤nzer wrote:
>
> On 17/05/17 12:04 PM, zhoucm1 wrote:
>
> On 2017å¹´05æ??17æ?¥ 09:18, Michel DÃ¤nzer wrote:
>
> On 16/05/17 06:25 PM, Chunming Zhou wrote:
>
> Change-Id: I8eb6d7f558da05510e429d3bf1d48c8cec6c1977
> Signed-off-by: Chunming Zhou <David1.Zhou at amd.com>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index bca1fb5..f3e7525 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2547,6 +2547,9 @@ int amdgpu_vm_ioctl(struct drm_device *dev, void 
> *data, struct drm_file *filp)
>          case AMDGPU_VM_OP_UNRESERVE_VMID:
>              amdgpu_vm_free_reserved_vmid(adev, &fpriv->vm, 
> AMDGPU_GFXHUB);
>              break;
> +    case AMDGPU_VM_OP_RESET:
> +        fpriv->vram_lost_counter =
> atomic_read(&adev->vram_lost_counter);
> +        break;
>
> How do you envision the UMDs using this? I can mostly think of them 
> calling this ioctl when a context is created or destroyed. But that 
> would also allow any other remaining contexts using the same DRM file 
> descriptor to use all ioctls again. So, I think there needs to be a 
> vram_lost_counter in struct amdgpu_ctx instead of in struct 
> amdgpu_fpriv.
>
> struct amdgpu_fpriv for vram_lost_counter is proper place, especially 
> for ioctl return value.
> if you need to reset ctx one by one, we can mark all contexts of that 
> vm, and then reset by userspace.
>
> I'm not following. With vram_lost_counter in amdgpu_fpriv, if any 
> context calls this ioctl, all other contexts using the same file 
> descriptor will also be considered safe again, right?
>
> Yes, but it really depends on userspace requirement, if you need to 
> reset ctx one by one, we can mark all contexts of that vm to guilty, 
> and then reset one context by userspace.
>
> Still not sure what you mean by that.
>
> E.g. what do you mean by "guilty"? I thought that refers to the 
> context which caused a hang. But it seems like you're using it to 
> refer to any context which hasn't reacted yet to VRAM contents being lost.
>
> When vram is lost, we treat all contexts need to reset.
>
> Essentially, your patches only track VRAM contents being lost per file 
> descriptor, not per context. I'm not sure (rather skeptical) that this 
> is suitable for OpenGL UMDs, since state is usually tracked per context.
> Marek / Nicolai?
>
>
> Oh, yeah that's a good point.
>
> The problem with tracking it per context is that Vulkan also wants the 
> ENODEV on the amdgpu_gem_va_ioct() and amdgpu_info_ioctl() which are 
> context less.
>
> But thinking more about this blocking those two doesn't make much 
> sense. The VM content can be restored and why should be disallow reading GPU info?
>
> I can re-paste the Vulkan APIs requiring ENODEV:
> "
>
> The Vulkan APIs listed below could return VK_ERROR_DEVICE_LOST 
> according to the spec.
>
> I tries to provide a list of u/k interfaces that could be called for 
> each vk API.
>
>
> Well those are the Vulkan requirements, but that doesn't necessary 
> mean we must follow that on the kernel side. Keep in mind that Vulkan 
> can't made any requirements towards the kernel driver.
>
> IIRC we already have a query Vulkan can use to figure out if a GPU 
> reset happened or not. So they could use that instead.
>
> Regards,
> Christian.
>
>
>
> vkCreateDevice
>
> -          amdgpu_device_initialize.
>
> -          amdgpu_query_gpu_info
>
>
>
> vkQueueSubmit
>
> -          amdgpu_cs_submit
>
>
>
> vkWaitForFences
>
>                 amdgpu_cs_wait_fences
>
>
>
> vkGetEventStatus
>
> vkQueueWaitIdle
>
> vkDeviceWaitIdle
>
> vkGetQueryPoolResults
>
>                 amdgpu_cs_query_Fence_status
>
>
>
> vkQueueBindSparse
>
>                 amdgpu_bo_va_op
>
>                 amdgpu_bo_va_op_raw
>
>
>
> vkCreateSwapchainKHR
>
> vkAcquireNextImageKHR
>
> vkQueuePresentKHR
>
>                 Not related with u/k interface.
>
>
>
> Besides those listed above, I think 
> amdgpu_cs_signal_Sem/amdgpu_cs_wait_sem
> should respond to gpu reset as well."
>
>
> Christian.
>
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>