[PATCH 4/4] drm/amdgpu: reset fpriv vram_lost_counter


 



Yes, agreed, it's easy to extend it as you like.
For the innocent-context case, we would need to find out which context the guilty job belongs to, which is still a TODO.

Regards,
David Zhou

-----Original Message-----
From: Marek Olšák [mailto:maraeo@xxxxxxxxx] 
Sent: Wednesday, May 17, 2017 5:50 PM
To: Christian König <deathsimple at vodafone.de>
Cc: Zhou, David(ChunMing) <David1.Zhou at amd.com>; Michel Dänzer <michel at daenzer.net>; amd-gfx mailing list <amd-gfx at lists.freedesktop.org>
Subject: Re: [PATCH 4/4] drm/amdgpu: reset fpriv vram_lost_counter

David,

We already have a query that returns whether a device is lost. It's called amdgpu_cs_query_reset_state. That should return whether a hang was caused by a certain context or whether the hang happened but the context is innocent. You can extend it to accept no context, in which case it will return either NO_RESET (everything is OK) or UNKNOWN_RESET (= when a hang happened but the caller didn't specify the context).

Marek
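Marek's suggested extension could look roughly like this from userspace. This is a minimal sketch, not driver code: the status values mirror the AMDGPU_CTX_*_RESET constants from amdgpu_drm.h and the real entry point is amdgpu_cs_query_reset_state in libdrm, but query_reset_state() here, its no-context (NULL) calling convention, and the struct ctx bookkeeping are all assumptions made for illustration.

```c
/* Userspace sketch of the query Marek describes.  The enum values mirror
 * AMDGPU_CTX_*_RESET from amdgpu_drm.h; query_reset_state() is a stand-in
 * for amdgpu_cs_query_reset_state(), hypothetically extended to accept no
 * context (NULL). */
#include <assert.h>
#include <stddef.h>

enum reset_status {
	NO_RESET       = 0, /* everything is OK */
	GUILTY_RESET   = 1, /* this context caused the hang */
	INNOCENT_RESET = 2, /* a hang happened, but this context is innocent */
	UNKNOWN_RESET  = 3, /* a hang happened, but the caller gave no context */
};

struct ctx {
	int guilty; /* set when this context caused the hang */
};

/* Device-global count of completed GPU resets. */
static unsigned int gpu_reset_count;

static enum reset_status query_reset_state(const struct ctx *ctx)
{
	if (gpu_reset_count == 0)
		return NO_RESET;
	if (!ctx) /* no context given: we only know that a hang happened */
		return UNKNOWN_RESET;
	return ctx->guilty ? GUILTY_RESET : INNOCENT_RESET;
}
```

The point of the no-context variant is exactly what the mail says: a caller without a context can still distinguish "everything is OK" from "a hang happened".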

On Wed, May 17, 2017 at 10:56 AM, Christian König <deathsimple at vodafone.de> wrote:
> On 2017-05-17 10:46, zhoucm1 wrote:
>
>
>
> On 2017-05-17 16:40, Christian König wrote:
>
> On 2017-05-17 10:01, Michel Dänzer wrote:
>
> On 17/05/17 04:13 PM, zhoucm1 wrote:
>
> On 2017-05-17 14:57, Michel Dänzer wrote:
>
> On 17/05/17 01:28 PM, zhoucm1 wrote:
>
> On 2017-05-17 11:15, Michel Dänzer wrote:
>
> On 17/05/17 12:04 PM, zhoucm1 wrote:
>
> On 2017-05-17 09:18, Michel Dänzer wrote:
>
> On 16/05/17 06:25 PM, Chunming Zhou wrote:
>
> Change-Id: I8eb6d7f558da05510e429d3bf1d48c8cec6c1977
> Signed-off-by: Chunming Zhou <David1.Zhou at amd.com>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index bca1fb5..f3e7525 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2547,6 +2547,9 @@ int amdgpu_vm_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>  	case AMDGPU_VM_OP_UNRESERVE_VMID:
>  		amdgpu_vm_free_reserved_vmid(adev, &fpriv->vm, AMDGPU_GFXHUB);
>  		break;
> +	case AMDGPU_VM_OP_RESET:
> +		fpriv->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
> +		break;
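The mechanism behind the patch is a counter handshake: the device increments vram_lost_counter on every reset that loses VRAM contents, the file private data remembers the last value it acknowledged, and ioctls are refused with -ENODEV while the two differ. A minimal userspace sketch of that logic follows; the field names mirror the kernel structs, but the two helper functions are hypothetical stand-ins, not kernel code.

```c
/* Sketch of the vram_lost_counter handshake from the patch.  The
 * device-side counter advances on every VRAM-losing reset; the per-file
 * copy only catches up when userspace issues AMDGPU_VM_OP_RESET. */
#include <assert.h>
#include <errno.h>

struct device_state { unsigned int vram_lost_counter; };
struct fpriv_state  { unsigned int vram_lost_counter; };

/* The gate the kernel would apply to ioctls in between reset and ack. */
static int check_vram_lost(const struct device_state *adev,
			   const struct fpriv_state *fpriv)
{
	/* Counters differ: VRAM was lost since this file last acknowledged
	 * it, so the call is refused until userspace re-validates. */
	if (fpriv->vram_lost_counter != adev->vram_lost_counter)
		return -ENODEV;
	return 0;
}

/* What AMDGPU_VM_OP_RESET does in the patch: acknowledge the loss. */
static void ack_vram_lost(const struct device_state *adev,
			  struct fpriv_state *fpriv)
{
	fpriv->vram_lost_counter = adev->vram_lost_counter;
}
```

Michel's objection below follows directly from where the acknowledged copy lives: with one counter per fpriv, a single acknowledgement re-enables every context sharing that file descriptor.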
>
> How do you envision the UMDs using this? I can mostly think of them 
> calling this ioctl when a context is created or destroyed. But that 
> would also allow any other remaining contexts using the same DRM file 
> descriptor to use all ioctls again. So, I think there needs to be a 
> vram_lost_counter in struct amdgpu_ctx instead of in struct 
> amdgpu_fpriv.
>
> struct amdgpu_fpriv is the proper place for vram_lost_counter, especially
> for the ioctl return value.
> If you need to reset contexts one by one, we can mark all contexts of that
> VM, and then let userspace reset them.
>
> I'm not following. With vram_lost_counter in amdgpu_fpriv, if any 
> context calls this ioctl, all other contexts using the same file 
> descriptor will also be considered safe again, right?
>
> Yes, but it really depends on the userspace requirements. If you need to
> reset contexts one by one, we can mark all contexts of that VM as guilty,
> and then userspace can reset them one at a time.
>
> Still not sure what you mean by that.
>
> E.g. what do you mean by "guilty"? I thought that refers to the 
> context which caused a hang. But it seems like you're using it to 
> refer to any context which hasn't reacted yet to VRAM contents being lost.
>
> When VRAM is lost, we treat all contexts as needing a reset.
>
> Essentially, your patches only track VRAM contents being lost per file 
> descriptor, not per context. I'm not sure (rather skeptical) that this 
> is suitable for OpenGL UMDs, since state is usually tracked per context.
> Marek / Nicolai?
>
>
> Oh, yeah that's a good point.
>
> The problem with tracking it per context is that Vulkan also wants the
> ENODEV on amdgpu_gem_va_ioctl() and amdgpu_info_ioctl(), which are
> contextless.
>
> But thinking about it more, blocking those two doesn't make much sense.
> The VM content can be restored, and why should we disallow reading GPU info?
>
> I can re-paste the Vulkan APIs requiring ENODEV:
> "
>
> The Vulkan APIs listed below could return VK_ERROR_DEVICE_LOST 
> according to the spec.
>
> I tried to provide a list of u/k interfaces that could be called for
> each vk API.
>
>
> Well, those are the Vulkan requirements, but that doesn't necessarily
> mean we must follow them on the kernel side. Keep in mind that Vulkan
> can't make any requirements towards the kernel driver.
>
> IIRC we already have a query Vulkan can use to figure out if a GPU 
> reset happened or not. So they could use that instead.
>
> Regards,
> Christian.
>
>
>
> vkCreateDevice
>     - amdgpu_device_initialize
>     - amdgpu_query_gpu_info
>
> vkQueueSubmit
>     - amdgpu_cs_submit
>
> vkWaitForFences
>     - amdgpu_cs_wait_fences
>
> vkGetEventStatus, vkQueueWaitIdle, vkDeviceWaitIdle, vkGetQueryPoolResults
>     - amdgpu_cs_query_fence_status
>
> vkQueueBindSparse
>     - amdgpu_bo_va_op
>     - amdgpu_bo_va_op_raw
>
> vkCreateSwapchainKHR, vkAcquireNextImageKHR, vkQueuePresentKHR
>     - not related to a u/k interface
>
> Besides those listed above, I think amdgpu_cs_signal_sem/amdgpu_cs_wait_sem
> should respond to gpu reset as well."
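On the UMD side, the mapping this list implies is mechanical: when one of the listed u/k calls fails with -ENODEV, the driver reports VK_ERROR_DEVICE_LOST to the application. A hedged sketch: the VkResult values below match vulkan_core.h, but vk_result_from_errno() is an invented helper for illustration, not part of any real driver.

```c
/* Sketch of how a Vulkan UMD could fold a kernel ENODEV into the
 * VK_ERROR_DEVICE_LOST the spec requires.  The enum values match the
 * Vulkan headers; the helper itself is hypothetical. */
#include <assert.h>
#include <errno.h>

enum vk_result {
	VK_SUCCESS           = 0,
	VK_ERROR_DEVICE_LOST = -4,
	VK_ERROR_UNKNOWN     = -13,
};

/* Translate an errno-style return from a u/k interface
 * (amdgpu_cs_submit, amdgpu_bo_va_op, ...) into a VkResult. */
static enum vk_result vk_result_from_errno(int err)
{
	if (err == 0)
		return VK_SUCCESS;
	if (err == -ENODEV) /* device/VRAM lost: report device loss */
		return VK_ERROR_DEVICE_LOST;
	return VK_ERROR_UNKNOWN;
}
```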
>
>
> Christian.
>
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>

