Yes, agree, it's easy to extend it as your like. For innocent context, which would need us to find out the context of guilty job, which is TODO. Regards, David Zhou -----Original Message----- From: Marek Olšák [mailto:maraeo@xxxxxxxxx] Sent: Wednesday, May 17, 2017 5:50 PM To: Christian König <deathsimple at vodafone.de> Cc: Zhou, David(ChunMing) <David1.Zhou at amd.com>; Michel Dänzer <michel at daenzer.net>; amd-gfx mailing list <amd-gfx at lists.freedesktop.org> Subject: Re: [PATCH 4/4] drm/amdgpu: reset fpriv vram_lost_counter David, We already have a query that returns whether a device is lost. It's called amdgpu_cs_query_reset_state. That should return whether a hang was caused by a certain context or whether the hang happened but the context is innocent. You can extend it to accept no context, in which case it will return either NO_RESET (everything is OK) or UNKNOWN_RESET (= when a hang happened but the caller didn't specify the context). Marek On Wed, May 17, 2017 at 10:56 AM, Christian König <deathsimple at vodafone.de> wrote: > Am 17.05.2017 um 10:46 schrieb zhoucm1: > > > > On 2017å¹´05æ??17æ?¥ 16:40, Christian König wrote: > > Am 17.05.2017 um 10:01 schrieb Michel Dänzer: > > On 17/05/17 04:13 PM, zhoucm1 wrote: > > On 2017å¹´05æ??17æ?¥ 14:57, Michel Dänzer wrote: > > On 17/05/17 01:28 PM, zhoucm1 wrote: > > On 2017å¹´05æ??17æ?¥ 11:15, Michel Dänzer wrote: > > On 17/05/17 12:04 PM, zhoucm1 wrote: > > On 2017å¹´05æ??17æ?¥ 09:18, Michel Dänzer wrote: > > On 16/05/17 06:25 PM, Chunming Zhou wrote: > > Change-Id: I8eb6d7f558da05510e429d3bf1d48c8cec6c1977 > Signed-off-by: Chunming Zhou <David1.Zhou at amd.com> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > index bca1fb5..f3e7525 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > @@ -2547,6 +2547,9 @@ int amdgpu_vm_ioctl(struct drm_device *dev, void > *data, struct drm_file *filp) > case AMDGPU_VM_OP_UNRESERVE_VMID: > amdgpu_vm_free_reserved_vmid(adev, &fpriv->vm, > AMDGPU_GFXHUB); > break; > + case AMDGPU_VM_OP_RESET: > + fpriv->vram_lost_counter = > atomic_read(&adev->vram_lost_counter); > + break; > > How do you envision the UMDs using this? I can mostly think of them > calling this ioctl when a context is created or destroyed. But that > would also allow any other remaining contexts using the same DRM file > descriptor to use all ioctls again. So, I think there needs to be a > vram_lost_counter in struct amdgpu_ctx instead of in struct > amdgpu_fpriv. > > struct amdgpu_fpriv for vram_lost_counter is proper place, especially > for ioctl return value. > if you need to reset ctx one by one, we can mark all contexts of that > vm, and then reset by userspace. > > I'm not following. With vram_lost_counter in amdgpu_fpriv, if any > context calls this ioctl, all other contexts using the same file > descriptor will also be considered safe again, right? > > Yes, but it really depends on userspace requirement, if you need to > reset ctx one by one, we can mark all contexts of that vm to guilty, > and then reset one context by userspace. > > Still not sure what you mean by that. > > E.g. what do you mean by "guilty"? I thought that refers to the > context which caused a hang. But it seems like you're using it to > refer to any context which hasn't reacted yet to VRAM contents being lost. > > When vram is lost, we treat all contexts need to reset. > > Essentially, your patches only track VRAM contents being lost per file > descriptor, not per context. I'm not sure (rather skeptical) that this > is suitable for OpenGL UMDs, since state is usually tracked per context. > Marek / Nicolai? > > > Oh, yeah that's a good point. > > The problem with tracking it per context is that Vulkan also wants the > ENODEV on the amdgpu_gem_va_ioct() and amdgpu_info_ioctl() which are > context less. > > But thinking more about this blocking those two doesn't make much > sense. The VM content can be restored and why should be disallow reading GPU info? > > I can re-paste the Vulkan APIs requiring ENODEV: > " > > The Vulkan APIs listed below could return VK_ERROR_DEVICE_LOST > according to the spec. > > I tries to provide a list of u/k interfaces that could be called for > each vk API. > > > Well those are the Vulkan requirements, but that doesn't necessary > mean we must follow that on the kernel side. Keep in mind that Vulkan > can't made any requirements towards the kernel driver. > > IIRC we already have a query Vulkan can use to figure out if a GPU > reset happened or not. So they could use that instead. > > Regards, > Christian. > > > > vkCreateDevice > > - amdgpu_device_initialize. > > - amdgpu_query_gpu_info > > > > vkQueueSubmit > > - amdgpu_cs_submit > > > > vkWaitForFences > > amdgpu_cs_wait_fences > > > > vkGetEventStatus > > vkQueueWaitIdle > > vkDeviceWaitIdle > > vkGetQueryPoolResults > > amdgpu_cs_query_Fence_status > > > > vkQueueBindSparse > > amdgpu_bo_va_op > > amdgpu_bo_va_op_raw > > > > vkCreateSwapchainKHR > > vkAcquireNextImageKHR > > vkQueuePresentKHR > > Not related with u/k interface. > > > > Besides those listed above, I think > amdgpu_cs_signal_Sem/amdgpu_cs_wait_sem > should respond to gpu reset as well." > > > Christian. > > > > > _______________________________________________ > amd-gfx mailing list > amd-gfx at lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx > > > > _______________________________________________ > amd-gfx mailing list > amd-gfx at lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx >