Then the question is how we treat recovery if VRAM lost ? -----Original Message----- From: Koenig, Christian Sent: 2017å¹´10æ??10æ?¥ 14:59 To: Liu, Monk <Monk.Liu at amd.com>; Nicolai Hähnle <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org; Daenzer, Michel <Michel.Daenzer at amd.com> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted As Nicolai explained that approach simply won't work. The fd is used by more than just the closed source Vulkan driver and I think even by some components not developed by AMD (common X code? Michel please comment as well). So closing it and reopening it to handle a GPU reset is simply not an option. Regards, Christian. Am 10.10.2017 um 06:26 schrieb Liu, Monk: > After VRAM lost happens, all clients no matter radv/mesa/ogl is > useless, > > Any drivers uses this FD should be denied by KMD after VRAM lost, and > UMD can destroy/close this FD and re-open it and rebuild all resources > > That's the only option for VRAM lost case > > > > -----Original Message----- > From: Nicolai Hähnle [mailto:nhaehnle at gmail.com] > Sent: 2017å¹´10æ??9æ?¥ 19:01 > To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian > <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org > Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu > reseted > > On 09.10.2017 10:35, Liu, Monk wrote: >> Please be aware that this policy is what the strict mode defined and >> what customer want, And also please check VK spec, it defines that >> after GPU reset all vk INSTANCE should close/release its >> resource/device/ctx and all buffers, and call re-initvkinstance after >> gpu reset > Sorry, but you simply cannot implement a correct user-space implementation of those specs on top of this. > > It will break as soon as you have both OpenGL and Vulkan running in the same process (or heck, our Vulkan and radv :)), because both drivers will use the same fd. > > Cheers, > Nicolai > > > >> So this whole approach is what just aligned with the spec, and to not >> influence with current MESA/OGL client that's why I put the whole >> approach into the strict mode And by default strict mode is not >> selected >> >> >> BR Monk >> >> -----Original Message----- >> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com] >> Sent: 2017å¹´10æ??9æ?¥ 16:26 >> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org >> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu >> reseted >> >> Am 30.09.2017 um 08:03 schrieb Monk Liu: >>> for SRIOV strict mode gpu reset: >>> >>> In kms open we mark the latest adev->gpu_reset_counter in fpriv we >>> return -ENODEV in cs_ioctl or info_ioctl if they found >>> fpriv->gpu_reset_counter != adev->gpu_reset_counter. >>> >>> this way we prevent a potential bad process/FD from submitting cmds >>> and notify userspace with -ENODEV. >>> >>> userspace should close all BO/ctx and re-open dri FD to re-create >>> virtual memory system for this process >> The whole aproach is a NAK from my side. >> >> We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place. >> >> Regards, >> Christian. >> >>> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4 >>> Signed-off-by: Monk Liu <Monk.Liu at amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + >>> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 5 +++++ >>> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 +++++++ >>> 3 files changed, 13 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>> index de9c164..b40d4ba 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>> @@ -772,6 +772,7 @@ struct amdgpu_fpriv { >>> struct idr bo_list_handles; >>> struct amdgpu_ctx_mgr ctx_mgr; >>> u32 vram_lost_counter; >>> + int gpu_reset_counter; >>> }; >>> >>> /* >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>> index 9467cf6..6a1515e 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>> @@ -1199,6 +1199,11 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) >>> if (amdgpu_kms_vram_lost(adev, fpriv)) >>> return -ENODEV; >>> >>> + if (amdgpu_sriov_vf(adev) && >>> + amdgpu_sriov_reset_level == 1 && >>> + fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter)) >>> + return -ENODEV; >>> + >>> parser.adev = adev; >>> parser.filp = filp; >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>> index 282f45b..bd389cf 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>> @@ -285,6 +285,11 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file >>> if (amdgpu_kms_vram_lost(adev, fpriv)) >>> return -ENODEV; >>> >>> + if (amdgpu_sriov_vf(adev) && >>> + amdgpu_sriov_reset_level == 1 && >>> + fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter)) >>> + return -ENODEV; >>> + >>> switch (info->query) { >>> case AMDGPU_INFO_ACCEL_WORKING: >>> ui32 = adev->accel_working; >>> @@ -824,6 +829,8 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv) >>> goto out_suspend; >>> } >>> >>> + fpriv->gpu_reset_counter = atomic_read(&adev->gpu_reset_counter); >>> + >>> r = amdgpu_vm_init(adev, &fpriv->vm, >>> AMDGPU_VM_CONTEXT_GFX, 0); >>> if (r) { >> >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx at lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> > > -- > Lerne, wie die Welt wirklich ist, > Aber vergiss niemals, wie sie sein sollte.