[PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted

Monk.Liu@xxxxxxx (Liu, Monk) · Tue, 10 Oct 2017 07:12:55 +0000

Then the question is how we treat recovery if VRAM lost ?

-----Original Message-----
From: Koenig, Christian 
Sent: 2017å¹´10æ??10æ?¥ 14:59
To: Liu, Monk <Monk.Liu at amd.com>; Nicolai HÃ¤hnle <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org; Daenzer, Michel <Michel.Daenzer at amd.com>
Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted

As Nicolai explained that approach simply won't work.

The fd is used by more than just the closed source Vulkan driver and I think even by some components not developed by AMD (common X code? 
Michel please comment as well).

So closing it and reopening it to handle a GPU reset is simply not an option.

Regards,
Christian.

Am 10.10.2017 um 06:26 schrieb Liu, Monk:
> After VRAM lost happens, all clients no matter radv/mesa/ogl is 
> useless,
>
> Any drivers uses this FD should be denied by KMD after VRAM lost, and 
> UMD can destroy/close this FD and re-open it and rebuild all resources
>
> That's the only option for VRAM lost case
>
>
>
> -----Original Message-----
> From: Nicolai HÃ¤hnle [mailto:nhaehnle at gmail.com]
> Sent: 2017å¹´10æ??9æ?¥ 19:01
> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian 
> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu 
> reseted
>
> On 09.10.2017 10:35, Liu, Monk wrote:
>> Please be aware that this policy is what the strict mode defined and 
>> what customer want, And also please check VK spec, it defines that 
>> after GPU reset all vk INSTANCE should close/release its 
>> resource/device/ctx and all buffers, and call re-initvkinstance after 
>> gpu reset
> Sorry, but you simply cannot implement a correct user-space implementation of those specs on top of this.
>
> It will break as soon as you have both OpenGL and Vulkan running in the same process (or heck, our Vulkan and radv :)), because both drivers will use the same fd.
>
> Cheers,
> Nicolai
>
>
>
>> So this whole approach is what just aligned with the spec, and to not 
>> influence with current MESA/OGL client that's why I put the whole 
>> approach into the strict mode And by default strict mode is not 
>> selected
>>
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Christian KÃ¶nig [mailto:ckoenig.leichtzumerken at gmail.com]
>> Sent: 2017å¹´10æ??9æ?¥ 16:26
>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu 
>> reseted
>>
>> Am 30.09.2017 um 08:03 schrieb Monk Liu:
>>> for SRIOV strict mode gpu reset:
>>>
>>> In kms open we mark the latest adev->gpu_reset_counter in fpriv we 
>>> return -ENODEV in cs_ioctl or info_ioctl if they found
>>> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>>>
>>> this way we prevent a potential bad process/FD from submitting cmds 
>>> and notify userspace with -ENODEV.
>>>
>>> userspace should close all BO/ctx and re-open dri FD to re-create 
>>> virtual memory system for this process
>> The whole aproach is a NAK from my side.
>>
>> We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place.
>>
>> Regards,
>> Christian.
>>
>>> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 5 +++++
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 +++++++
>>>     3 files changed, 13 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index de9c164..b40d4ba 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -772,6 +772,7 @@ struct amdgpu_fpriv {
>>>     	struct idr		bo_list_handles;
>>>     	struct amdgpu_ctx_mgr	ctx_mgr;
>>>     	u32			vram_lost_counter;
>>> +	int gpu_reset_counter;
>>>     };
>>>     
>>>     /*
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> index 9467cf6..6a1515e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> @@ -1199,6 +1199,11 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>>>     	if (amdgpu_kms_vram_lost(adev, fpriv))
>>>     		return -ENODEV;
>>>     
>>> +	if (amdgpu_sriov_vf(adev) &&
>>> +		amdgpu_sriov_reset_level == 1 &&
>>> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>>> +		return -ENODEV;
>>> +
>>>     	parser.adev = adev;
>>>     	parser.filp = filp;
>>>     
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> index 282f45b..bd389cf 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> @@ -285,6 +285,11 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
>>>     	if (amdgpu_kms_vram_lost(adev, fpriv))
>>>     		return -ENODEV;
>>>     
>>> +	if (amdgpu_sriov_vf(adev) &&
>>> +		amdgpu_sriov_reset_level == 1 &&
>>> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>>> +		return -ENODEV;
>>> +
>>>     	switch (info->query) {
>>>     	case AMDGPU_INFO_ACCEL_WORKING:
>>>     		ui32 = adev->accel_working;
>>> @@ -824,6 +829,8 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>>     		goto out_suspend;
>>>     	}
>>>     
>>> +	fpriv->gpu_reset_counter = atomic_read(&adev->gpu_reset_counter);
>>> +
>>>     	r = amdgpu_vm_init(adev, &fpriv->vm,
>>>     			   AMDGPU_VM_CONTEXT_GFX, 0);
>>>     	if (r) {
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>
> --
> Lerne, wie die Welt wirklich ist,
> Aber vergiss niemals, wie sie sein sollte.