[PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted

nhaehnle@xxxxxxxxx (Nicolai Hähnle) · Tue, 10 Oct 2017 10:57:55 +0200

On 10.10.2017 10:21, Liu, Monk wrote:
>> As Nicolai described before:
> 
> The kernel will reject all command submission from contexts which where created before the VRAM lost happened.
> 
> ML: this is similar with what my strict mode reset do â?¹, except that my logic checks if the FD is opened before gpu reset, but Nicolai checks if the context created before VRAM lost,
> But comparing context creating timing with VRAM lost timing is not accurate enough like I said before:
> Even you have context created after VRAM lost, that doesn't mean you can allow this context submitting jobs, because some BO in the BO_LIST of passed in with this context may modified before GPU reset/VRAM lost,
> So the safest way is comparing the timing of FD opening
> 
> We expose the VRAM lost counter to userspace. When Mesa sees that a command submission is rejected it will query VRAM lost counter and declare all resources which where in VRAM at this moment as invalidated.
> E.g. shader binaries, texture descriptors etc.. will be reuploaded when they are used the next time.
> 
> The application needs to recreate it's GL context, just in the same way as it would if we found this context guilty of causing a reset.

Yes, for most applications. But this is *entirely* unrelated to 
re-opening the FD.

With OpenGL robustness contexts, what happens is that

1. Driver & application detect "context lost"
2. Application destroys the OpenGL context
--> driver destroys the kernel context and all associated buffer objects
3. Application creates a new OpenGL context
--> driver creates a new kernel context and new buffer objects

In this sequence, the content of buffer objects created before the VRAM 
lost is irrelevant because they will all be destroyed, and new command 
submissions will only use "fresh" buffer objects.

(Or, in the case of Mesa, it's actually possible that we re-use buffer 
objects from before VRAM lost due to the caching we do for performance, 
but the contents of those buffer objects will have been completely 
re-initialized.)

The FD simply isn't the right unit of granularity to track this.

Cheers,
Nicolai

> 
> 
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2017å¹´10æ??10æ?¥ 15:26
> To: Liu, Monk <Monk.Liu at amd.com>; Nicolai HÃ¤hnle <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org; Daenzer, Michel <Michel.Daenzer at amd.com>
> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
> 
> As Nicolai described before:
> 
> The kernel will reject all command submission from contexts which where created before the VRAM lost happened.
> 
> We expose the VRAM lost counter to userspace. When Mesa sees that a command submission is rejected it will query VRAM lost counter and declare all resources which where in VRAM at this moment as invalidated.
> E.g. shader binaries, texture descriptors etc.. will be reuploaded when they are used the next time.
> 
> The application needs to recreate it's GL context, just in the same way as it would if we found this context guilty of causing a reset.
> 
> You should be able to handle this the same way in Vulkan and I think we can expose the GPU reset counter to userspace as well. This way you can implement the strict mode in userspace and don't need to affect all applications with that. In other words the effect will be limited to the Vulkan stack.
> 
> Regards,
> Christian.
> 
> Am 10.10.2017 um 09:12 schrieb Liu, Monk:
>> Then the question is how we treat recovery if VRAM lost ?
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: 2017å¹´10æ??10æ?¥ 14:59
>> To: Liu, Monk <Monk.Liu at amd.com>; Nicolai HÃ¤hnle <nhaehnle at gmail.com>;
>> amd-gfx at lists.freedesktop.org; Daenzer, Michel
>> <Michel.Daenzer at amd.com>
>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu
>> reseted
>>
>> As Nicolai explained that approach simply won't work.
>>
>> The fd is used by more than just the closed source Vulkan driver and I think even by some components not developed by AMD (common X code?
>> Michel please comment as well).
>>
>> So closing it and reopening it to handle a GPU reset is simply not an option.
>>
>> Regards,
>> Christian.
>>
>> Am 10.10.2017 um 06:26 schrieb Liu, Monk:
>>> After VRAM lost happens, all clients no matter radv/mesa/ogl is
>>> useless,
>>>
>>> Any drivers uses this FD should be denied by KMD after VRAM lost, and
>>> UMD can destroy/close this FD and re-open it and rebuild all
>>> resources
>>>
>>> That's the only option for VRAM lost case
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Nicolai HÃ¤hnle [mailto:nhaehnle at gmail.com]
>>> Sent: 2017å¹´10æ??9æ?¥ 19:01
>>> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian
>>> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu
>>> reseted
>>>
>>> On 09.10.2017 10:35, Liu, Monk wrote:
>>>> Please be aware that this policy is what the strict mode defined and
>>>> what customer want, And also please check VK spec, it defines that
>>>> after GPU reset all vk INSTANCE should close/release its
>>>> resource/device/ctx and all buffers, and call re-initvkinstance
>>>> after gpu reset
>>> Sorry, but you simply cannot implement a correct user-space implementation of those specs on top of this.
>>>
>>> It will break as soon as you have both OpenGL and Vulkan running in the same process (or heck, our Vulkan and radv :)), because both drivers will use the same fd.
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>
>>>
>>>> So this whole approach is what just aligned with the spec, and to
>>>> not influence with current MESA/OGL client that's why I put the
>>>> whole approach into the strict mode And by default strict mode is
>>>> not selected
>>>>
>>>>
>>>> BR Monk
>>>>
>>>> -----Original Message-----
>>>> From: Christian KÃ¶nig [mailto:ckoenig.leichtzumerken at gmail.com]
>>>> Sent: 2017å¹´10æ??9æ?¥ 16:26
>>>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>>>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu
>>>> reseted
>>>>
>>>> Am 30.09.2017 um 08:03 schrieb Monk Liu:
>>>>> for SRIOV strict mode gpu reset:
>>>>>
>>>>> In kms open we mark the latest adev->gpu_reset_counter in fpriv we
>>>>> return -ENODEV in cs_ioctl or info_ioctl if they found
>>>>> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>>>>>
>>>>> this way we prevent a potential bad process/FD from submitting cmds
>>>>> and notify userspace with -ENODEV.
>>>>>
>>>>> userspace should close all BO/ctx and re-open dri FD to re-create
>>>>> virtual memory system for this process
>>>> The whole aproach is a NAK from my side.
>>>>
>>>> We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
>>>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>>>>> ---
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 5 +++++
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 +++++++
>>>>>       3 files changed, 13 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> index de9c164..b40d4ba 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> @@ -772,6 +772,7 @@ struct amdgpu_fpriv {
>>>>>       	struct idr		bo_list_handles;
>>>>>       	struct amdgpu_ctx_mgr	ctx_mgr;
>>>>>       	u32			vram_lost_counter;
>>>>> +	int gpu_reset_counter;
>>>>>       };
>>>>>       
>>>>>       /*
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> index 9467cf6..6a1515e 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> @@ -1199,6 +1199,11 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>>>>>       	if (amdgpu_kms_vram_lost(adev, fpriv))
>>>>>       		return -ENODEV;
>>>>>       
>>>>> +	if (amdgpu_sriov_vf(adev) &&
>>>>> +		amdgpu_sriov_reset_level == 1 &&
>>>>> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>>>>> +		return -ENODEV;
>>>>> +
>>>>>       	parser.adev = adev;
>>>>>       	parser.filp = filp;
>>>>>       
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> index 282f45b..bd389cf 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> @@ -285,6 +285,11 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
>>>>>       	if (amdgpu_kms_vram_lost(adev, fpriv))
>>>>>       		return -ENODEV;
>>>>>       
>>>>> +	if (amdgpu_sriov_vf(adev) &&
>>>>> +		amdgpu_sriov_reset_level == 1 &&
>>>>> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>>>>> +		return -ENODEV;
>>>>> +
>>>>>       	switch (info->query) {
>>>>>       	case AMDGPU_INFO_ACCEL_WORKING:
>>>>>       		ui32 = adev->accel_working;
>>>>> @@ -824,6 +829,8 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>>>>       		goto out_suspend;
>>>>>       	}
>>>>>       
>>>>> +	fpriv->gpu_reset_counter = atomic_read(&adev->gpu_reset_counter);
>>>>> +
>>>>>       	r = amdgpu_vm_init(adev, &fpriv->vm,
>>>>>       			   AMDGPU_VM_CONTEXT_GFX, 0);
>>>>>       	if (r) {
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>> --
>>> Lerne, wie die Welt wirklich ist,
>>> Aber vergiss niemals, wie sie sein sollte.
>>
> 

-- 
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.