[PATCH 1/4] drm/amdgpu:don't invoke srio-gpu-reset in gpu-reset

deathsimple@xxxxxxxxxxx (Christian König) · Mon, 8 May 2017 11:50:38 +0200

> Because we can always rely on TDR and HYPERVISOR to detect GPU hang and resubmit malicious jobs or even kick them out later,
> and the gpu reset will eventually be invoked, so there is no reason to manually and voluntarily call gpu reset under SRIOV case.
Well there is a rather good reason, we detect that something is wrong 
much faster than waiting for the timeout.

But I agree that it was broken before as well and we can fix that later. 
Please add a code comment that this needs more work.

With that fixed feel free to add my rb on it.

Christian.

Am 08.05.2017 um 11:42 schrieb Liu, Monk:
> The VM fault interrupt or illegal instruction  will be delivered to GPU no matter it's SR-IOV or bare-metal case,
> And I removed them from invoking GPU reset is due to the same reason:
> Don't trigger gpu reset for sriov case if possible, always beware that trigger GPU reset under SR-IOV is a heavy cost (need take full access mode on GPU, so all
> Other VFs will be paused for a while)
>
> Because we can always rely on TDR and HYPERVISOR to detect GPU hang and resubmit malicious jobs or even kick them out later,
> and the gpu reset will eventually be invoked, so there is no reason to manually and voluntarily call gpu reset under SRIOV case.
>
> BR Monk
>
>
> -----Original Message-----
> From: Christian KÃ¶nig [mailto:deathsimple at vodafone.de]
> Sent: Monday, May 08, 2017 5:34 PM
> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 1/4] drm/amdgpu:don't invoke srio-gpu-reset in gpu-reset
>
> Sounds good, but what do we do with the amdgpu_irq_reset_work_func?
>
> Please note that I find that calling amdgpu_gpu_reset() here is a bad idea in the first place.
>
> Instead we should consider the scheduler as faulting and let the scheduler handle that as in the same way as a job timeout.
>
> But I'm not sure if those interrupts are actually send under SRIOV or if the hypervisor handles them somehow.
>
> Christian.
>
> Am 08.05.2017 um 11:24 schrieb Liu, Monk:
>> I agree with disabling debugfs for amdgpu_reset when SRIOV detected.
>>
>> -----Original Message-----
>> From: Christian KÃ¶nig [mailto:deathsimple at vodafone.de]
>> Sent: Monday, May 08, 2017 5:20 PM
>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 1/4] drm/amdgpu:don't invoke srio-gpu-reset in
>> gpu-reset
>>
>>> You know that gpu reset under SR-IOV will have very big impact on all other VFs ...
>> Mhm, good argument. But in this case we need to give at least some warning message instead of doing nothing.
>>
>> Or even better disable creating the amdgpu_reste debugfs file altogether. This way nobody will wonder why using it doesn't trigger anything.
>>
>> Christian.
>>
>> Am 08.05.2017 um 11:10 schrieb Liu, Monk:
>>> For SR-IOV use case, we call gpu reset under the case we have no choice ...
>>>
>>> So many places like debug fs shouldn't a good reason to trigger gpu
>>> reset
>>>
>>> You know that gpu reset under SR-IOV will have very big impact on all other VFs ...
>>>
>>> BR Monk
>>>
>>> -----Original Message-----
>>> From: Christian KÃ¶nig [mailto:deathsimple at vodafone.de]
>>> Sent: Monday, May 08, 2017 5:08 PM
>>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH 1/4] drm/amdgpu:don't invoke srio-gpu-reset in
>>> gpu-reset
>>>
>>> Am 08.05.2017 um 08:51 schrieb Monk Liu:
>>>> because we don't want to do sriov-gpu-reset under certain cases, so
>>>> just split those two funtion and don't invoke sr-iov one from
>>>> bare-metal one.
>>>>
>>>> Change-Id: I641126c241e2ee2dfd54e6d16c389b159f99cfe0
>>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 3 ++-
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    | 2 +-
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    | 3 ++-
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 6 +++++-
>>>>      5 files changed, 10 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 45a60a6..4985a7e 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -2652,9 +2652,6 @@ int amdgpu_gpu_reset(struct amdgpu_device *adev)
>>>>      	int resched;
>>>>      	bool need_full_reset;
>>>>      
>>>> -	if (amdgpu_sriov_vf(adev))
>>>> -		return amdgpu_sriov_gpu_reset(adev, true);
>>>> -
>>>>      	if (!amdgpu_check_soft_reset(adev)) {
>>>>      		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
>>>>      		return 0;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index 5772ef2..d7523d1 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -651,7 +651,8 @@ static int amdgpu_debugfs_gpu_reset(struct seq_file *m, void *data)
>>>>      	struct amdgpu_device *adev = dev->dev_private;
>>>>      
>>>>      	seq_printf(m, "gpu reset\n");
>>>> -	amdgpu_gpu_reset(adev);
>>>> +	if (!amdgpu_sriov_vf(adev))
>>>> +		amdgpu_gpu_reset(adev);
>>> Well that is clearly not a good idea. Why do you want to disable the reset here?
>>>
>>>>      
>>>>      	return 0;
>>>>      }
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> index 67be795..5bcbea0 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> @@ -221,7 +221,7 @@ void amdgpu_gem_object_close(struct
>>>> drm_gem_object *obj,
>>>>      
>>>>      static int amdgpu_gem_handle_lockup(struct amdgpu_device *adev, int r)
>>>>      {
>>>> -	if (r == -EDEADLK) {
>>>> +	if (r == -EDEADLK && !amdgpu_sriov_vf(adev)) {
>>> Not a problem of your patch, but that stuff is outdated and should have been removed completely years ago. Going to take care of that.
>>>
>>>>      		r = amdgpu_gpu_reset(adev);
>>>>      		if (!r)
>>>>      			r = -EAGAIN;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>>> index f8a6c95..49c6e6e 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>>> @@ -89,7 +89,8 @@ static void amdgpu_irq_reset_work_func(struct work_struct *work)
>>>>      	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
>>>>      						  reset_work);
>>>>      
>>>> -	amdgpu_gpu_reset(adev);
>>>> +	if (!amdgpu_sriov_vf(adev))
>>>> +		amdgpu_gpu_reset(adev);
>>> Mhm, that disables the reset on an invalid register access or invalid command stream. Is that really what we want?
>>>
>>> Christian.
>>>
>>>>      }
>>>>      
>>>>      /* Disable *all* interrupts */
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 690ef3d..c7718af 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -36,7 +36,11 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>>>>      		  job->base.sched->name,
>>>>      		  atomic_read(&job->ring->fence_drv.last_seq),
>>>>      		  job->ring->fence_drv.sync_seq);
>>>> -	amdgpu_gpu_reset(job->adev);
>>>> +
>>>> +	if (amdgpu_sriov_vf(job->adev))
>>>> +		amdgpu_sriov_gpu_reset(job->adev, true);
>>>> +	else
>>>> +		amdgpu_gpu_reset(job->adev);
>>>>      }
>>>>      
>>>>      int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned
>>>> num_ibs,
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx