RE: [PATCH v2 6/6] drm/amdgpu: reduce reset time

"Lazar, Lijo" <Lijo.Lazar@xxxxxxx> · Fri, 29 Jul 2022 09:39:48 +0000

[AMD Official Use Only - General]

XGMI-connected GPUs share the same reset domain software object. If the devices are not connected to each other, there is one domain per device.
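
To illustrate the point, here is a minimal standalone sketch in C. It is not driver code: the model_* names are hypothetical, and in_reset is only an assumed stand-in for the state that amdgpu_in_reset() checks. It models two devices attached to one shared domain and a third with its own, then evaluates the skip condition from the patch on each.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; only the fields
 * discussed in this thread are modelled. */
struct model_reset_domain {
	bool in_reset;	/* assumed stand-in for the in_gpu_reset state */
	bool hang;	/* the flag added by this patch */
};

struct model_device {
	struct model_reset_domain *reset_domain;
};

/* Attach a device to a reset domain: shared across an XGMI hive,
 * or private to a standalone device. */
static void model_attach(struct model_device *dev,
			 struct model_reset_domain *domain)
{
	assert(dev != NULL && domain != NULL);
	dev->reset_domain = domain;
}

/* Same shape as the condition the patch adds around the ring test. */
static bool model_skip_ring_test(const struct model_device *dev)
{
	return dev->reset_domain->in_reset && dev->reset_domain->hang;
}

/* Models a job timeout on one device that triggers a domain reset. */
static void model_job_timeout(struct model_device *dev)
{
	dev->reset_domain->hang = true;
	dev->reset_domain->in_reset = true;
}
```

If a job times out on one of the two devices sharing a domain, model_skip_ring_test() becomes true for both of them, while the device with its own domain is unaffected.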

Thanks,
Lijo

-----Original Message-----
From: Zhao, Victor <Victor.Zhao@xxxxxxx> 
Sent: Friday, July 29, 2022 2:58 PM
To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deng, Emily <Emily.Deng@xxxxxxx>; Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Subject: RE: [PATCH v2 6/6] drm/amdgpu: reduce reset time

Hi Lijo,

Can you be more specific about why other GPUs can be affected? I don’t have an XGMI system on my side. I thought reset_domain was per device.

Thanks,
Victor

-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> 
Sent: Friday, July 29, 2022 2:11 PM
To: Zhao, Victor <Victor.Zhao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deng, Emily <Emily.Deng@xxxxxxx>; Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Subject: Re: [PATCH v2 6/6] drm/amdgpu: reduce reset time

On 7/28/2022 4:00 PM, Victor Zhao wrote:
> In the multi-container use case, reset time is important, so skip the
> ring tests and the CP halt wait while suspending IPs for reset: both
> are going to fail anyway and only add to the reset time.
> 
> v2: add a hang flag to indicate that the reset comes from a job timeout,
> and skip the ring test and CP halt wait in this case
> 
> Signed-off-by: Victor Zhao <Victor.Zhao@xxxxxxx>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c   |  3 ++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |  2 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  1 +
>   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c    | 11 +++++++++--
>   5 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index 222d3d7ea076..c735a17c6afb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -27,6 +27,7 @@
>   #include "amdgpu_gfx.h"
>   #include "amdgpu_rlc.h"
>   #include "amdgpu_ras.h"
> +#include "amdgpu_reset.h"
>   
>   /* delay 0.1 second to enable gfx off feature */
>   #define GFX_OFF_DELAY_ENABLE         msecs_to_jiffies(100)
> @@ -477,7 +478,7 @@ int amdgpu_gfx_disable_kcq(struct amdgpu_device *adev)
>   		kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.compute_ring[i],
>   					   RESET_QUEUES, 0, 0);
>   
> -	if (adev->gfx.kiq.ring.sched.ready)
> +	if (adev->gfx.kiq.ring.sched.ready && !(amdgpu_in_reset(adev) && adev->reset_domain->hang))
>   		r = amdgpu_ring_test_helper(kiq_ring);

On a system with multiple interconnected GPUs, this will also affect the other GPUs, on which the job was not actually running. I guess your use case here is device-specific.

Thanks,
Lijo

>   	spin_unlock(&adev->gfx.kiq.ring_lock);
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 6c3e7290153f..bb40880a557f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -49,6 +49,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   	}
>   
>   	memset(&ti, 0, sizeof(struct amdgpu_task_info));
> +	adev->reset_domain->hang = true;
>   
>   	if (amdgpu_gpu_recovery &&
>   	    amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
> @@ -83,6 +84,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   	}
>   
>   exit:
> +	adev->reset_domain->hang = false;
>   	drm_dev_exit(idx);
>   	return DRM_GPU_SCHED_STAT_NOMINAL;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> index 9da5ead50c90..b828fe773f50 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> @@ -155,6 +155,7 @@ struct amdgpu_reset_domain *amdgpu_reset_create_reset_domain(enum amdgpu_reset_d
>   	atomic_set(&reset_domain->in_gpu_reset, 0);
>   	atomic_set(&reset_domain->reset_res, 0);
>   	init_rwsem(&reset_domain->sem);
> +	reset_domain->hang = false;
>   
>   	return reset_domain;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> index cc4b2eeb24cf..29e324add552 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> @@ -84,6 +84,7 @@ struct amdgpu_reset_domain {
>   	struct rw_semaphore sem;
>   	atomic_t in_gpu_reset;
>   	atomic_t reset_res;
> +	bool hang;
>   };
>   
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index fafbad3cf08d..a384e04d916c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -29,6 +29,7 @@
>   #include "amdgpu.h"
>   #include "amdgpu_gfx.h"
>   #include "amdgpu_psp.h"
> +#include "amdgpu_reset.h"
>   #include "nv.h"
>   #include "nvd.h"
>   
> @@ -5971,6 +5972,9 @@ static int gfx_v10_0_cp_gfx_enable(struct amdgpu_device *adev, bool enable)
>   		WREG32_SOC15(GC, 0, mmCP_ME_CNTL, tmp);
>   	}
>   
> +	if ((amdgpu_in_reset(adev) && adev->reset_domain->hang) && !enable)
> +		return 0;
> +
>   	for (i = 0; i < adev->usec_timeout; i++) {
>   		if (RREG32_SOC15(GC, 0, mmCP_STAT) == 0)
>   			break;
> @@ -7569,8 +7573,10 @@ static int gfx_v10_0_kiq_disable_kgq(struct amdgpu_device *adev)
>   	for (i = 0; i < adev->gfx.num_gfx_rings; i++)
>   		kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.gfx_ring[i],
>   					   PREEMPT_QUEUES, 0, 0);
> -
> -	return amdgpu_ring_test_helper(kiq_ring);
> +	if (!(amdgpu_in_reset(adev) && adev->reset_domain->hang))
> +		return amdgpu_ring_test_helper(kiq_ring);
> +	else
> +		return 0;
>   }
>   #endif
>   
> @@ -7610,6 +7616,7 @@ static int gfx_v10_0_hw_fini(void *handle)
>   
>   		return 0;
>   	}
> +
>   	gfx_v10_0_cp_enable(adev, false);
>   	gfx_v10_0_enable_gui_idle_interrupt(adev, false);
>   
>