Re: [PATCH v5 5/6] drm/amdkfd: Increase KFD bo restore wait time



On 2024-04-23 11:28, Philip Yang wrote:
Allocating contiguous VRAM through TTM may take more than 1 second to evict
BOs for a larger RDMA buffer. Because the KFD restore BO worker reserves all
KFD BOs, TTM cannot take the locks of the remaining KFD BOs to evict them,
which causes TTM to fail to allocate contiguous VRAM.

Increase the KFD restore BO wait time to 2 seconds, long enough for the RDMA
pin BO path to allocate the contiguous VRAM.

Two seconds is a very long time for the GPU to sit idle whenever memory gets evicted. Maybe we need to look for a solution where the restore gets scheduled in response to a fence that signals when the migration completes.

With the most recent changes I made to the eviction fence handling, I think we can decouple the scheduling of the restore work from the evict work. So we could schedule the delayed restore worker in a fence callback set up in amdgpu_bo_move or somewhere around there, and keep a short delay that starts counting at the end of the eviction move blit.
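
To make the timing idea concrete, here is a rough userspace model of it, not kernel code. Every name in it (model_fence, model_restore_work, model_fence_add_callback and so on) is hypothetical and only stands in for the real dma_fence callback and delayed work machinery; the delay value used with it is likewise just an illustration. The point it shows is that the restore delay starts counting when the move fence signals, not when the eviction was requested.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; times are plain millisecond counters. */
struct model_fence;
typedef void (*model_fence_cb)(struct model_fence *f, void *data,
                               uint64_t now_ms);

/* Stands in for the fence of the eviction move blit. */
struct model_fence {
    bool signaled;
    model_fence_cb cb;
    void *cb_data;
};

/* Stands in for the delayed restore worker. */
struct model_restore_work {
    bool scheduled;
    uint64_t run_at_ms;   /* absolute time the restore would run */
    uint64_t delay_ms;    /* short delay, counted from fence signal */
};

/* Arm the restore so that it runs delay_ms after now_ms. */
void model_schedule_restore(struct model_restore_work *w, uint64_t now_ms)
{
    w->scheduled = true;
    w->run_at_ms = now_ms + w->delay_ms;
}

/* Fence callback: the move finished, so start the restore delay now. */
void model_restore_cb(struct model_fence *f, void *data, uint64_t now_ms)
{
    (void)f;
    model_schedule_restore(data, now_ms);
}

/*
 * Register a callback on the fence. Returns false if the fence has
 * already signaled, in which case the caller schedules the restore
 * directly (the real dma_fence_add_callback reports that case with
 * -ENOENT).
 */
bool model_fence_add_callback(struct model_fence *f, model_fence_cb cb,
                              void *data)
{
    if (f->signaled)
        return false;
    f->cb = cb;
    f->cb_data = data;
    return true;
}

/* Signal the fence at time now_ms and run the callback once, if any. */
void model_fence_signal(struct model_fence *f, uint64_t now_ms)
{
    if (f->signaled)
        return;
    f->signaled = true;
    if (f->cb)
        f->cb(f, f->cb_data, now_ms);
}
```

With a delay_ms of 100 and a move that completes 1500 ms after the eviction started, the model arms the restore for 1600 ms. A fixed timer started at eviction time would have to be at least as long as the slowest expected move to give the same guarantee.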


Signed-off-by: Philip Yang <Philip.Yang@xxxxxxx>
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index a81ef232fdef..c205e2d3acf9 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -698,7 +698,7 @@ struct qcm_process_device {
  /* KFD Memory Eviction */
  /* Approx. wait time before attempting to restore evicted BOs */
  /* Approx. back off time if restore fails due to lack of memory */
  /* Approx. time before evicting the process again */
