Re: [PATCH] drm/amdkfd: Change WARN to pr_debug when same userptr BOs got invalidated by mmu.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2023-04-10 10:36, Xiaogang.Chen wrote:
From: Xiaogang Chen <xiaogang.chen@xxxxxxx>

During KFD restore evicted userptr BOs mmu invalidate callback may invalidate
same userptr BOs that have been just restored. When KFD restore process detects
it KFD will reschedule another validation process. It is not an error. Change
WARN to pr_debug, not put the BOs at userptr_valid_list, let next scheduled
delayed work validate them again.

The problem is not, that a concurrent MMU notifier invalidated the pages. The problem is, that the sequence number and the mem->inval flag disagree on this. In theory, both the sequence number and the mem->inval flag are updated by amdgpu_amdkfd_evict_userptr in the same critical section.

When we validate the BO, we set mem->valid to true. If mem->valid gets set back to false later, the sequence number should also be updated so that amdgpu_ttm_tt_get_user_pages_done should return false. So mem->valid and the sequence number should agree on whether the memory is valid. However, these WARNs indicate that there is a mismatch. If that happens, it means something went wrong. Some of the code's assumptions are being violated and this justifies a WARN.

I think you mentioned that you only see the warnings with the DKMS driver. I suspect this is happening on some old get_user_pages code path, not the current hmm_range_fault-based one. I would recommend looking into this problem on the DKMS branch and addressing the problem there. This should not be fixed but removing legitimate WARNs on the upstream branch.

Regards,
  Felix



Signed-off-by: Xiaogang Chen <Xiaogang.Chen@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 13 ++++++++++---
  1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 7b1f5933ebaa..d0c224703278 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -2581,11 +2581,18 @@ static int confirm_valid_user_pages_locked(struct amdkfd_process_info *process_i
mem->range = NULL;
  		if (!valid) {
-			WARN(!mem->invalid, "Invalid BO not marked invalid");
+			if (!mem->invalid)
+				pr_debug("Invalid BO not marked invalid\n");
+
+			ret = -EAGAIN;
+			continue;
+		}
+
+		if (mem->invalid) {
+			pr_debug("Valid BO is marked invalid\n");
  			ret = -EAGAIN;
  			continue;
  		}
-		WARN(mem->invalid, "Valid BO is marked invalid");
list_move_tail(&mem->validate_list.head,
  			       &process_info->userptr_valid_list);
@@ -2648,7 +2655,7 @@ static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
  		goto unlock_notifier_out;
if (confirm_valid_user_pages_locked(process_info)) {
-		WARN(1, "User pages unexpectedly invalid");
+		pr_debug("User pages unexpectedly invalid, reschedule another attempt\n");
  		goto unlock_notifier_out;
  	}



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux