Re: [PATCH] drm/amdkfd: Change WARN to pr_debug when same userptr BOs got invalidated by mmu.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 4/10/2023 2:58 PM, Felix Kuehling wrote:
On 2023-04-10 10:36, Xiaogang.Chen wrote:
From: Xiaogang Chen <xiaogang.chen@xxxxxxx>

During KFD restore evicted userptr BOs mmu invalidate callback may invalidate same userptr BOs that have been just restored. When KFD restore process detects it KFD will reschedule another validation process. It is not an error. Change WARN to pr_debug, not put the BOs at userptr_valid_list, let next scheduled
delayed work validate them again.

The problem is not, that a concurrent MMU notifier invalidated the pages. The problem is, that the sequence number and the mem->inval flag disagree on this. In theory, both the sequence number and the mem->inval flag are updated by amdgpu_amdkfd_evict_userptr in the same critical section.

When we validate the BO, we set mem->valid to true. If mem->valid gets set back to false later, the sequence number should also be updated so that amdgpu_ttm_tt_get_user_pages_done should return false. So mem->valid and the sequence number should agree on whether the memory is valid. However, these WARNs indicate that there is a mismatch. If that happens, it means something went wrong. Some of the code's assumptions are being violated and this justifies a WARN.

yes, mem->invalid change and mmu_interval_set_seq(mni, cur_seq) at amdgpu_amdkfd_evict_userptr are under process_info->notifier_lock protection, but mem->range->notifier_seq could be stale when mmu notifier happens concurrently.
I think you mentioned that you only see the warnings with the DKMS driver. I suspect this is happening on some old get_user_pages code path, not the current hmm_range_fault-based one. I would recommend looking into this problem on the DKMS branch and addressing the problem there. This should not be fixed but removing legitimate WARNs on the upstream branch.

It happened with dkms driver with HAVE_AMDKCL_HMM_MIRROR_ENABLED enabled, it is not old get_user_pages path. I retraced this part code. I think there are two issues:

1: amdgpu_ttm_tt_get_user_pages can fail(-EFAULT). If it failed we should not set it valid(mem->invalid = 0). In this case mem has no associated hmm range associated.

2: mmu notifier can happen concurrently and update mem->range->notifier->invalidate_seq, but not mem->range->notifier_seq. That causes mem->range->notifier_seq stale when mem is in process_info->userptr_inval_list and amdgpu_amdkfd_restore_userptr_worker got interrupted. At next rescheduled next attempt we use stale mem->range->notifier_seq to compare with mem->range->notifier->invalidate_seq.

I will send a patch based on amd-staging-drm-next and amd-staging-dkms-6.1.

Regards

Xiaogang


Regards,
  Felix



Signed-off-by: Xiaogang Chen <Xiaogang.Chen@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 13 ++++++++++---
  1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 7b1f5933ebaa..d0c224703278 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -2581,11 +2581,18 @@ static int confirm_valid_user_pages_locked(struct amdkfd_process_info *process_i
            mem->range = NULL;
          if (!valid) {
-            WARN(!mem->invalid, "Invalid BO not marked invalid");
+            if (!mem->invalid)
+                pr_debug("Invalid BO not marked invalid\n");
+
+            ret = -EAGAIN;
+            continue;
+        }
+
+        if (mem->invalid) {
+            pr_debug("Valid BO is marked invalid\n");
              ret = -EAGAIN;
              continue;
          }
-        WARN(mem->invalid, "Valid BO is marked invalid");
            list_move_tail(&mem->validate_list.head,
                     &process_info->userptr_valid_list);
@@ -2648,7 +2655,7 @@ static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
          goto unlock_notifier_out;
        if (confirm_valid_user_pages_locked(process_info)) {
-        WARN(1, "User pages unexpectedly invalid");
+        pr_debug("User pages unexpectedly invalid, reschedule another attempt\n");
          goto unlock_notifier_out;
      }



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux