Re: [PATCH] drm/amdkfd: Change WARN to pr_debug when same userptr BOs got invalidated by mmu.

"Chen, Xiaogang" <xiaogang.chen@xxxxxxx> · Wed, 12 Apr 2023 00:42:22 -0500

On 4/10/2023 2:58 PM, Felix Kuehling wrote:
On 2023-04-10 10:36, Xiaogang.Chen wrote:
From: Xiaogang Chen <xiaogang.chen@xxxxxxx>

During KFD restore evicted userptr BOs mmu invalidate callback may 
invalidate
same userptr BOs that have been just restored. When KFD restore 
process detects
it KFD will reschedule another validation process. It is not an 
error. Change
WARN to pr_debug, not put the BOs at userptr_valid_list, let next 
scheduled
delayed work validate them again.

The problem is not, that a concurrent MMU notifier invalidated the 
pages. The problem is, that the sequence number and the mem->inval 
flag disagree on this. In theory, both the sequence number and the 
mem->inval flag are updated by amdgpu_amdkfd_evict_userptr in the same 
critical section.

When we validate the BO, we set mem->valid to true. If mem->valid gets 
set back to false later, the sequence number should also be updated so 
that amdgpu_ttm_tt_get_user_pages_done should return false. So 
mem->valid and the sequence number should agree on whether the memory 
is valid. However, these WARNs indicate that there is a mismatch. If 
that happens, it means something went wrong. Some of the code's 
assumptions are being violated and this justifies a WARN.

yes, mem->invalid change and mmu_interval_set_seq(mni, cur_seq) at 
amdgpu_amdkfd_evict_userptr are under process_info->notifier_lock 
protection, but mem->range->notifier_seq could be stale when mmu 
notifier happens concurrently.
I think you mentioned that you only see the warnings with the DKMS 
driver. I suspect this is happening on some old get_user_pages code 
path, not the current hmm_range_fault-based one. I would recommend 
looking into this problem on the DKMS branch and addressing the 
problem there. This should not be fixed but removing legitimate WARNs 
on the upstream branch.

It happened with dkms driver with HAVE_AMDKCL_HMM_MIRROR_ENABLED 
enabled, it is not old get_user_pages path. I retraced this part code. I 
think there are two issues:

1: amdgpu_ttm_tt_get_user_pages can fail(-EFAULT). If it failed we 
should not set it valid(mem->invalid = 0). In this case mem has no 
associated hmm range associated.

2: mmu notifier can happen concurrently and update 
mem->range->notifier->invalidate_seq, but not mem->range->notifier_seq. 
That causes mem->range->notifier_seq stale when mem is in 
process_info->userptr_inval_list and 
amdgpu_amdkfd_restore_userptr_worker got interrupted. At next 
rescheduled next attempt we use stale mem->range->notifier_seq to 
compare with mem->range->notifier->invalidate_seq.

I will send a patch based on amd-staging-drm-next and amd-staging-dkms-6.1.

Regards

Xiaogang


Regards,
  Felix



Signed-off-by: Xiaogang Chen <Xiaogang.Chen@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 13 ++++++++++---
  1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 7b1f5933ebaa..d0c224703278 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -2581,11 +2581,18 @@ static int 
confirm_valid_user_pages_locked(struct amdkfd_process_info *process_i
            mem->range = NULL;
          if (!valid) {
-            WARN(!mem->invalid, "Invalid BO not marked invalid");
+            if (!mem->invalid)
+                pr_debug("Invalid BO not marked invalid\n");
+
+            ret = -EAGAIN;
+            continue;
+        }
+
+        if (mem->invalid) {
+            pr_debug("Valid BO is marked invalid\n");
              ret = -EAGAIN;
              continue;
          }
-        WARN(mem->invalid, "Valid BO is marked invalid");
            list_move_tail(&mem->validate_list.head,
                     &process_info->userptr_valid_list);
@@ -2648,7 +2655,7 @@ static void 
amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
          goto unlock_notifier_out;
        if (confirm_valid_user_pages_locked(process_info)) {
-        WARN(1, "User pages unexpectedly invalid");
+        pr_debug("User pages unexpectedly invalid, reschedule 
another attempt\n");
          goto unlock_notifier_out;
      }