On 4/10/2023 2:58 PM, Felix Kuehling wrote:
On 2023-04-10 10:36, Xiaogang.Chen wrote:
From: Xiaogang Chen <xiaogang.chen@xxxxxxx>
During KFD restore evicted userptr BOs mmu invalidate callback may
invalidate
same userptr BOs that have been just restored. When KFD restore
process detects
it KFD will reschedule another validation process. It is not an
error. Change
WARN to pr_debug, not put the BOs at userptr_valid_list, let next
scheduled
delayed work validate them again.
The problem is not, that a concurrent MMU notifier invalidated the
pages. The problem is, that the sequence number and the mem->inval
flag disagree on this. In theory, both the sequence number and the
mem->inval flag are updated by amdgpu_amdkfd_evict_userptr in the same
critical section.
When we validate the BO, we set mem->valid to true. If mem->valid gets
set back to false later, the sequence number should also be updated so
that amdgpu_ttm_tt_get_user_pages_done should return false. So
mem->valid and the sequence number should agree on whether the memory
is valid. However, these WARNs indicate that there is a mismatch. If
that happens, it means something went wrong. Some of the code's
assumptions are being violated and this justifies a WARN.
yes, mem->invalid change and mmu_interval_set_seq(mni, cur_seq) at
amdgpu_amdkfd_evict_userptr are under process_info->notifier_lock
protection, but mem->range->notifier_seq could be stale when mmu
notifier happens concurrently.
I think you mentioned that you only see the warnings with the DKMS
driver. I suspect this is happening on some old get_user_pages code
path, not the current hmm_range_fault-based one. I would recommend
looking into this problem on the DKMS branch and addressing the
problem there. This should not be fixed but removing legitimate WARNs
on the upstream branch.
It happened with dkms driver with HAVE_AMDKCL_HMM_MIRROR_ENABLED
enabled, it is not old get_user_pages path. I retraced this part code. I
think there are two issues:
1: amdgpu_ttm_tt_get_user_pages can fail(-EFAULT). If it failed we
should not set it valid(mem->invalid = 0). In this case mem has no
associated hmm range associated.
2: mmu notifier can happen concurrently and update
mem->range->notifier->invalidate_seq, but not mem->range->notifier_seq.
That causes mem->range->notifier_seq stale when mem is in
process_info->userptr_inval_list and
amdgpu_amdkfd_restore_userptr_worker got interrupted. At next
rescheduled next attempt we use stale mem->range->notifier_seq to
compare with mem->range->notifier->invalidate_seq.
I will send a patch based on amd-staging-drm-next and amd-staging-dkms-6.1.
Regards
Xiaogang
Regards,
Felix
Signed-off-by: Xiaogang Chen <Xiaogang.Chen@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 7b1f5933ebaa..d0c224703278 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -2581,11 +2581,18 @@ static int
confirm_valid_user_pages_locked(struct amdkfd_process_info *process_i
mem->range = NULL;
if (!valid) {
- WARN(!mem->invalid, "Invalid BO not marked invalid");
+ if (!mem->invalid)
+ pr_debug("Invalid BO not marked invalid\n");
+
+ ret = -EAGAIN;
+ continue;
+ }
+
+ if (mem->invalid) {
+ pr_debug("Valid BO is marked invalid\n");
ret = -EAGAIN;
continue;
}
- WARN(mem->invalid, "Valid BO is marked invalid");
list_move_tail(&mem->validate_list.head,
&process_info->userptr_valid_list);
@@ -2648,7 +2655,7 @@ static void
amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
goto unlock_notifier_out;
if (confirm_valid_user_pages_locked(process_info)) {
- WARN(1, "User pages unexpectedly invalid");
+ pr_debug("User pages unexpectedly invalid, reschedule
another attempt\n");
goto unlock_notifier_out;
}