On 2025-01-31 11:58, Xiaogang.Chen wrote: > From: Xiaogang Chen <xiaogang.chen@xxxxxxx> > > When register a vm range at svm the added vm range may be split into multiple > subranges and/or existing pranges got spitted. The new pranges need validated > and mapped. This patch changes error handling for pranges that fail updating: It may help if you clearly state the problem you're trying to solve to justify the changes in error handling. See more comments inline. > > 1: free prange resources and remove it from svms if its updating fails as it > will not be used. > 2: return -EAGAIN when all pranges at update_list need redo valid/map, > otherwise return no -EAGAIN error to user space to indicate failure. That > removes unnecessary retries. > > Signed-off-by: Xiaogang Chen <xiaogang.chen@xxxxxxx> > --- > drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 27 +++++++++++++++++++++++---- > 1 file changed, 23 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c > index e32e19196f6b..455cb98bf16a 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c > @@ -3716,8 +3716,19 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm, > > out_unlock_range: > mutex_unlock(&prange->migrate_mutex); > - if (r) > - ret = r; > + /* this prange cannot be migraed, valid or map */ > + if (r) { > + /* free this prange resources, remove it from svms */ > + svm_range_unlink(prange); > + svm_range_remove_notifier(prange); > + svm_range_free(prange, false); Freeing the prange removes SVM mappings from the process. This will break the subsequent execution of the application. In case you were going to return -EAGAIN that's definitely wrong because the application would expect the SVM range to work after a successful retry. I'm not sure the resource waste is a valid argument in case of a fatal error. I would expect the application to terminate anyways in this case, which would result in freeing the resources. Do you see a scenario where an application wants to continue running after this function returned a fatal error? > + > + /* ret got update when any r != -EAGAIN; > + * return -EAGAIN when all pranges at update_list > + * need redo valid/map */ > + if (r != -EAGAIN || !ret) > + ret = r; This is a good point. But the explanation is a bit misleading: "all pranges ... need redo" is not really true. There may also be ranges that were validated successfully. I think the point you're trying to make is this: Don't return -EAGAIN if there was any previous fatal error found. I could potentially see a different optimization. If you encounter a fatal error, you can skip the rest of the ranges and return the error immediately. > + } > } > > list_for_each_entry(prange, &remap_list, update_list) { > @@ -3729,8 +3740,16 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm, > if (r) > pr_debug("failed %d on remap svm range\n", r); > mutex_unlock(&prange->migrate_mutex); > - if (r) > - ret = r; > + > + if (r) { > + /* remove this prange */ > + svm_range_unlink(prange); > + svm_range_remove_notifier(prange); > + svm_range_free(prange, false); Same as above. Regards, Felix > + > + if (r != -EAGAIN || !ret) > + ret = r; > + } > } > > dynamic_svm_range_dump(svms);