Re: [PATCH v4] drm/amdkfd: Fix partial migrate issue

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 1/12/2025 8:02 PM, Deng, Emily wrote:
[AMD Official Use Only - AMD Internal Distribution Only]

-----Original Message-----
From: Chen, Xiaogang <Xiaogang.Chen@xxxxxxx>
Sent: Saturday, January 11, 2025 2:13 AM
To: Yang, Philip <Philip.Yang@xxxxxxx>; Deng, Emily <Emily.Deng@xxxxxxx>;
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH v4] drm/amdkfd: Fix partial migrate issue


On 1/10/2025 11:33 AM, Philip Yang wrote:

On 2025-01-10 11:23, Chen, Xiaogang wrote:
On 1/10/2025 8:37 AM, Philip Yang wrote:

On 2025-01-10 02:49, Emily Deng wrote:
For partial migrate from ram to vram, the migrate->cpages is not
equal to migrate->npages, should use migrate->npages to check all
needed migrate pages which could be copied or not.

And only need to set those pages could be migrated to
migrate->dst[i], or
the migrate_vma_pages will migrate the wrong pages based on the
migrate->dst[i].

v2:
Add mpages to break the loop earlier.

v3:
Uses MIGRATE_PFN_MIGRATE to identify whether page could be migrated.
The error handling need below change, with that fixed, this patch is

Reviewed-by: Philip Yang<Philip.Yang@xxxxxxx>

Signed-off-by: Emily Deng<Emily.Deng@xxxxxxx>
---
   drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 ++++++++++-------
   1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 4b275937d05e..bfaccabeb3a0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -278,10 +278,11 @@ svm_migrate_copy_to_vram(struct kfd_node
*node, struct svm_range *prange,
                struct migrate_vma *migrate, struct dma_fence
**mfence,
                dma_addr_t *scratch, uint64_t ttm_res_offset)
   {
-    uint64_t npages = migrate->cpages;
+    uint64_t npages = migrate->npages;
       struct amdgpu_device *adev = node->adev;
       struct device *dev = adev->dev;
       struct amdgpu_res_cursor cursor;
+    uint64_t mpages = 0;
       dma_addr_t *src;
       uint64_t *dst;
       uint64_t i, j;
@@ -295,14 +296,16 @@ svm_migrate_copy_to_vram(struct kfd_node
*node, struct svm_range *prange,
         amdgpu_res_first(prange->ttm_res, ttm_res_offset,
                npages << PAGE_SHIFT, &cursor);
-    for (i = j = 0; i < npages; i++) {
+    for (i = j = 0; (i < npages) && (mpages < migrate->cpages);
i++) {
           struct page *spage;
   -        dst[i] = cursor.start + (j << PAGE_SHIFT);
-        migrate->dst[i] = svm_migrate_addr_to_pfn(adev, dst[i]);
-        svm_migrate_get_vram_page(prange, migrate->dst[i]);
-        migrate->dst[i] = migrate_pfn(migrate->dst[i]);
-
+        if (migrate->src[i] & MIGRATE_PFN_MIGRATE) {
+            dst[i] = cursor.start + (j << PAGE_SHIFT);
+            migrate->dst[i] = svm_migrate_addr_to_pfn(adev,
+dst[i]);
+            svm_migrate_get_vram_page(prange, migrate->dst[i]);
+            migrate->dst[i] = migrate_pfn(migrate->dst[i]);
+            mpages++;
+        }
           spage = migrate_pfn_to_page(migrate->src[i]);
           if (spage && !is_zone_device_page(spage)) {
               src[i] = dma_map_page(dev, spage, 0, PAGE_SIZE,
  out_free_vram_pages:
         if (r) {
                 pr_debug("failed %d to copy memory to vram\n", r);
-               while (i--) {
+
+               for (i = 0; i < npages && mpages; i++) {
+                       if (!dst[i])
+                               continue;
                         svm_migrate_put_vram_page(adev, dst[i]);
                         migrate->dst[i] = 0;
+                       mpages--;
                 }
         }
This error handing not need recover all vram pages as error happened
at middle. Can use se do {....} while(i--);
no, for example migrate npage=cpage=4, and outside for loop,
svm_migrate_copy_memory_gart failed, dst[4] is out of range access.
You are right, but it is awkward. Inside loop we update dst[i] before do sdam that does
not include the dst[i] just updated.
As the dst[i] is already initialized to NULL, so I think it doesn't matter if use follow:
                 while (i--) {
                         if (dst[i])
                                 svm_migrate_put_vram_page(adev, dst[i]);
                         migrate->dst[i] = 0;
                 }

The issue at error recovery path is due to we handled sys ram(src) and dst(vram) discontinuation in different way. When src got discontinuation we migrate j pages that page ith is not migrated. When dst(vram) got discontinuation we migrate j+1 pages that page ith is migrated. That cause error path has to iterate all pages to find which page got migrated before error happened. Also makes code more difficult to read.

Besides your change I think we can also change inside loop that handle src and dst discontinuation in consistent way.

Emily Deng
Best Wishes


Regard

Xiaogang

Regards

Xiaogang




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux