On 1/12/2025 8:02 PM, Deng, Emily wrote:
[AMD Official Use Only - AMD Internal Distribution Only]
-----Original Message-----
From: Chen, Xiaogang <Xiaogang.Chen@xxxxxxx>
Sent: Saturday, January 11, 2025 2:13 AM
To: Yang, Philip <Philip.Yang@xxxxxxx>; Deng, Emily <Emily.Deng@xxxxxxx>;
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH v4] drm/amdkfd: Fix partial migrate issue
On 1/10/2025 11:33 AM, Philip Yang wrote:
On 2025-01-10 11:23, Chen, Xiaogang wrote:
On 1/10/2025 8:37 AM, Philip Yang wrote:
On 2025-01-10 02:49, Emily Deng wrote:
For partial migrate from ram to vram, the migrate->cpages is not
equal to migrate->npages, should use migrate->npages to check all
needed migrate pages which could be copied or not.
And only need to set those pages could be migrated to
migrate->dst[i], or
the migrate_vma_pages will migrate the wrong pages based on the
migrate->dst[i].
v2:
Add mpages to break the loop earlier.
v3:
Uses MIGRATE_PFN_MIGRATE to identify whether page could be migrated.
The error handling need below change, with that fixed, this patch is
Reviewed-by: Philip Yang<Philip.Yang@xxxxxxx>
Signed-off-by: Emily Deng<Emily.Deng@xxxxxxx>
---
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 4b275937d05e..bfaccabeb3a0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -278,10 +278,11 @@ svm_migrate_copy_to_vram(struct kfd_node
*node, struct svm_range *prange,
struct migrate_vma *migrate, struct dma_fence
**mfence,
dma_addr_t *scratch, uint64_t ttm_res_offset)
{
- uint64_t npages = migrate->cpages;
+ uint64_t npages = migrate->npages;
struct amdgpu_device *adev = node->adev;
struct device *dev = adev->dev;
struct amdgpu_res_cursor cursor;
+ uint64_t mpages = 0;
dma_addr_t *src;
uint64_t *dst;
uint64_t i, j;
@@ -295,14 +296,16 @@ svm_migrate_copy_to_vram(struct kfd_node
*node, struct svm_range *prange,
amdgpu_res_first(prange->ttm_res, ttm_res_offset,
npages << PAGE_SHIFT, &cursor);
- for (i = j = 0; i < npages; i++) {
+ for (i = j = 0; (i < npages) && (mpages < migrate->cpages);
i++) {
struct page *spage;
- dst[i] = cursor.start + (j << PAGE_SHIFT);
- migrate->dst[i] = svm_migrate_addr_to_pfn(adev, dst[i]);
- svm_migrate_get_vram_page(prange, migrate->dst[i]);
- migrate->dst[i] = migrate_pfn(migrate->dst[i]);
-
+ if (migrate->src[i] & MIGRATE_PFN_MIGRATE) {
+ dst[i] = cursor.start + (j << PAGE_SHIFT);
+ migrate->dst[i] = svm_migrate_addr_to_pfn(adev,
+dst[i]);
+ svm_migrate_get_vram_page(prange, migrate->dst[i]);
+ migrate->dst[i] = migrate_pfn(migrate->dst[i]);
+ mpages++;
+ }
spage = migrate_pfn_to_page(migrate->src[i]);
if (spage && !is_zone_device_page(spage)) {
src[i] = dma_map_page(dev, spage, 0, PAGE_SIZE,
out_free_vram_pages:
if (r) {
pr_debug("failed %d to copy memory to vram\n", r);
- while (i--) {
+
+ for (i = 0; i < npages && mpages; i++) {
+ if (!dst[i])
+ continue;
svm_migrate_put_vram_page(adev, dst[i]);
migrate->dst[i] = 0;
+ mpages--;
}
}
This error handing not need recover all vram pages as error happened
at middle. Can use se do {....} while(i--);
no, for example migrate npage=cpage=4, and outside for loop,
svm_migrate_copy_memory_gart failed, dst[4] is out of range access.
You are right, but it is awkward. Inside loop we update dst[i] before do sdam that does
not include the dst[i] just updated.
As the dst[i] is already initialized to NULL, so I think it doesn't matter if use follow:
while (i--) {
if (dst[i])
svm_migrate_put_vram_page(adev, dst[i]);
migrate->dst[i] = 0;
}
The issue at error recovery path is due to we handled sys ram(src) and
dst(vram) discontinuation in different way. When src got discontinuation
we migrate j pages that page ith is not migrated. When dst(vram) got
discontinuation we migrate j+1 pages that page ith is migrated. That
cause error path has to iterate all pages to find which page got
migrated before error happened. Also makes code more difficult to read.
Besides your change I think we can also change inside loop that handle
src and dst discontinuation in consistent way.
Emily Deng
Best Wishes
Regard
Xiaogang
Regards
Xiaogang