Re: [PATCH v3] drm/amdkfd: Use partial migrations in GPU page faults

On 9/29/2023 2:48 PM, Felix Kuehling wrote:

On 2023-09-29 15:44, Chen, Xiaogang wrote:

On 9/29/2023 1:41 PM, Felix Kuehling wrote:

On 2023-09-29 14:25, Chen, Xiaogang wrote:

On 9/28/2023 4:26 PM, Felix Kuehling wrote:
On 2023-09-20 13:32, Xiaogang.Chen wrote:
From: Xiaogang Chen <xiaogang.chen@xxxxxxx>

This patch implements partial migration in GPU page fault handling according to the migration granularity (default 2MB), and no longer splits the svm range in CPU page fault handling. An svm range may now include pages from both system RAM and the VRAM of one GPU. These changes are expected to improve migration performance and reduce MMU
callback and TLB flush workloads.

Signed-off-by: Xiaogang Chen <xiaogang.chen@xxxxxxx>

Some more nit-picks inline.
[snip]
@@ -202,12 +200,24 @@ svm_range_dma_map_dev(struct amdgpu_device *adev, struct svm_range *prange,
  static int
  svm_range_dma_map(struct svm_range *prange, unsigned long *bitmap,
            unsigned long offset, unsigned long npages,
-          unsigned long *hmm_pfns)
+          unsigned long *hmm_pfns, uint64_t *vram_pages)
  {
      struct kfd_process *p;
      uint32_t gpuidx;
+    struct page **page;
      int r;
  +    page = kvcalloc(npages, sizeof(*page), GFP_KERNEL);

Is there a reason this needs to be 0-initialized? The loop below initializes all elements.
Will use kvmalloc_array instead. The array does not need to be zero-initialized at allocation time, since every element is set afterward.
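A minimal sketch of that change, assuming the allocation is immediately followed by the initialization loop already shown in the hunk below:

    /* No zero-initialization needed: every element is written in the
     * loop that follows.
     */
    page = kvmalloc_array(npages, sizeof(*page), GFP_KERNEL);
    if (!page)
        return -ENOMEM;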

I'm also not happy about having to allocate this page array here. It may be justified if the repeated calls to hmm_pfn_to_page were expensive compared to the memory allocation and initialization. I'm not convinced that's the case, though. With CONFIG_SPARSEMEM_VMEMMAP, hmm_pfn_to_page basically boils down to __pfn_to_page, which is a macro that does just this:

#define __pfn_to_page(pfn)    (vmemmap + (pfn))

See https://elixir.bootlin.com/linux/v6.5.5/source/include/asm-generic/memory_model.h#L37.
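For reference, hmm_pfn_to_page() itself is only a flag mask plus pfn_to_page(); roughly (paraphrased from include/linux/hmm.h, exact flag handling may differ between kernel versions):

    static inline struct page *hmm_pfn_to_page(unsigned long hmm_pfn)
    {
        return pfn_to_page(hmm_pfn & ~HMM_PFN_FLAGS);
    }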

We build the page* array here so we do not need to look up the pages again in svm_range_dma_map_dev for each device, since the pages are the same for all GPU devices. That saves the repeated hmm_pfn_to_page calls, even though each call is cheap. We also want to obtain the VRAM page count during svm_range_dma_map.

I take it you do not want to allocate a page array on every svm_range_dma_map call? Could we instead keep the struct page **page array in svm_range and allocate it each time a new prange is created (and when a prange is cloned or split)?

Allocating and freeing memory in every dmamap call adds a point of failure and non-deterministic overhead for the memory allocation. It's non-deterministic because it depends on the current memory pressure; and if the array is large, it also needs to update the kernel page table and potentially flush CPU TLBs.

Allocating the page pointer array permanently in the prange structure would be even worse, because that increases our system memory overhead. For each GB of virtual address space we manage, we waste 2MB of system memory for the page arrays. To make matters worse, those page arrays are allocated in kernel mode where they are not swappable. It's like taking physical DIMM sticks out of your system. We worked hard to eliminate redundant address and page array allocations from the driver a few years ago.
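To make the 2MB figure concrete: 1GB of virtual address space at 4KB pages is 262,144 pages, and at 8 bytes per struct page pointer that is 262,144 * 8 bytes = 2MB of unswappable kernel memory per GB of managed address space.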

I'd prefer a small amount of redundant address calculations over the memory or non-deterministic allocation overhead.

Then I think we can calculate vram_pages in svm_range_validate_and_map, as in the previous version:

    for (int i = 0; i < npages; i++) {
        if (is_zone_device_page(hmm_pfn_to_page(hmm_range->hmm_pfns[i])))
            vram_pages++;
    }

That avoids allocating the page* array, at the cost of extra hmm_pfn_to_page calls.

Alternatively, we could add a parameter to svm_range_dma_map_dev indicating whether to count vram_pages, so that only one svm_range_dma_map_dev call does the counting.

svm_range_dma_map_dev has the is_zone_device_page(hmm_pfn_to_page(...)) check already. Always counting VRAM pages in svm_range_dma_map_dev would be trivial. Adding a condition would only make it slower.

Whether or not we use the vram_pages count in the end is a different matter. But I think just unconditionally adding vram_pages++ in svm_range_dma_map_dev would have negligible overhead.

OK, adding a check in svm_range_dma_map_dev could hurt the code cache or branch prediction. We will always do vram_pages++, even though it computes the same result for every GPU device.
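For illustration, a self-contained sketch of the counting logic (a hypothetical helper, not the actual svm_range_dma_map_dev body; in the real function the is_zone_device_page() check is already on the mapping path, so the increment adds no new branch):

    #include <linux/hmm.h>
    #include <linux/mm.h>

    /* Hypothetical helper: count how many HMM pfns refer to device (VRAM)
     * pages. svm_range_dma_map_dev already does this page lookup and
     * zone-device check per page, so counting there is effectively free.
     */
    static unsigned long svm_count_vram_pages(const unsigned long *hmm_pfns,
                                              unsigned long npages)
    {
        unsigned long i, vram_pages = 0;

        for (i = 0; i < npages; i++) {
            struct page *page = hmm_pfn_to_page(hmm_pfns[i]);

            if (is_zone_device_page(page))
                vram_pages++;
        }
        return vram_pages;
    }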

Regards

Xiaogang


Regards,
  Felix



Regards

Xiaogang

Regards,
  Felix




+    if (!page)
+        return -ENOMEM;
+
+    *vram_pages = 0;
+    for (int i = 0; i < npages; i++) {
+        page[i] = hmm_pfn_to_page(hmm_pfns[i]);
+        if (is_zone_device_page(page[i]))
+            (*vram_pages)++;
+    }
+
      p = container_of(prange->svms, struct kfd_process, svms);
        for_each_set_bit(gpuidx, bitmap, MAX_GPU_INSTANCE) {
@@ -221,11 +231,12 @@ svm_range_dma_map(struct svm_range *prange, unsigned long *bitmap,
          }
            r = svm_range_dma_map_dev(pdd->dev->adev, prange, offset, npages,
-                      hmm_pfns, gpuidx);
+                      hmm_pfns, gpuidx, page);
          if (r)
              break;
      }
  +    kvfree(page);
      return r;
  }
  @@ -347,6 +358,7 @@ svm_range *svm_range_new(struct svm_range_list *svms, uint64_t start,
      INIT_LIST_HEAD(&prange->child_list);
      atomic_set(&prange->invalid, 0);
      prange->validate_timestamp = 0;
+    prange->vram_pages = 0;
      mutex_init(&prange->migrate_mutex);
      mutex_init(&prange->lock);
  @@ -393,6 +405,8 @@ static void svm_range_bo_release(struct kref *kref)
               prange->start, prange->last);
          mutex_lock(&prange->lock);
          prange->svm_bo = NULL;
+        /* prange does not hold vram page now */
+        prange->actual_loc = 0;

If you need this here, something probably went wrong elsewhere. Before we release the BO, we should have migrated everything to system memory, and actual_loc should already be 0. If anything, I'd add a WARN_ON(prange->actual_loc) here.

Will replace it with WARN_ON as mentioned before.
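A minimal sketch of that replacement in svm_range_bo_release (assuming, as noted above, that all pages were migrated back to system memory before the BO is released):

        /* Everything should already be back in system memory before the
         * svm_bo is released, so actual_loc must be 0 here.
         */
        WARN_ON(prange->actual_loc);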

mutex_unlock(&prange->lock);
            spin_lock(&svm_bo->list_lock);
@@ -966,6 +980,11 @@ svm_range_split_nodes(struct svm_range *new, struct svm_range *old,
      new->svm_bo = svm_range_bo_ref(old->svm_bo);
      new->ttm_res = old->ttm_res;
  +    /* set new's vram_pages as old range's now, the acurate vram_pages
+     * will be updated during mapping
+     */
+    new->vram_pages = min(old->vram_pages, new->npages);
+
      spin_lock(&new->svm_bo->list_lock);
      list_add(&new->svm_bo_list, &new->svm_bo->range_list);
      spin_unlock(&new->svm_bo->list_lock);
@@ -1610,6 +1629,7 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
      struct svm_validate_context *ctx;
      unsigned long start, end, addr;
      struct kfd_process *p;
+    uint64_t vram_pages;
      void *owner;
      int32_t idx;
      int r = 0;
@@ -1678,11 +1698,13 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
          }
      }
  +    vram_pages = 0;
      start = prange->start << PAGE_SHIFT;
      end = (prange->last + 1) << PAGE_SHIFT;
      for (addr = start; addr < end && !r; ) {
          struct hmm_range *hmm_range;
          struct vm_area_struct *vma;
+        uint64_t vram_pages_vma;
          unsigned long next;
          unsigned long offset;
          unsigned long npages;
@@ -1711,12 +1733,14 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
            offset = (addr - start) >> PAGE_SHIFT;
          r = svm_range_dma_map(prange, ctx->bitmap, offset, npages,
-                      hmm_range->hmm_pfns);
+                      hmm_range->hmm_pfns, &vram_pages_vma);
          if (r) {
              pr_debug("failed %d to dma map range\n", r);
              goto unreserve_out;
          }
  +        vram_pages += vram_pages_vma;
+
          svm_range_lock(prange);
          if (amdgpu_hmm_range_get_pages_done(hmm_range)) {
              pr_debug("hmm update the range, need validate again\n"); @@ -1738,8 +1762,20 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
          addr = next;
      }
  -    if (addr == end)
+    if (addr == end) {
          prange->mapped_to_gpu = true;
+        prange->vram_pages = vram_pages;
+
+        /* if pragne does not include any vram page and it

Typo: prange

ok.

+         * has not released svm_bo drop its svm_bo reference
+         */
+        if (!vram_pages && prange->ttm_res) {
+            svm_range_vram_node_free(prange);
+            mutex_lock(&prange->lock);

Is this lock really needed?

Same as before, I will remove mutex_lock/unlock(&prange->lock), as it looks over-protective here.

+            prange->actual_loc = 0;
+            mutex_unlock(&prange->lock);
+        }
+    }
    unreserve_out:
      svm_range_unreserve_bos(ctx);
@@ -1996,6 +2032,7 @@ static struct svm_range *svm_range_clone(struct svm_range *old)
      new->actual_loc = old->actual_loc;
      new->granularity = old->granularity;
      new->mapped_to_gpu = old->mapped_to_gpu;
+    new->vram_pages = old->vram_pages;
      bitmap_copy(new->bitmap_access, old->bitmap_access, MAX_GPU_INSTANCE);
      bitmap_copy(new->bitmap_aip, old->bitmap_aip, MAX_GPU_INSTANCE);
@@ -2903,6 +2940,7 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
              uint32_t vmid, uint32_t node_id,
              uint64_t addr, bool write_fault)
  {
+    unsigned long start, last, size;
      struct mm_struct *mm = NULL;
      struct svm_range_list *svms;
      struct svm_range *prange;
@@ -3038,32 +3076,38 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
      kfd_smi_event_page_fault_start(node, p->lead_thread->pid, addr,
                         write_fault, timestamp);
  -    if (prange->actual_loc != best_loc) {
+    if (prange->actual_loc != 0 || best_loc != 0) {
          migration = true;
+        /* Align migration range start and size to granularity size */
+        size = 1UL << prange->granularity;
+        start = ALIGN_DOWN(addr, size);
+        last = ALIGN(addr + 1, size) - 1;
+
+        start = (start >= prange->start) ? start : prange->start;
+        last = (last <= prange->last) ? last : prange->last;

Similar to the simplification I recommended in svm_migrate_to_ram, this could be:

    start = max(ALIGN_DOWN(addr, size), prange->start);
    last = min(ALIGN(addr + 1, size) - 1, prange->last);
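As a worked example (assuming the default granularity of 9, i.e. 512 pages = 2MB with 4KB pages, and addr already shifted to page units as in svm_range_restore_pages): for a fault at page 0x12345, size = 512, so start = ALIGN_DOWN(0x12345, 512) = 0x12200 and last = ALIGN(0x12346, 512) - 1 = 0x123ff, a 512-page window around the fault that is then clamped to the prange bounds.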

OK, I overlooked that last time.

Regards

Xiaogang

Regards,
  Felix


+
          if (best_loc) {
-            r = svm_migrate_to_vram(prange, best_loc, mm,
-                    KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU);
+            r = svm_migrate_to_vram(prange, best_loc, start, last,
+                    mm, KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU);
              if (r) {
                  pr_debug("svm_migrate_to_vram failed (%d) at %llx, falling back to system memory\n",
                       r, addr);
                  /* Fallback to system memory if migration to
                   * VRAM failed
                   */
-                if (prange->actual_loc)
-                    r = svm_migrate_vram_to_ram(prange, mm,
-                            KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
-                            NULL);
+                if (prange->actual_loc && prange->actual_loc != best_loc)
+                    r = svm_migrate_vram_to_ram(prange, mm, start, last,
+                            KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, NULL);
                  else
                      r = 0;
              }
          } else {
-            r = svm_migrate_vram_to_ram(prange, mm,
-                    KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
-                    NULL);
+            r = svm_migrate_vram_to_ram(prange, mm, start, last,
+                    KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, NULL);
          }
          if (r) {
              pr_debug("failed %d to migrate svms %p [0x%lx 0x%lx]\n",
-                 r, svms, prange->start, prange->last);
+                r, svms, start, last);
              goto out_unlock_range;
          }
      }
@@ -3417,18 +3461,24 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange,
      *migrated = false;
      best_loc = svm_range_best_prefetch_location(prange);
  -    if (best_loc == KFD_IOCTL_SVM_LOCATION_UNDEFINED ||
-        best_loc == prange->actual_loc)
+    /* when best_loc is a gpu node and same as prange->actual_loc
+     * we still need do migration as prange->actual_loc !=0 does
+     * not mean all pages in prange are vram. hmm migrate will pick
+     * up right pages during migration.
+     */
+    if ((best_loc == KFD_IOCTL_SVM_LOCATION_UNDEFINED) ||
+        (best_loc == 0 && prange->actual_loc == 0))
          return 0;
        if (!best_loc) {
-        r = svm_migrate_vram_to_ram(prange, mm,
+        r = svm_migrate_vram_to_ram(prange, mm, prange->start, prange->last,
                      KFD_MIGRATE_TRIGGER_PREFETCH, NULL);
          *migrated = !r;
          return r;
      }
-    r = svm_migrate_to_vram(prange, best_loc, mm, KFD_MIGRATE_TRIGGER_PREFETCH);
+    r = svm_migrate_to_vram(prange, best_loc, prange->start, prange->last,
+                    mm, KFD_MIGRATE_TRIGGER_PREFETCH);
      *migrated = !r;
        return r;
@@ -3483,7 +3533,11 @@ static void svm_range_evict_svm_bo_worker(struct work_struct *work)
            mutex_lock(&prange->migrate_mutex);
          do {
+            /* migrate all vram pages in this prange to sys ram
+             * after that prange->actual_loc should be zero
+             */
              r = svm_migrate_vram_to_ram(prange, mm,
+                    prange->start, prange->last,
                      KFD_MIGRATE_TRIGGER_TTM_EVICTION, NULL);
          } while (!r && prange->actual_loc && --retries);
  diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index 5fd958a97a28..8574cc5eeabd 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -78,6 +78,7 @@ struct svm_work_list_item {
   * @update_list:link list node used to add to update_list
   * @mapping:    bo_va mapping structure to create and update GPU page table
   * @npages:     number of pages
+ * @vram_pages: vram pages number in this svm_range
   * @dma_addr:   dma mapping address on each GPU for system memory physical page
   * @ttm_res:    vram ttm resource map
   * @offset:     range start offset within mm_nodes
@@ -88,7 +89,9 @@ struct svm_work_list_item {
   * @flags:      flags defined as KFD_IOCTL_SVM_FLAG_*
   * @perferred_loc: perferred location, 0 for CPU, or GPU id
   * @perfetch_loc: last prefetch location, 0 for CPU, or GPU id
- * @actual_loc: the actual location, 0 for CPU, or GPU id
+ * @actual_loc: this svm_range location. 0: all pages are from sys ram; + *              GPU id: this svm_range may include vram pages from GPU with
+ *              id actual_loc.
   * @granularity:migration granularity, log2 num pages
   * @invalid:    not 0 means cpu page table is invalidated
   * @validate_timestamp: system timestamp when range is validated
@@ -112,6 +115,7 @@ struct svm_range {
      struct list_head        list;
      struct list_head        update_list;
      uint64_t            npages;
+    uint64_t            vram_pages;
      dma_addr_t            *dma_addr[MAX_GPU_INSTANCE];
      struct ttm_resource        *ttm_res;
      uint64_t            offset;


