Need to support mixed memory mappings with HMM

Felix Kuehling <felix.kuehling@xxxxxxx> · Thu, 25 Mar 2021 12:03:29 -0400

Hi,

This is a long one with a proposal for a pretty significant redesign of
how we handle migrations and VRAM management with HMM. This is based on
my own debugging and reading of the migrate_vma helpers, as well as
Alex's problems with migrations on A+A. I hope we can discuss this next
Monday after you've had some time to digest it.

I did some debugging yesterday and found that migrations to VRAM can
fail for some pages. The current migration helpers have many corner
cases where a page cannot be migrated. Some of them may be fixable
(adding support for THP), others are not (locked pages are skipped to
avoid deadlocks). Therefore I think our current code is too inflexible
when it assumes that a range is entirely in one place.

Alex also ran into some funny issues with COW on A+A where some pages
get faulted back to system memory. I think a lot of the problems here
will get easier once we support mixed mappings.

Mixed GPU mappings
==================

The idea is to remove any assumptions that an entire svm_range is in
one place. Instead hmm_range_fault gives us a list of pages, some of
which are system memory and others are device_private or device_generic.

We will need an amdgpu_vm interface that lets us map mixed page arrays
where different pages use different PTE flags. We can have at least 3
different types of pages in one mapping:

 1. System memory (S-bit set)
 2. Local memory (S-bit cleared, MTYPE for local memory)
 3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)

My idea is to change the amdgpu_vm_update_mapping interface to use some
high bits in the pages_addr array to indicate the page type. For the
default page type (0) nothing really changes for the callers. The
"flags" parameter needs to become a pointer to an array that gets
indexed by the high bits from the pages_addr array. For existing callers
it's as easy as changing flags to &flags (array of size 1). For HMM we
would pass a pointer to a real array.
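
To make the interface change more concrete, here is a rough sketch in
plain userspace C. Everything in it is an assumption for illustration:
the 2-bit type field in bits 63:62, the enum values, the helper names
and the flag values are all made up, and none of this is existing
amdgpu_vm code. It only shows the indexing scheme, where the high bits
of each pages_addr entry select an entry in the flags array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding: the top 2 bits of each entry carry the page type. */
#define PAGE_TYPE_SHIFT 62
#define PAGE_TYPE_MASK  (UINT64_C(3) << PAGE_TYPE_SHIFT)

/* Hypothetical type numbering; type 0 is the default for existing callers. */
enum page_type {
	PAGE_TYPE_DEFAULT = 0,
	PAGE_TYPE_LOCAL   = 1,
	PAGE_TYPE_REMOTE  = 2,
};

/* Tag an address with its page type in the high bits. */
static inline uint64_t encode_page(uint64_t addr, enum page_type type)
{
	return (addr & ~PAGE_TYPE_MASK) | ((uint64_t)type << PAGE_TYPE_SHIFT);
}

static inline unsigned int page_type_of(uint64_t entry)
{
	return (unsigned int)(entry >> PAGE_TYPE_SHIFT);
}

static inline uint64_t page_addr_of(uint64_t entry)
{
	return entry & ~PAGE_TYPE_MASK;
}

/*
 * Build one PTE per page. "flags" is an array indexed by page type, so
 * a caller that only ever uses type 0 can pass the address of a single
 * flags value.
 */
static void build_ptes(const uint64_t *pages_addr, size_t npages,
		       const uint64_t *flags, uint64_t *ptes)
{
	assert(pages_addr && flags && ptes);

	for (size_t i = 0; i < npages; i++)
		ptes[i] = page_addr_of(pages_addr[i]) |
			  flags[page_type_of(pages_addr[i])];
}
```

An existing caller would keep passing untagged addresses and &flags,
while an HMM caller would tag each address and pass a three-entry
array.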

Once this is done, it leads to a number of opportunities for
simplification and better efficiency in kfd_svm:

  * Support migration when cpages != npages
  * Migrate a part of an svm_range without splitting it. No more
    splitting of ranges in CPU page faults
  * Migrate a part of an svm_range in GPU page fault handler. No more
    migrating the whole range for a single page fault
  * Simplified VRAM management (see below)

With that, svm_range will no longer have an "actual_loc" field. If we're
not sure where the data is, we need to call migrate. If it's already in
the right place, then cpages will be 0 and we can skip all the steps
after migrate_vma_setup.
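
The control flow I have in mind looks roughly like the sketch below.
It is a userspace model, not kernel code: SRC_MIGRATE and DST_VALID are
hypothetical flags standing in for whatever the migrate helpers report
per page, and migrate_collected is a made-up name. The point is only
that cpages == 0 is an early exit and that cpages < npages is handled
page by page instead of failing the whole range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-page flags, not the kernel's definitions. */
#define SRC_MIGRATE 0x1UL	/* page was collected and can be migrated */
#define DST_VALID   0x1UL	/* destination page was filled in */

/*
 * Migrate only the pages that were collected. Pages that could not be
 * collected stay where they are and get an empty dst entry. Returns
 * the number of pages handled.
 */
static size_t migrate_collected(const unsigned long *src, unsigned long *dst,
				size_t npages, size_t cpages)
{
	size_t done = 0;

	assert(cpages <= npages);

	/* Nothing collected: data is already in the right place. */
	if (cpages == 0)
		return 0;

	for (size_t i = 0; i < npages; i++) {
		if (!(src[i] & SRC_MIGRATE)) {
			dst[i] = 0;	/* skipped, stays in place */
			continue;
		}
		dst[i] = DST_VALID;	/* stands in for alloc + copy */
		done++;
	}
	return done;
}
```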

Simplified VRAM management
==========================

VRAM BOs are no longer associated with pranges. Instead they are
"free-floating", allocated during migration to VRAM, with a reference
count for each page that uses the BO. The reference is released in the
page-release callback. When the ref count drops to 0, the BO is freed.
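
A minimal model of that lifetime rule, in userspace C. The struct and
function names are hypothetical, and the "freed" flag only stands in
for actually releasing the BO. Real code would presumably use an atomic
counter or kref rather than a plain integer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical free-floating VRAM BO, referenced once per backed page. */
struct vram_bo {
	unsigned int page_refs;	/* pages currently using this BO */
	bool freed;		/* stands in for releasing the BO */
};

/* Called when a page is migrated into the BO. */
static void vram_bo_get_page(struct vram_bo *bo)
{
	bo->page_refs++;
}

/*
 * Called from the page-release callback. Returns true if this was the
 * last page and the BO was freed.
 */
static bool vram_bo_put_page(struct vram_bo *bo)
{
	assert(bo->page_refs > 0);

	if (--bo->page_refs == 0) {
		bo->freed = true;
		return true;
	}
	return false;
}
```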

VRAM BO size should match the migration granularity, 2MB by default.
That way the BO can be freed when memory gets migrated out. If migration
of some pages fails, the BO may not be fully occupied. Also, some pages
may be released individually on A+A due to COW or other events.

Eviction needs to migrate all the pages still using the BO. If the BO
struct keeps an array of page pointers, that's basically the migrate.src
for the eviction. Migration calls "try_to_unmap", which has the best
chance of freeing the BO, even when shared by multiple processes.
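
As a sketch of the bookkeeping, assuming 4KB pages so that a 2MB BO has
512 slots: the BO keeps one page pointer per slot, NULL where the slot
is unused, and eviction gathers the non-NULL entries as its source
list. The struct, the constants and collect_eviction_src are
hypothetical, and the real migrate.src array holds encoded PFNs rather
than raw pointers.

```c
#include <assert.h>
#include <stddef.h>

#define BO_SIZE   (2UL << 20)		/* 2MB migration granularity */
#define PAGE_SZ   4096UL		/* assumed page size */
#define BO_NPAGES (BO_SIZE / PAGE_SZ)	/* 512 slots */

/* Hypothetical per-BO page tracking; NULL means the slot is unused. */
struct vram_bo_pages {
	void *pages[BO_NPAGES];
};

/*
 * Collect every page still backed by the BO into src, which must have
 * room for BO_NPAGES entries. Returns the number of pages to migrate.
 */
static size_t collect_eviction_src(const struct vram_bo_pages *bo, void **src)
{
	size_t n = 0;

	assert(bo && src);

	for (size_t i = 0; i < BO_NPAGES; i++) {
		if (bo->pages[i])
			src[n++] = bo->pages[i];
	}
	return n;
}
```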

If we cannot guarantee eviction of pages, we cannot use TTM for VRAM
allocations. Instead we need to use amdgpu_vram_mgr, and we need a way
to detect memory pressure so we can start evicting memory.

Regards,
  Felix
