Re: Need to support mixed memory mappings with HMM

Christian König <christian.koenig@xxxxxxx> · Thu, 25 Mar 2021 17:22:30 +0100

On 25.03.21 at 17:20, Felix Kuehling wrote:
On 2021-03-25 at 12:16 p.m., Christian König wrote:
On 25.03.21 at 17:14, Felix Kuehling wrote:
On 2021-03-25 at 12:10 p.m., Christian König wrote:
On 25.03.21 at 17:03, Felix Kuehling wrote:
Hi,

This is a long one with a proposal for a pretty significant redesign of
how we handle migrations and VRAM management with HMM. This is based on
my own debugging and reading of the migrate_vma helpers, as well as
Alex's problems with migrations on A+A. I hope we can discuss this next
Monday after you've had some time to digest it.

I did some debugging yesterday and found that migrations to VRAM can
fail for some pages. The current migration helpers have many corner
cases where a page cannot be migrated. Some of them may be fixable
(adding support for THP), others are not (locked pages are skipped to
avoid deadlocks). Therefore I think our current code is too inflexible
when it assumes that a range is entirely in one place.

Alex also ran into some funny issues with COW on A+A where some pages
get faulted back to system memory. I think a lot of the problems here
will get easier once we support mixed mappings.

Mixed GPU mappings
==================

The idea is to remove any assumptions that an entire svm_range is in
one place. Instead hmm_range_fault gives us a list of pages, some of
which are system memory and others are device_private or
device_generic.

We will need an amdgpu_vm interface that lets us map mixed page arrays
where different pages use different PTE flags. We can have at least 3
different types of pages in one mapping:

    1. System memory (S-bit set)
    2. Local memory (S-bit cleared, MTYPE for local memory)
    3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)

My idea is to change the amdgpu_vm_update_mapping interface to use some
high bits in the pages_addr array to indicate the page type. For the
default page type (0) nothing really changes for the callers. The
"flags" parameter needs to become a pointer to an array that gets
indexed by the high bits from the pages_addr array. For existing callers
it's as easy as changing flags to &flags (array of size 1). For HMM we
would pass a pointer to a real array.
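To make the encoding concrete, here is a minimal userspace sketch of the idea, not the actual amdgpu_vm code. Everything in it is an assumption for illustration: the choice of the top two bits, the enum names, the helper names (encode_page, page_type_of, page_addr_of, make_pte), and the idea that a PTE is simply the address OR'ed with the flags.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the top 2 bits of each pages_addr entry carry the page type. */
#define PAGE_TYPE_SHIFT 62
#define PAGE_TYPE_MASK  (UINT64_C(3) << PAGE_TYPE_SHIFT)

/* Hypothetical page types; which type gets which number is an assumption. */
enum page_type {
    PAGE_TYPE_DEFAULT = 0,  /* existing callers never set the high bits */
    PAGE_TYPE_LOCAL   = 1,
    PAGE_TYPE_XGMI    = 2,
    PAGE_TYPE_COUNT
};

/* Tag a DMA address with its page type in the high bits. */
static inline uint64_t encode_page(uint64_t dma_addr, enum page_type type)
{
    assert((dma_addr & PAGE_TYPE_MASK) == 0); /* address must not use the tag bits */
    return dma_addr | ((uint64_t)type << PAGE_TYPE_SHIFT);
}

/* Extract the page type from an encoded entry. */
static inline enum page_type page_type_of(uint64_t entry)
{
    return (enum page_type)(entry >> PAGE_TYPE_SHIFT);
}

/* Extract the plain address from an encoded entry. */
static inline uint64_t page_addr_of(uint64_t entry)
{
    return entry & ~PAGE_TYPE_MASK;
}

/* Pick the flags for this page by indexing the flags array with the page type. */
static inline uint64_t make_pte(uint64_t entry, const uint64_t *flags)
{
    return page_addr_of(entry) | flags[page_type_of(entry)];
}
```

An existing caller with a single flags value would pass &flags and only ever produce type 0 entries, so it never indexes past element 0. An HMM caller would pass an array with PAGE_TYPE_COUNT elements.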
Yeah, I've thought about stuff like that as well for a while.

Problem is that this won't work that easily. We assume in many places
that the flags don't change for the range in question.

I think some lower level functions assume that the flags stay the same
for physically contiguous ranges. But if you use the high bits to encode
the page type, the ranges won't be contiguous any more. So you can
change page flags for different contiguous ranges.

Yeah, but then you also get absolutely zero THP and fragment flags
support.

As long as you have a contiguous 2MB page with the same page type, I
think you can still get a THP mapping in the GPU page table. But if one
page in the middle of a 2MB page has a different page type, that will
break the THP mapping, as it should.
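The contiguity argument can be sketched as follows. This is a userspace illustration under stated assumptions, not driver code: the page type sits in the top two bits of each entry, pages are 4 KiB, and the helper names same_type_run and can_map_as_2mb are hypothetical. Because the type is part of the encoded value, a single comparison of consecutive entries checks both physical contiguity and equal page type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_4K    UINT64_C(4096)
#define PAGES_PER_2MB   512
#define HUGE_SIZE_2MB   (PAGE_SIZE_4K * PAGES_PER_2MB)
#define PAGE_TYPE_SHIFT 62                      /* assumption: type in top 2 bits */
#define PAGE_TYPE_MASK  (UINT64_C(3) << PAGE_TYPE_SHIFT)

/*
 * Length of the run starting at 'start' in which every entry is exactly one
 * page after its predecessor. A change of page type flips the high bits, so
 * it ends the run just like a physical discontinuity does.
 */
static size_t same_type_run(const uint64_t *entries, size_t start, size_t n)
{
    size_t len = 1;

    assert(entries != NULL);
    if (start >= n)
        return 0;
    while (start + len < n &&
           entries[start + len] == entries[start + len - 1] + PAGE_SIZE_4K)
        len++;
    return len;
}

/* True if the 512 entries at 'start' form one 2MB-aligned run of one type. */
static bool can_map_as_2mb(const uint64_t *entries, size_t start, size_t n)
{
    if (start >= n || n - start < PAGES_PER_2MB)
        return false;
    if ((entries[start] & ~PAGE_TYPE_MASK) % HUGE_SIZE_2MB != 0)
        return false;
    return same_type_run(entries, start, start + PAGES_PER_2MB) == PAGES_PER_2MB;
}
```

A single page of a different type in the middle of the 2MB region splits the run, so the region falls back to smaller mappings.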

Yeah, but currently we detect that before we call down into the 
functions to update the tables.

When you pass in a mixed list, the chance for that is just completely gone.

Regards,
Christian.

Regards,
   Felix

But I think we could also add those later on.

Regards,
Christian.

Regards,
    Felix

We would somehow need to change that to get the flags directly from
the low bits of the DMA address or something instead.

Christian.

Once this is done, it leads to a number of opportunities for
simplification and better efficiency in kfd_svm:

     * Support migration when cpages != npages
     * Migrate a part of an svm_range without splitting it. No more
       splitting of ranges in CPU page faults
     * Migrate a part of an svm_range in GPU page fault handler. No more
       migrating the whole range for a single page fault
     * Simplified VRAM management (see below)

With that, svm_range will no longer have an "actual_loc" field. If we're
not sure where the data is, we need to call migrate. If it's already in
the right place, then cpages will be 0 and we can skip all the steps
after migrate_vma_setup.

Simplified VRAM management
==========================

VRAM BOs are no longer associated with pranges. Instead they are
"free-floating", allocated during migration to VRAM, with a reference
count for each page that uses the BO. The ref is released in the
page-release callback. When the ref count drops to 0, free the BO.
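The lifetime rule can be sketched like this. It is a single-threaded userspace illustration only: struct vram_bo, vram_bo_alloc, vram_bo_page_get, vram_bo_page_put and the live_bos counter are all hypothetical names, and real kernel code would need an atomic reference count and the actual BO allocation and free paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical free-floating VRAM BO, not tied to any svm_range. */
struct vram_bo {
    unsigned int refcount;  /* one reference per page still backed by this BO */
    size_t size;            /* e.g. the migration granularity, 2MB by default */
};

static unsigned int live_bos;   /* bookkeeping for this sketch only */

/* Allocate a BO during migration to VRAM; it starts with no page references. */
static struct vram_bo *vram_bo_alloc(size_t size)
{
    struct vram_bo *bo = calloc(1, sizeof(*bo));

    if (!bo)
        return NULL;
    bo->size = size;
    live_bos++;
    return bo;
}

/* Called once for each page that was successfully migrated into the BO. */
static void vram_bo_page_get(struct vram_bo *bo)
{
    bo->refcount++;
}

/*
 * Stand-in for the page-release callback: drop one page reference and free
 * the BO when the last page is gone. Returns true if the BO was freed.
 */
static bool vram_bo_page_put(struct vram_bo *bo)
{
    assert(bo->refcount > 0);
    if (--bo->refcount > 0)
        return false;
    free(bo);
    live_bos--;
    return true;
}
```

Since pages can be released one at a time, the BO may stay partially occupied for a while and is only freed once its last page is released.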

VRAM BO size should match the migration granularity, 2MB by default.
That way the BO can be freed when memory gets migrated out. If migration
of some pages fails, the BO may not be fully occupied. Also some pages
may be released individually on A+A due to COW or other events.

Eviction needs to migrate all the pages still using the BO. If the BO
struct keeps an array of page pointers, that's basically the migrate.src
for the eviction. Migration calls "try_to_unmap", which has the best
chance of freeing the BO, even when shared by multiple processes.

If we cannot guarantee eviction of pages, we cannot use TTM for VRAM
allocations. Need to use amdgpu_vram_mgr. Need a way to detect memory
pressure so we can start evicting memory.

Regards,
     Felix
