Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

On Wed, Apr 24, 2024 at 11:59:18PM +0000, Zeng, Oak wrote:
> Hi Jason,
> 
> I went through the conversation between you and Matt. I think we are
> pretty much aligned. Here is what I get from this thread:
> 
> 1) hmm range fault size / gpu page table map size: you prefer a bigger
> gpu vma size, and a vma can be sparsely mapped to the gpu. Our vma
> size is configurable through a user madvise api.

That is even worse! It is not a user tunable in any way shape or form!

> 2) invalidation: you prefer giant notifier. We can consider this if
> it turns out our implementation is not performant. Currently we
> don't know.

It is the wrong way to use the API to have many small notifiers;
please don't use it wrong.
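
To make the intended structure concrete, here is a rough sketch (all
names here are invented for illustration, this is not the Xe code): one
wide mmu_interval_notifier registered per device SVA context, covering
the whole span the GPU may ever touch, instead of one notifier per
small chunk:

#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

/* Hypothetical per-process SVA context for a device. */
struct dev_sva {
        struct mmu_interval_notifier notifier;
        struct mutex pt_lock;           /* protects the GPU page table */
        unsigned long start, end;       /* span covered by the notifier */
};

/* Body sketched further down in this mail. */
static bool dev_sva_invalidate(struct mmu_interval_notifier *mni,
                               const struct mmu_notifier_range *range,
                               unsigned long cur_seq);

static const struct mmu_interval_notifier_ops dev_sva_ops = {
        .invalidate = dev_sva_invalidate,
};

/* One notifier for the whole SVA span, not one per 2M "xe_vma". */
static int dev_sva_register(struct dev_sva *sva, struct mm_struct *mm,
                            unsigned long start, unsigned long length)
{
        sva->start = start;
        sva->end = start + length;
        mutex_init(&sva->pt_lock);
        return mmu_interval_notifier_insert(&sva->notifier, mm, start,
                                            length, &dev_sva_ops);
}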
 
> 3) whether the driver can look up the cpu vma. I think we need this
> for data migration purposes.

The migration code may, but not the SVA/hmm_range_fault code.

> > > What do you mean by "page by page"/"no API surface available to
> > > safely try to construct covering ranges"? As I understand it,
> > > hmm_range_fault takes a virtual address range (defined in the
> > > hmm_range struct) and walks the cpu page table in this range. It
> > > is a range based API.
> > 
> > Yes, but invalidation is not linked to range_fault so you can get
> > invalidations of single pages. You are binding range_fault to
> > dma_map_sg but then you can't handle invalidation at all sanely.
> 
> Ok, I understand your point now.
> 
> Yes, strictly speaking we can get invalidation of a single page. This
> can be triggered by core mm numa balancing or ksm (kernel samepage
> merging). At present my understanding is that single-page (or a few
> pages) invalidation is not a very common case. The more common cases
> are invalidations triggered by user munmap, or by hmm migration itself
> (triggered in migrate_vma_setup). I will experiment with this.

Regardless it must be handled and unmapping an entire 'gpu vma' is a
horrible implementation of HMM.
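
Handling it properly looks roughly like this (continuing the invented
dev_sva sketch above; dev_zap_gpu_ptes() is a made-up placeholder for
the driver's PTE zap): clamp the CPU-side range to what the notifier
covers and unmap only those GPU PTEs, which may be a single page,
instead of tearing down a whole driver-side vma:

static bool dev_sva_invalidate(struct mmu_interval_notifier *mni,
                               const struct mmu_notifier_range *range,
                               unsigned long cur_seq)
{
        struct dev_sva *sva = container_of(mni, struct dev_sva, notifier);
        unsigned long start, end;

        if (!mmu_notifier_range_blockable(range))
                return false;

        mutex_lock(&sva->pt_lock);
        mmu_interval_set_seq(mni, cur_seq);

        /* Only the overlap of the CPU invalidation with what the
         * device has mapped - possibly a single page. */
        start = max(range->start, sva->start);
        end = min(range->end, sva->end);
        if (start < end)
                dev_zap_gpu_ptes(sva, start, end); /* hypothetical helper */

        mutex_unlock(&sva->pt_lock);
        return true;
}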

> I agree that in the case of single-page invalidation the current code
> is not performant, because we invalidate the whole vma. What I can do
> is look at the mmu_notifier_range parameter of the invalidation
> callback and only invalidate that range. That requires our driver to
> split the vma state and page table state. It is a big change. We plan
> to do it in stage 2.

Which is, again, continuing to explain why this VMA-based approach is
a completely wrong way to use hmm.

> > I understand, but SVA/hmm_range_fault/invalidation are *NOT* VMA based
> > and you do need to ensure the page table manipulation has an API that
> > is usable. "populate an entire VMA" / "invalidate an entire VMA" is
> > not a sufficient primitive.
> 
> I understand that invalidating the entire VMA might not be performant.
> I will improve this as explained above.

Please stop saying performant. There are correct ways to use hmm and
bad ways. What you are doing is a bad way. Even if the performance is
only a little different it is still the kind of horrible code
structure I don't want to see in new hmm users.

> I think whether we want to populate the entire VMA or only one page is
> still the driver's choice.

hmm_range_fault() should be driven with a prefetch / fault-around
scheme. This has nothing to do with a durable VMA record and addresses
these concerns.
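
Concretely, and still with invented names (a sketch assuming the
dev_sva context above, not your code): on a GPU fault the driver picks
a window around the faulting address, clamps it to the span the
notifier covers, and drives hmm_range_fault() with the usual seqcount
retry loop. Only the faulting page is required; the rest of the window
is opportunistic prefetch, and the window size is just a prefetch knob,
not a durable vma:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/sizes.h>
#include <linux/slab.h>

/* Illustrative fault-around window; a prefetch tunable, nothing more. */
#define SVA_FAULT_AROUND        SZ_2M

/* The caller is assumed to hold a reference on @mm. */
static int dev_sva_handle_fault(struct dev_sva *sva, struct mm_struct *mm,
                                unsigned long fault_addr, bool is_write)
{
        unsigned long start = max(round_down(fault_addr, SVA_FAULT_AROUND),
                                  sva->start);
        unsigned long end = min(start + SVA_FAULT_AROUND, sva->end);
        unsigned long npages = (end - start) >> PAGE_SHIFT;
        unsigned long *pfns;
        struct hmm_range range = {
                .notifier       = &sva->notifier,
                .start          = start,
                .end            = end,
                /* honour the per-page request flags set below */
                .default_flags  = 0,
                .pfn_flags_mask = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
        };
        int ret;

        pfns = kvcalloc(npages, sizeof(*pfns), GFP_KERNEL);
        if (!pfns)
                return -ENOMEM;
        range.hmm_pfns = pfns;

again:
        /* Require only the page the GPU faulted on; the rest of the
         * window is snapshotted opportunistically as prefetch. */
        pfns[(fault_addr - start) >> PAGE_SHIFT] =
                HMM_PFN_REQ_FAULT | (is_write ? HMM_PFN_REQ_WRITE : 0);

        range.notifier_seq = mmu_interval_read_begin(&sva->notifier);
        mmap_read_lock(mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(mm);
        if (ret == -EBUSY)
                goto again;
        if (ret)
                goto out;       /* e.g. -EFAULT: the GPU access was bad */

        mutex_lock(&sva->pt_lock);
        if (mmu_interval_read_retry(&sva->notifier, range.notifier_seq)) {
                mutex_unlock(&sva->pt_lock);
                goto again;
        }
        /* Program GPU PTEs from the HMM_PFN_VALID entries of
         * range.hmm_pfns[0..npages) here. */
        mutex_unlock(&sva->pt_lock);
out:
        kvfree(pfns);
        return ret;
}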

> Do you suggest per-page population? Or do you suggest populating the
> entire address space or the entire memory region? I did look at the
> RDMA odp code. The function ib_umem_odp_map_dma_and_lock also does
> range based population. It seems it populates the whole memory region
> each time, but I am not very sure.

Performance here is very complicated. You often want to allow
userspace to prefetch data into the GPU page table to warm it up and
avoid page faults, as faults are generally super slow and hard to
manage performantly. ODP has many options to control this at a fine
granularity.

> > You can create VMAs, but they can't be fully populated and they
> > should be a lot bigger than 2M. ODP uses a similar 2-level scheme
> > for its SVA; the "vma" granule is 512M.
> 
> Oh, I see. So you are suggesting a much bigger granularity. That makes
> sense to me. Our design actually supports a much bigger granularity.
> The migration/population granularity is configurable in our design. It
> is a memory advise API, and one of the advises is called
> "MIGRATION_GRANULARITY".

I don't think the notifier granule should be user tunable. You are
actually doing something different - it sounds like it mostly acts as
a prefetch tunable.

> > The key thing is the VMA is just a container for the notifier and
> > other data structures, it doesn't insist the range be fully
> > populated and it must support page-by-page unmap/remap/invalidation.
> 
> Agreed, and I don't see a hard conflict between our implementation and
> this concept. The linux core mm vma (struct vm_area_struct) represents
> an address range in the process's address space. The Xe driver would
> create some xe_vmas to cover a sub-region of a core mm vma.

No again and again NO. hmm_range_fault users are not, and must not be,
linked to CPU VMAs, ever.

> > mapping is driven by the range passed to hmm_range_fault() during
> > fault handling, which is entirely based on your driver's prefetch
> > logic.
> 
> In our driver, mapping can be triggered by either prefetch or fault.
> 
> Prefetch is a user API, so the user can decide the range.
> 
> The range used on a fault is decided by the MIGRATION_GRANULARITY user
> setting. The default value is 2MiB, as said.

Which is basically turning it into a fault prefetch tunable.
 
> In our implementation, our page table is not organized as a radix
> tree. 

Huh? What does the HW in the GPU do to figure out how to issue DMAs?
Your GPU HW doesn't have a radix tree walker you are configuring?

> Maybe this is an area we can improve. For invalidation, we don't need
> to walk the page table to figure out which range is present in the
> page table, since we only register a mmu notifier when a range is
> actually mapped to the gpu page table. So every notifier callback
> comes with a valid range in the gpu page table.

But this is not page by page.

> I agree that many small 2M notifiers can slow down the red/black tree
> walk on the core mm side. But with a giant notifier, core mm calls
> back into the driver many more times than with small notifiers - the
> driver would be called back even for a range that is not mapped to the
> gpu page table.

The driver should do a cheaper check than a red/black tree check to
decide if it needs to do any work, eg with a typical radix page table,
or some kind of fast sparse bitmap. At a 2M granule, with 100GB
workloads these days, this is an obviously losing idea.
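
Purely as an illustration of such a cheap check (granule size, bitmap
field and helper name are invented; it assumes the hypothetical dev_sva
context above grows an unsigned long *mapped_bitmap with one bit per
granule, set and cleared as GPU PTEs are programmed and zapped):

#include <linux/bitmap.h>

/* Hypothetical: one bit per 2M granule of the SVA span. */
#define SVA_GRANULE_SHIFT       21

static bool dev_sva_has_mapping(struct dev_sva *sva,
                                unsigned long start, unsigned long end)
{
        unsigned long first, last;

        if (start >= sva->end || end <= sva->start)
                return false;

        first = (max(start, sva->start) - sva->start) >> SVA_GRANULE_SHIFT;
        last = (min(end, sva->end) - 1 - sva->start) >> SVA_GRANULE_SHIFT;

        /* A word-at-a-time bitmap scan: no rb-tree walk and no per-2M
         * notifier registration needed. */
        return find_next_bit(sva->mapped_bitmap, last + 1, first) <= last;
}

The invalidate callback sketched earlier would still bump the sequence
with mmu_interval_set_seq(), but a check like this lets it skip the GPU
page table walk entirely when the CPU range overlaps nothing the device
has mapped.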

> > I want to be very clear, there should be no interaction of your
> > hmm_range_fault based code with MM centric VMAs. You MUST NOT look
> > up the CPU vm_area_struct in your driver.
> 
> Without looking up the cpu vma, we can't even decide whether our gpu
> accessed a valid address or not.

hmm_range_fault provides this for you.
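
To illustrate (a sketch only, same caveats as the earlier ones):
whether the GPU touched a valid address falls straight out of
hmm_range_fault()'s documented return codes, so there is no
vm_area_struct lookup anywhere in the fault path:

#include <linux/hmm.h>

/* Caller holds mmap_read_lock() and has set up @range as in the
 * fault-around sketch above. */
static int dev_sva_check_access(struct hmm_range *range)
{
        int ret = hmm_range_fault(range);

        switch (ret) {
        case 0:
                return 0;      /* valid: hmm_pfns[] describes the pages */
        case -EBUSY:
                return -EBUSY; /* raced with an invalidation, retry */
        case -EFAULT:          /* no backing VMA, or an illegal access */
        case -EPERM:           /* e.g. a GPU write to a read-only mapping */
        default:
                return ret;    /* report a bad GPU access, no vma lookup */
        }
}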
 
> Furthermore, we call the hmm helper migrate_vma_setup for migration,
> which takes a struct migrate_vma parameter. migrate_vma has a vma
> field. If we don't look up the cpu vma, how do we get this field?

Migration is a different topic. The vma lookup for a migration has
nothing to do with hmm_range_fault and SVA.

(and perhaps arguably this is structured wrong as you probably want
hmm_range_fault to do the migrations for you as it walks the range to
avoid double walks and things)

Jason


