Hi, Jason.

I've quickly read through the discussion here and have a couple of questions and clarifications to hopefully help move forward on aligning on an approach. For simplicity, let's initially ignore migration and assume this is on integrated hardware, since the disconnect seems to be around the hmm_range_fault() usage.

First, the gpu_vma structure is something that partitions the gpu_vm and holds gpu-related range metadata: what to mirror, desired gpu caching policies, etc. These are managed (created, removed and split) mainly from user-space, and are stored in and looked up from an rb-tree. Each such mirroring gpu_vma holds an mmu_interval notifier.

For invalidation-only purposes, the mmu_interval seqno is not tracked; an invalidation only zaps page-table entries, causing subsequent accesses to fault. Hence, for this purpose, having a single notifier that covers a huge range is desirable and does not become a problem.

Now, when we hit a fault, we want to use hmm_range_fault() to re-populate the faulting PTE, but also to pre-fault a range. Using a range here (let's call it a prefault range for clarity) rather than inserting a single PTE has multiple motivations:

1) Avoiding subsequent adjacent faults.
2a) Using huge GPU page-table entries.
2b) Updating the GPU page-table (tree-based and multi-level) becomes more efficient when done over a range.

Depending on hardware, 2a may be more or less crucial for GPU performance. 2b ties somewhat into 2a but otherwise does not affect gpu performance. This is why we've been using dma_map_sg() for these ranges: we assume the benefits gained from 2) above by far outweigh any benefit from finer-granularity dma-mappings on the rarer occasion of faults. Are there other benefits from single-page dma mappings that you think we need to consider here?
Second, when pre-faulting a range like this, the mmu_interval notifier seqno comes into play until the gpu ptes for the prefault range are safely in place. If an invalidation now happens in a completely separate part of the mirror range, it will bump the seqno and force us to re-run the fault processing unnecessarily. For this purpose we ideally want a seqno bump covering only the prefault range, which is why finer-granularity mmu_interval notifiers might be beneficial (and could then be cached for future re-use of the same prefault range).

This leads me to the next question: you mention that mmu_notifiers are expensive to register. From looking at the code it seems *mmu_interval* notifiers are cheap unless there are ongoing invalidations, in which case a gpu_vma-wide notifier would block anyway? Could you clarify the cost involved here a bit more? And if we don't register these smaller-range interval notifiers, do you think the seqno bumps from unrelated subranges would be a real problem?

Finally, the size of the pre-faulting range is something we need to tune. Currently it is CPU-vma-wide, and I understand you strongly suggest this should be avoided. Could you elaborate a bit on why it is such a bad choice?

Thanks,
Thomas