Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages

On Mon, 2025-02-03 at 11:08 -0400, Jason Gunthorpe wrote:
> On Fri, Jan 31, 2025 at 05:59:26PM +0100, Simona Vetter wrote:
> 
> > So one aspect where I don't like the pgmap->owner approach much is
> > that it's a big thing to get right, and it feels a bit to me that
> > we don't yet know the right questions.
> 
> Well, I would say it isn't really complete yet. No driver has yet
> attempted to use a private interconnect with this scheme. Probably
> it needs more work.
> 
> > A bit related is that we'll have to do some driver-specific
> > migration after hmm_range_fault anyway for allocation policies.
> > With coherent interconnect that'd be up to numactl, but for driver
> > private it's all up to the driver. And once we have that, we can
> > also migrate memory around that's misplaced for functional and not
> > just performance reasons.
> 
> Are you sure? This doesn't seem to be what any hmm_range_fault()
> user should be doing. hmm_range_fault() is to help mirror the page
> table to a secondary, that is all. Migration policy shouldn't be
> part of it; just mirroring doesn't necessarily mean any access was
> performed, for instance.
> 
> And mirroring doesn't track any access done by non-faulting cases
> either.
> 
> > The plan I discussed with Thomas a while back at least for gpus
> > was to have that as a drm_devpagemap library,
> 
> I would not be happy to see this. Please improve pagemap directly if
> you think you need more things.

These are mainly helpers to migrate and populate a range of CPU address
space (struct mm_struct) with GPU device_private memory, to migrate
back to system memory on GPU memory shortage, and to implement the
migrate_to_vram pagemap op. They are tied to GPU device memory
allocations, so I don't think there is anything we should be exposing
at the dev_pagemap level at this point?
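To make the shape of these helpers concrete, here is a toy userspace
model of the "populate with device memory, fall back to system on
shortage" part. All names and structures below are illustrative
stand-ins I made up for the sketch, not the actual helper API:

```c
#include <stdbool.h>
#include <assert.h>

/* Toy model: what each page of a CPU address range is backed by. */
enum backing { BACK_NONE, BACK_SYSTEM, BACK_DEVICE };

struct toy_range {
	enum backing pages[16];
};

/*
 * Try to back @npages pages of the range with device-private memory,
 * falling back to system memory when the device allocation runs dry.
 * @dev_free counts device pages still available; returns how many
 * pages of the range ended up in device memory.
 */
static int populate_range(struct toy_range *r, int npages, int *dev_free)
{
	int on_device = 0;

	for (int i = 0; i < npages; i++) {
		if (*dev_free > 0) {		/* device allocation succeeded */
			r->pages[i] = BACK_DEVICE;
			(*dev_free)--;
			on_device++;
		} else {			/* shortage: stay in system memory */
			r->pages[i] = BACK_SYSTEM;
		}
	}
	return on_device;
}
```

The real helpers would of course go through the migrate_vma machinery
rather than flipping an enum, but the fallback-on-shortage policy is
the same.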

> 
> > which would have a common owner (or maybe per driver or so as
> > Thomas suggested).
> 
> Neither really match the expected design here. The owner should be
> entirely based on reachability. Devices that cannot reach each other
> directly should have different owners.

Actually, what I'm putting together is a small helper that allocates
and assigns an "owner" based on devices previously registered in a
"registry". The caller indicates, through a callback invoked for each
struct device pair, whether a fast interconnect is available, and this
is expected to happen at pagemap creation time, so I think it aligns
with the above. Initially a "registry" (a list of device-owner pairs)
will be driver-local, but it could easily be given wider scope.
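Roughly, the registry could look like the toy model below (a userspace
sketch; the names, the fixed-size array and the malloc-as-cookie trick
are all illustrative stand-ins, not the proposed kernel code):

```c
#include <stdbool.h>
#include <stdlib.h>
#include <assert.h>

/* Sketch of a registry entry: device plus its assigned owner cookie. */
struct dev_entry {
	int dev_id;	/* stand-in for struct device * */
	void *owner;	/* stand-in for pgmap->owner */
};

#define MAX_DEVS 16
static struct dev_entry registry[MAX_DEVS];
static int nr_devs;

/*
 * Caller-supplied predicate: is there a fast interconnect between the
 * two devices? Mirrors the per-pair callback described above.
 */
typedef bool (*interconnect_fn)(int dev_a, int dev_b);

/*
 * At pagemap creation time, pick an owner: reuse the owner of the
 * first already-registered device we can reach over a fast
 * interconnect, otherwise allocate a fresh unique cookie.
 */
static void *registry_assign_owner(int dev_id, interconnect_fn reachable)
{
	void *owner = NULL;

	for (int i = 0; i < nr_devs; i++) {
		if (reachable(dev_id, registry[i].dev_id)) {
			owner = registry[i].owner;
			break;
		}
	}
	if (!owner)
		owner = malloc(1);	/* unique cookie, as with void *owner today */

	registry[nr_devs].dev_id = dev_id;
	registry[nr_devs].owner = owner;
	nr_devs++;
	return owner;
}
```

This keeps Jason's invariant that owners encode reachability: devices
that share a fast interconnect end up with the same cookie, isolated
devices get their own.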

This means we handle access control, unplug checks and the like at
migration time, typically before hmm_range_fault(), and the role of
hmm_range_fault() will be to hand over pfns whose backing memory is
directly accessible to the device, else migrate to system.

Device unplug would then be handled by refusing migrations to the
device (gpu drivers would probably use drm_dev_enter()), and then
evicting all device memory after drm_dev_unplug(). This of course
relies on eviction being more or less failsafe.
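As a toy model of that unplug scheme (illustrative names only, with a
bool standing in for the drm_dev_enter()/drm_dev_unplug() machinery):

```c
#include <stdbool.h>
#include <assert.h>

enum page_loc { LOC_SYSTEM, LOC_DEVICE };

struct toy_dev {
	bool alive;		/* cleared at drm_dev_unplug() time */
	enum page_loc pages[8];
};

/* Mirrors the drm_dev_enter() check: succeed only while alive. */
static bool migrate_to_device(struct toy_dev *d, int page)
{
	if (!d->alive)
		return false;	/* refuse migration after unplug */
	d->pages[page] = LOC_DEVICE;
	return true;
}

/*
 * Eviction must be failsafe: after unplug, every page must end up
 * back in system memory, since the device can no longer serve it.
 */
static void evict_all(struct toy_dev *d)
{
	for (int i = 0; i < 8; i++)
		d->pages[i] = LOC_SYSTEM;
}
```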

/Thomas

> 
> > But upfront speccing all this out doesn't seem like a good idea,
> > because I honestly don't know what we all need.
> 
> This is why it is currently just void *owner  :)

Again, with the above I think we are good for now, but having
experimented a lot with the callback, I'm still not convinced by the
performance argument, for the following reasons:

1) Existing users would never use the callback. They can still rely on
the owner check; only if that fails do we check for callback existence.
2) By simply caching the result from the last checked dev_pagemap, most
callback calls could typically be eliminated.
3) As mentioned before, a callback call would typically be followed by
either a migration to ram or a page-table update. Compared to these,
the callback overhead would IMO be unnoticeable.
4) Isn't pcie_p2p already planning a dev_pagemap callback?
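Point 2 is easy to model: since pfns returned by hmm_range_fault() tend
to come in runs from the same pagemap, remembering the last verdict
removes most callback invocations. A sketch (all names here are mine,
not a proposed API):

```c
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

/* One-entry cache of the last dev_pagemap verdict. */
struct pagemap_cache {
	const void *last_pgmap;
	bool last_ok;
};

/* Stand-in for the proposed per-pagemap access callback. */
typedef bool (*access_cb)(const void *pgmap, const void *owner);

/*
 * Check accessibility of @pgmap, invoking @cb only on a cache miss.
 * @cb_calls counts actual callback invocations, to show the saving.
 */
static bool pgmap_accessible(struct pagemap_cache *c, const void *pgmap,
			     const void *owner, access_cb cb,
			     unsigned long *cb_calls)
{
	if (c->last_pgmap == pgmap)
		return c->last_ok;	/* hit: no callback needed */

	c->last_pgmap = pgmap;
	c->last_ok = cb(pgmap, owner);
	(*cb_calls)++;
	return c->last_ok;
}
```

For a range of pfns backed by the same pagemap, the callback then fires
once per pagemap change rather than once per pfn.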
 
/Thomas

> 
> Jason
