On Fri, Jan 31, 2025 at 09:47:39AM +0100, David Hildenbrand wrote:
> On 31.01.25 03:59, Alistair Popple wrote:
> > I have a few topics that I would like to discuss around ZONE_DEVICE pages
> > and their current and future usage in the kernel. Generally these pages are
> > used to represent various forms of device memory (PCIe BAR space, coherent
> > accelerator memory, persistent memory, unaddressable device memory). All
> > of these require special treatment by the core MM, so many features must be
> > implemented specifically for ZONE_DEVICE pages.
> >
> > I would like to get feedback on several ideas I've had for a while:
> >
> > Large page migration for ZONE_DEVICE pages
> > ==========================================
> >
> > Currently large ZONE_DEVICE pages only exist for persistent memory use cases
> > (DAX, FS DAX). This involves a special reference counting scheme which I hope
> > to have fixed[1] by the time of LSF/MM/BPF. Fixing this allows for other
> > higher-order ZONE_DEVICE folios.
> >
> > Specifically I would like to introduce the possibility of migrating large CPU
> > folios to unaddressable (DEVICE_PRIVATE) or coherent (DEVICE_COHERENT) memory.
> > The current interfaces (migrate_vma) don't allow that, as they require all
> > folios to be split.
> >
> 
> Hi,
> 
> > Some of the issues are:
> >
> > 1. What should the interface look like?
> >
> > These are non-lru pages, so there is likely overlap with "non-lru page
> > migration in a memdesc world"[2].
> 
> Yes, although these (what we called "non-lru migration" before ZONE_DEVICE
> popped up) are currently all order-0. Likely this will change at some point,
> but not sure if there is currently a real demand for it.
> 
> Agreed that there is quite some overlap. E.g., no page->lru field, and the
> problem about splitting large allocations etc.
> 
> For example, balloon-inflated pages are currently all order-0. If we'd want
> to support something larger but still allow for reliable balloon compaction
> under memory fragmentation, we'd want an option to split-before-migration
> (similar to what you describe below).
> 
> Alternatively, we can just split right at the start: if the balloon
> allocated a 2MiB compound page, it can just split it to 512 order-0 pages
> and allow for migration of the individual pieces. Both approaches have their
> pros and cons.
> 
> Anyway: "non-lru migration" is not quite expressive. It's likely going to
> be:
> 
> (1) LRU folio migration
> (2) non-LRU folio migration (->ZONE_DEVICE)
> (3) non-folio migration (balloon, zsmalloc, ...)
> 
> (1) and (2) have things in common (e.g., rmap, folio handling) and (2) and
> (3) have things in common (e.g., no ->lru field).
> 
> Would there be something ZONE_DEVICE based that we want to migrate and that
> will not be a folio (iow, not mapped into user page tables etc)?

I'm not aware of any such use cases. Your case (2) above is what I was
thinking about.

> >
> > 2. How do we allow merging/splitting of pages during migration?
> >
> > This is necessary because when migrating back from device memory there may
> > not be enough large CPU pages available.
> >
> > 3. Any other issues?
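
To make the interface question (1) quoted above a little more concrete, here
is one possible shape for the driver side, as a rough, untested sketch only.
It is not working code: MIGRATE_VMA_SELECT_COMPOUND_SKETCH is a made-up flag,
and today migrate_vma_setup() would still hand back order-0 entries. The point
is just to show where an order-aware option could slot into the existing
migrate_vma_setup()/migrate_vma_pages()/migrate_vma_finalize() flow.

#include <linux/migrate.h>
#include <linux/slab.h>

/* Hypothetical: report large folios as single src/dst entries. */
#define MIGRATE_VMA_SELECT_COMPOUND_SKETCH	(1 << 3)

static int sketch_migrate_range_to_device(struct vm_area_struct *vma,
					  unsigned long start,
					  unsigned long end,
					  void *pgmap_owner)
{
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	unsigned long *src, *dst;
	struct migrate_vma args;
	int ret = -ENOMEM;

	src = kcalloc(npages, sizeof(*src), GFP_KERNEL);
	dst = kcalloc(npages, sizeof(*dst), GFP_KERNEL);
	if (!src || !dst)
		goto out;

	args = (struct migrate_vma) {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src,
		.dst		= dst,
		.pgmap_owner	= pgmap_owner,
		/* Today only order-0 entries come back; the extra flag is the sketch. */
		.flags		= MIGRATE_VMA_SELECT_SYSTEM |
				  MIGRATE_VMA_SELECT_COMPOUND_SKETCH,
	};

	ret = migrate_vma_setup(&args);
	if (ret)
		goto out;

	/*
	 * The driver would walk args.src here, allocating a large device
	 * folio where a compound entry was reported and order-0 device
	 * pages otherwise, then fill in args.dst before the copy. If no
	 * large device folio is available it could clear the flag and
	 * retry, forcing a split.
	 */

	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	ret = 0;
out:
	kfree(src);
	kfree(dst);
	return ret;
}

The awkward part is the return path, which is really question (2) above: when
migrating back to the CPU under memory fragmentation there may be no large
folio to migrate into, so either the core code or the driver needs a way to
fall back to splitting.
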
> >
> > [1] - https://lore.kernel.org/linux-mm/cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@xxxxxxxxxx/
> > [2] - https://lore.kernel.org/linux-mm/2612ac8a-d0a9-452b-a53d-75ffc6166224@xxxxxxxxxx/
> >
> > File-backed DEVICE_PRIVATE/COHERENT pages
> > =========================================
> >
> > Currently DEVICE_PRIVATE and DEVICE_COHERENT pages are only supported for
> > private anonymous memory. This prevents devices from having local access to
> > shared or file-backed mappings, forcing them to rely on remote DMA access,
> > which limits performance.
> >
> > I have been prototyping allowing ZONE_DEVICE pages in the page cache with
> > a callback when the CPU requires access.
> 
> Hmm, things like read/write/writeback get more tricky. How would you
> write back content from a ZONE_DEVICE folio? Likely that's not possible.

The general gist is somewhat analogous to what happens when the CPU faults on
a DEVICE_PRIVATE page. Except obviously it wouldn't be a fault; rather,
whenever something looked up the page-cache entry and found a DEVICE_PRIVATE
page, we would have a driver callback somewhat similar to migrate_to_ram()
that would copy the data back to normal system memory. IOW the CPU would
always own the page and could always get it back. A very rough sketch of what
the lookup-side hook could look like is at the end of this mail.

It has been a while since I last looked at this problem though (FS DAX
refcount cleanups took way longer than expected!), but I recall having this
at least somewhat working. I will see if I can get it cleaned up and posted
as an RFC soon.

> So I'm not sure if we want to go down that path; it will be great to learn
> about your approach and your findings.
> 
> [...]
> 
> There is a lot of interesting stuff in there; I assume too much for a single
> session :)

And probably way more than I can get done in a year :-)

> --
> Cheers,
> 
> David / dhildenb
> 
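
As a strawman for the file-backed DEVICE_PRIVATE idea above, the lookup-side
hook could look something like the below. This is only an illustrative sketch,
not the prototype: sketch_filemap_get_folio() and the copy_to_ram() op are
made-up names, and the actual replacement of the page-cache entry is elided.

#include <linux/memremap.h>
#include <linux/pagemap.h>

/* Hypothetical extra op, alongside the existing dev_pagemap_ops. */
struct sketch_pagemap_ops {
	/* Copy the device folio's contents into a system-memory folio. */
	int (*copy_to_ram)(struct folio *device_folio, struct folio *new_folio);
};

static struct folio *sketch_filemap_get_folio(struct address_space *mapping,
					      pgoff_t index)
{
	struct folio *folio = filemap_get_folio(mapping, index);

	if (IS_ERR(folio))
		return folio;

	if (folio_is_device_private(folio)) {
		/*
		 * The CPU (read/write, writeback, a CPU mapping, ...) needs
		 * the contents: allocate a system-memory folio, have the
		 * driver copy the data back via copy_to_ram(), swap the
		 * page-cache entry over and return the new folio instead.
		 */
	}

	return folio;
}

Writeback would go through the same path: the data is always copied back into
a normal system-memory folio first, so the ZONE_DEVICE folio itself is never
written back directly, which is what "the CPU always owns the page" is meant
to guarantee.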