Re: [LSF/MM/BPF TOPIC] The future of ZONE_DEVICE pages

On 31.01.25 03:59, Alistair Popple wrote:
I have a few topics that I would like to discuss around ZONE_DEVICE pages
and their current and future usage in the kernel. Generally these pages are
used to represent various forms of device memory (PCIe BAR space, coherent
accelerator memory, persistent memory, unaddressable device memory). All
of these require special treatment by the core MM so many features must be
implemented specifically for ZONE_DEVICE pages.

I would like to get feedback on several ideas I've had for a while:

Large page migration for ZONE_DEVICE pages
==========================================

Currently large ZONE_DEVICE pages only exist for persistent memory use cases
(DAX, FS DAX). This involves a special reference counting scheme which I hope to
have fixed[1] by the time of the LSF/MM/BPF. Fixing this allows for other higher
order ZONE_DEVICE folios.

Specifically I would like to introduce the possibility of migrating large CPU
folios to unaddressable (DEVICE_PRIVATE) or coherent (DEVICE_COHERENT) memory.
The current interfaces (migrate_vma) don't allow that as they require all folios
to be split.


Hi,

Some of the issues are:

1. What should the interface look like?

These are non-lru pages, so likely there is overlap with "non-lru page migration
in a memdesc world"[2]

Yes, although these (what we called "non-lru migration" before ZONE_DEVICE popped up) are currently all order-0. Likely this will change at some point, but I'm not sure there is currently real demand for it.

Agreed that there is quite some overlap, e.g., no page->lru field, the problem of splitting large allocations, etc.

For example, balloon-inflated pages are currently all order-0. If we wanted to support something larger but still allow for reliable balloon compaction under memory fragmentation, we'd want an option to split-before-migration (similar to what you describe below).

Alternatively, we can just split right at the start: if the balloon allocated a 2MiB compound page, it can just split it to 512 order-0 pages and allow for migration of the individual pieces. Both approaches have their pros and cons.
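To make the arithmetic behind the 2MiB example explicit, here is a minimal sketch. It assumes 4KiB base pages (so 2MiB is order 9); the helper names are hypothetical and are not kernel API.

```c
#include <assert.h>

/* Assumption: 4 KiB base pages, i.e. a page shift of 12. */
#define BASE_PAGE_SHIFT 12u

/* Number of order-0 pages produced by fully splitting one order-N allocation. */
static unsigned long pages_after_split(unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long));
	return 1UL << order;
}

/* Size in bytes of one order-N allocation under the assumed page shift. */
static unsigned long long order_to_bytes(unsigned int order)
{
	assert(order + BASE_PAGE_SHIFT < 64);
	return 1ULL << (order + BASE_PAGE_SHIFT);
}
```

With these assumptions, order 9 corresponds to 2MiB and splits into 512 order-0 pieces, each of which could then be migrated on its own.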

Anyway: "non-lru migration" is not quite expressive. It's likely going to be:

(1) LRU folio migration
(2) non-LRU folio migration (->ZONE_DEVICE)
(3) non-folio migration (balloon, zsmalloc, ...)

(1) and (2) have things in common (e.g., rmap, folio handling) and (2) and (3) have things in common (e.g., no ->lru field).
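Purely as an illustration of that classification, the shared properties could be written down as below. The enum and helpers are hypothetical names made up for this mail, not existing kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three migration classes listed above. */
enum migration_class {
	MIGRATE_CLASS_LRU_FOLIO,	/* (1) LRU folio migration */
	MIGRATE_CLASS_NON_LRU_FOLIO,	/* (2) non-LRU folio migration, e.g. ZONE_DEVICE */
	MIGRATE_CLASS_NON_FOLIO,	/* (3) non-folio migration, e.g. balloon, zsmalloc */
};

/* (1) and (2) are folios, so they share rmap and folio handling. */
static bool class_is_folio(enum migration_class c)
{
	assert(c <= MIGRATE_CLASS_NON_FOLIO);
	return c != MIGRATE_CLASS_NON_FOLIO;
}

/* (2) and (3) have no ->lru field available; only (1) does. */
static bool class_has_lru_field(enum migration_class c)
{
	assert(c <= MIGRATE_CLASS_NON_FOLIO);
	return c == MIGRATE_CLASS_LRU_FOLIO;
}
```

Class (2) is the one that answers "folio: yes, ->lru: no", which is why it overlaps with both neighbours.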

Would there be something ZONE_DEVICE based that we want to migrate and that will not be a folio (iow, not mapped into user page tables etc)?


2. How do we allow merging/splitting of pages during migration?

This is necessary because when migrating back from device memory there may not
be enough large CPU pages available.

3. Any other issues?
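One way to picture the splitting case in question 2 is a destination-order fallback: try to allocate a CPU folio of the same order as the device folio, and drop to smaller orders when that fails. The sketch below is only a model of that idea. The callback type, both function names, and the stand-in allocator are hypothetical and do not reflect the migrate_vma interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical allocator hook: returns true if an order-N CPU folio is available. */
typedef bool (*try_alloc_fn)(unsigned int order, void *ctx);

/*
 * Pick the largest order <= src_order that the allocator can satisfy.
 * Returns the chosen order, or -1 if not even an order-0 page is available.
 * Any result below src_order means the source folio has to be split.
 */
static int pick_dest_order(unsigned int src_order, try_alloc_fn try_alloc, void *ctx)
{
	assert(try_alloc != NULL);
	for (unsigned int order = src_order; ; order--) {
		if (try_alloc(order, ctx))
			return (int)order;
		if (order == 0)
			break;
	}
	return -1;
}

/* Stand-in allocator for illustration: succeeds up to the order stored in *ctx. */
static bool alloc_up_to(unsigned int order, void *ctx)
{
	return order <= *(unsigned int *)ctx;
}
```

For example, if the source is order 9 but the stand-in allocator only succeeds up to order 2, pick_dest_order() returns 2, and the source would be migrated as 128 order-2 pieces under this model.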

[1] - https://lore.kernel.org/linux-mm/cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@xxxxxxxxxx/
[2] - https://lore.kernel.org/linux-mm/2612ac8a-d0a9-452b-a53d-75ffc6166224@xxxxxxxxxx/

File-backed DEVICE_PRIVATE/COHERENT pages
=========================================

Currently DEVICE_PRIVATE and DEVICE_COHERENT pages are only supported for
private anonymous memory. This prevents devices from having local access to
shared or file-backed mappings; they must instead rely on remote DMA access,
which limits performance.

I have been prototyping allowing ZONE_DEVICE pages in the page cache with
a callback when the CPU requires access.

Hmm, things like read/write/writeback get trickier. How would you write back content from a ZONE_DEVICE folio? Likely that's not possible.

So I'm not sure we want to go down that path; it would be great to learn about your approach and your findings.

[...]


There is a lot of interesting stuff in there; I assume too much for a single session :)

--
Cheers,

David / dhildenb