On Fri, Jan 31, 2025 at 09:47:39AM +0100, David Hildenbrand wrote:
> On 31.01.25 03:59, Alistair Popple wrote:
> > I have a few topics that I would like to discuss around ZONE_DEVICE pages
> > and their current and future usage in the kernel. Generally these pages are
> > used to represent various forms of device memory (PCIe BAR space, coherent
> > accelerator memory, persistent memory, unaddressable device memory). All
> > of these require special treatment by the core MM, so many features must be
> > implemented specifically for ZONE_DEVICE pages.
> >
> > I would like to get feedback on several ideas I've had for a while:
> >
> > Large page migration for ZONE_DEVICE pages
> > ==========================================
> >
> > Currently large ZONE_DEVICE pages only exist for persistent memory use cases
> > (DAX, FS DAX). This involves a special reference counting scheme which I hope
> > to have fixed[1] by the time of LSF/MM/BPF. Fixing this allows for other
> > higher-order ZONE_DEVICE folios.
> >
> > Specifically I would like to introduce the possibility of migrating large CPU
> > folios to unaddressable (DEVICE_PRIVATE) or coherent (DEVICE_COHERENT) memory.
> > The current interfaces (migrate_vma) don't allow that, as they require all
> > folios to be split.
> >
> 
> Hi,
> 
> > Some of the issues are:
> >
> > 1. What should the interface look like?
> >
> > These are non-lru pages, so there is likely overlap with "non-lru page
> > migration in a memdesc world"[2].
> 
> Yes, although these (what we called "non-lru migration" before ZONE_DEVICE
> popped up) are currently all order-0. Likely this will change at some point,
> but not sure if there is currently a real demand for it.
> 
> Agreed that there is quite some overlap. E.g., no page->lru field, and the
> problem about splitting large allocations etc.
> 
> For example, balloon-inflated pages are currently all order-0. If we'd want
> to support something larger but still allow for reliable balloon compaction
> under memory fragmentation, we'd want an option to split-before-migration
> (similar to what you describe below).
> 
> Alternatively, we can just split right at the start: if the balloon
> allocated a 2MiB compound page, it can just split it to 512 order-0 pages
> and allow for migration of the individual pieces. Both approaches have their
> pros and cons.
> 
> Anyway: "non-lru migration" is not quite expressive. It's likely going to
> be:
> 
> (1) LRU folio migration
> (2) non-LRU folio migration (->ZONE_DEVICE)
> (3) non-folio migration (balloon, zsmalloc, ...)
> 
> (1) and (2) have things in common (e.g., rmap, folio handling) and (2) and
> (3) have things in common (e.g., no ->lru field).
> 
> Would there be something ZONE_DEVICE based that we want to migrate and that
> will not be a folio (iow, not mapped into user page tables etc)?

I'm not aware of any such use cases. Your case (2) above is what I was
thinking about.

> >
> > 2. How do we allow merging/splitting of pages during migration?
> >
> > This is necessary because when migrating back from device memory there may
> > not be enough large CPU pages available.
> >
> > 3. Any other issues?
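
To make the interface question (1) quoted above a little more concrete, here
is one possible shape for the driver side, as a rough, untested sketch only.
It is not working code: MIGRATE_VMA_SELECT_COMPOUND_SKETCH is a made-up flag,
and today migrate_vma_setup() would still hand back order-0 entries. The point
is just to show where an order-aware option could slot into the existing
migrate_vma_setup()/migrate_vma_pages()/migrate_vma_finalize() flow.

#include <linux/migrate.h>
#include <linux/slab.h>

/* Hypothetical: report large folios as single src/dst entries. */
#define MIGRATE_VMA_SELECT_COMPOUND_SKETCH	(1 << 3)

static int sketch_migrate_range_to_device(struct vm_area_struct *vma,
					  unsigned long start,
					  unsigned long end,
					  void *pgmap_owner)
{
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	unsigned long *src, *dst;
	struct migrate_vma args;
	int ret = -ENOMEM;

	src = kcalloc(npages, sizeof(*src), GFP_KERNEL);
	dst = kcalloc(npages, sizeof(*dst), GFP_KERNEL);
	if (!src || !dst)
		goto out;

	args = (struct migrate_vma) {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src,
		.dst		= dst,
		.pgmap_owner	= pgmap_owner,
		/* Today only order-0 entries come back; the extra flag is the sketch. */
		.flags		= MIGRATE_VMA_SELECT_SYSTEM |
				  MIGRATE_VMA_SELECT_COMPOUND_SKETCH,
	};

	ret = migrate_vma_setup(&args);
	if (ret)
		goto out;

	/*
	 * The driver would walk args.src here, allocating a large device
	 * folio where a compound entry was reported and order-0 device
	 * pages otherwise, then fill in args.dst before the copy. If no
	 * large device folio is available it could clear the flag and
	 * retry, forcing a split.
	 */

	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	ret = 0;
out:
	kfree(src);
	kfree(dst);
	return ret;
}

The awkward part is the return path, which is really question (2) above: when
migrating back to the CPU under memory fragmentation there may be no large
folio to migrate into, so either the core code or the driver needs a way to
fall back to splitting.
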
> >
> > [1] - https://lore.kernel.org/linux-mm/cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@xxxxxxxxxx/
> > [2] - https://lore.kernel.org/linux-mm/2612ac8a-d0a9-452b-a53d-75ffc6166224@xxxxxxxxxx/
> >
> > File-backed DEVICE_PRIVATE/COHERENT pages
> > =========================================
> >
> > Currently DEVICE_PRIVATE and DEVICE_COHERENT pages are only supported for
> > private anonymous memory. This prevents devices from having local access to
> > shared or file-backed mappings, forcing them to rely on remote DMA access,
> > which limits performance.
> >
> > I have been prototyping allowing ZONE_DEVICE pages in the page cache with
> > a callback when the CPU requires access.
> 
> Hmm, things like read/write/writeback get more tricky. How would you
> write back content from a ZONE_DEVICE folio? Likely that's not possible.

The general gist is somewhat analogous to what happens when the CPU faults on
a DEVICE_PRIVATE page. Except obviously it wouldn't be a fault; rather,
whenever something looked up the page-cache entry and found a DEVICE_PRIVATE
page, we would have a driver callback somewhat similar to migrate_to_ram()
that would copy the data back to normal system memory. IOW the CPU would
always own the page and could always get it back. A very rough sketch of what
the lookup-side hook could look like is at the end of this mail.

It has been a while since I last looked at this problem though (FS DAX
refcount cleanups took way longer than expected!), but I recall having this
at least somewhat working. I will see if I can get it cleaned up and posted
as an RFC soon.

> So I'm not sure if we want to go down that path; it will be great to learn
> about your approach and your findings.
> 
> [...]
> 
> There is a lot of interesting stuff in there; I assume too much for a single
> session :)

And probably way more than I can get done in a year :-)

> --
> Cheers,
> 
> David / dhildenb
> 
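
As a strawman for the file-backed DEVICE_PRIVATE idea above, the lookup-side
hook could look something like the below. This is only an illustrative sketch,
not the prototype: sketch_filemap_get_folio() and the copy_to_ram() op are
made-up names, and the actual replacement of the page-cache entry is elided.

#include <linux/memremap.h>
#include <linux/pagemap.h>

/* Hypothetical extra op, alongside the existing dev_pagemap_ops. */
struct sketch_pagemap_ops {
	/* Copy the device folio's contents into a system-memory folio. */
	int (*copy_to_ram)(struct folio *device_folio, struct folio *new_folio);
};

static struct folio *sketch_filemap_get_folio(struct address_space *mapping,
					      pgoff_t index)
{
	struct folio *folio = filemap_get_folio(mapping, index);

	if (IS_ERR(folio))
		return folio;

	if (folio_is_device_private(folio)) {
		/*
		 * The CPU (read/write, writeback, a CPU mapping, ...) needs
		 * the contents: allocate a system-memory folio, have the
		 * driver copy the data back via copy_to_ram(), swap the
		 * page-cache entry over and return the new folio instead.
		 */
	}

	return folio;
}

Writeback would go through the same path: the data is always copied back into
a normal system-memory folio first, so the ZONE_DEVICE folio itself is never
written back directly, which is what "the CPU always owns the page" is meant
to guarantee.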