[LSF/MM/BPF TOPIC] The future of ZONE_DEVICE pages

I have a few topics that I would like to discuss around ZONE_DEVICE pages
and their current and future usage in the kernel. Generally these pages are
used to represent various forms of device memory (PCIe BAR space, coherent
accelerator memory, persistent memory, unaddressable device memory). All
of these require special treatment by the core MM so many features must be
implemented specifically for ZONE_DEVICE pages.

I would like to get feedback on several ideas I've had for a while:

Large page migration for ZONE_DEVICE pages
==========================================

Currently large ZONE_DEVICE pages only exist for persistent memory use cases
(DAX, FS DAX). This involves a special reference counting scheme, which I hope to
have fixed[1] by the time of LSF/MM/BPF. Fixing this allows for other higher
order ZONE_DEVICE folios.

Specifically, I would like to introduce the possibility of migrating large CPU
folios to unaddressable (DEVICE_PRIVATE) or coherent (DEVICE_COHERENT) memory.
The current interfaces (migrate_vma) don't allow this as they require all folios
to be split first.
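
For reference, the existing migrate_vma flow that would need to grow large-folio
support looks roughly like this. The migrate_vma_* helpers, struct fields and
MIGRATE_PFN_* flags are the real interface (include/linux/migrate.h); the
surrounding driver logic (drv_alloc_device_page(), drv_data, NPAGES) is
invented purely for illustration:

```c
#define NPAGES 512	/* hypothetical batch size */

/* Driver-side sketch of the current per-pfn migrate_vma flow. Because the
 * src/dst arrays hold individual pfns, migrate_vma_setup() must split any
 * large folios in the range before migration can proceed.
 */
static int drv_migrate_to_device(struct vm_area_struct *vma,
				 unsigned long start, unsigned long end)
{
	unsigned long src_pfns[NPAGES], dst_pfns[NPAGES];
	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.pgmap_owner	= drv_data,		/* hypothetical */
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
	};
	int ret, i;

	ret = migrate_vma_setup(&args);	/* splits any large folios */
	if (ret)
		return ret;

	for (i = 0; i < args.npages; i++) {
		struct page *dpage;

		if (!(args.src[i] & MIGRATE_PFN_MIGRATE))
			continue;
		dpage = drv_alloc_device_page();	/* hypothetical */
		/* copy the data, then publish the destination pfn */
		args.dst[i] = migrate_pfn(page_to_pfn(dpage));
	}

	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	return 0;
}
```

A large-folio-aware interface would presumably need to express ranges or folio
orders in the src/dst encoding rather than one pfn per base page.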

Some of the issues are:

1. What should the interface look like?

These are non-LRU pages, so there is likely overlap with "non-lru page migration
in a memdesc world"[2]

2. How do we allow merging/splitting of pages during migration?

This is necessary because when migrating back from device memory there may not
be enough large CPU pages available.

3. Any other issues?

[1] - https://lore.kernel.org/linux-mm/cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@xxxxxxxxxx/
[2] - https://lore.kernel.org/linux-mm/2612ac8a-d0a9-452b-a53d-75ffc6166224@xxxxxxxxxx/

File-backed DEVICE_PRIVATE/COHERENT pages
=========================================

Currently DEVICE_PRIVATE and DEVICE_COHERENT pages are only supported for
private anonymous memory. This prevents devices from having local access to
shared or file-backed mappings, forcing them to rely on remote DMA access
instead, which limits performance.

I have been prototyping support for ZONE_DEVICE pages in the page cache with
a callback for when the CPU requires access. This approach seems promising and
relatively straightforward, but I would like some early feedback on either this
or alternative approaches that I should investigate.
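
One way such a callback could be wired up is as an extension of the existing
dev_pagemap_ops, alongside the migrate_to_ram() handler already used for
anonymous DEVICE_PRIVATE faults. The sketch below is purely illustrative: the
first two ops are the real, existing interface (include/linux/memremap.h), while
the cpu_access() name and signature are invented:

```c
/* Hypothetical extension of dev_pagemap_ops. page_free() and
 * migrate_to_ram() exist today; cpu_access() is an invented op
 * sketching how a file-backed variant might look.
 */
struct dev_pagemap_ops {
	void (*page_free)(struct page *page);			/* existing */
	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);	/* existing */

	/* Hypothetical: called when the CPU touches a file-backed
	 * ZONE_DEVICE folio in the page cache, so the driver can
	 * migrate or copy the data back to system memory. */
	vm_fault_t (*cpu_access)(struct folio *folio);
};
```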

Combining P2PDMA and DEVICE_PRIVATE pages
=========================================

Currently device memory that cannot be directly accessed by the CPU can be
represented by DEVICE_PRIVATE pages, allowing it to be mapped and treated like
a normal virtual page by userspace. Many devices also support accessing device
memory directly from the CPU via a PCIe BAR.

Such access requires a P2PDMA page, meaning there are potentially two struct
pages tracking the same piece of physical memory. This seems not only wasteful
but fraught - for example, device drivers need to keep the two page lifetimes in
sync. I would like to discuss ways of solving this.

DEVICE_PRIVATE pages, the linear map and the memdesc world
==========================================================

DEVICE_PRIVATE pages currently reside in the linear map such that pfn_to_page()
and page_to_pfn() work "as expected". However, this implies a contiguous range
of unused physical addresses needs to be both available and allocated for device
memory. Such a range isn't always available, particularly on ARM[1] where the
vmemmap region may not be large enough to accommodate the amount of device
memory.

However, it occurs to me that (almost?) all code paths that deal with
DEVICE_PRIVATE pages are already aware of this - in the case of page_to_pfn()
the page can be directly queried with is_device_private_page(), and in the case
of pfn_to_page() the pfn has (almost?) always been obtained from a special swap
entry indicating as much.
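
The swap-entry path referred to above already looks roughly like this in the
fault handlers. The helpers shown are real (include/linux/swapops.h); the
fragment is abbreviated and not standalone, and the pgmap access may differ by
kernel version:

```c
/* Abbreviated fault-path sketch: a DEVICE_PRIVATE pfn is recovered from a
 * special non-present swap entry, never from an arbitrary raw pfn.
 */
swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);

if (is_device_private_entry(entry)) {
	/* The struct page is reached via the swap entry, not via
	 * pfn_to_page() on an arbitrary pfn... */
	struct page *page = pfn_swap_entry_to_page(entry);

	/* ...so the owning pgmap's fault handler can be invoked
	 * directly (field access varies across kernel versions). */
	ret = page->pgmap->ops->migrate_to_ram(vmf);
}
```

This is what suggests the pfn_to_page() guarantee may be unnecessary for these
pages.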

So does page_to_pfn()/pfn_to_page() really need to work for DEVICE_PRIVATE
pages? If not, could we allocate the struct pages in a vmalloc array instead? Do
we even need ZONE_DEVICE pages/folios in a memdesc world?

[1] - https://lore.kernel.org/linux-arm-kernel/CAMj1kXHxyntweiq76CdW=ov2_CkEQUbdPekGNDtFp7rBCJJE2w@xxxxxxxxxxxxxx/

Other issues/ideas
==================

Are there any other clean-ups or features that people are interested in seeing?

 - Alistair



