Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes

On 2/9/2013 8:05 AM, Michel Lespinasse wrote:
> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel <raindel@xxxxxxxxxxxx> wrote:
>> Hi,
>>
>> We would like to present a reference implementation for safely sharing
>> memory pages from user space with the hardware, without pinning.
>>
>> We will be happy to hear the community feedback on our prototype
>> implementation, and suggestions for future improvements.
>>
>> We would also like to discuss adding features to the core MM subsystem to
>> assist hardware access to user memory without pinning.
> This sounds kinda scary TBH; however I do understand the need for such
> technology.

The technological challenges here are actually rather similar to the ones faced by
hypervisors that want to allow swapping of virtual machine memory. As a result, we
benefit greatly from the MMU notifiers implemented for KVM. Reading the CPU page
tables directly will be another level of challenge.
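
To make this concrete, here is a simplified sketch (not our actual patch) of how a driver
can hook MMU notifiers to keep a device translation table in sync with the CPU page
tables. The callback signatures follow the ~3.x kernel API; struct my_context and
my_invalidate_device_tlb() are hypothetical placeholders for driver-specific state and
hardware operations.

/*
 * Simplified sketch of hooking MMU notifiers to keep a device translation
 * table in sync with the CPU page tables.  Callback signatures follow the
 * ~3.x kernel API; struct my_context and my_invalidate_device_tlb() are
 * hypothetical driver-specific placeholders.
 */
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/mmu_notifier.h>

struct my_context {
        struct mmu_notifier mn;
        /* ... driver translation-table state ... */
};

/* Hypothetical hardware operation: drop the device's cached translations
 * for [start, end) and fence outstanding DMA to those pages. */
static void my_invalidate_device_tlb(struct my_context *ctx,
                                     unsigned long start, unsigned long end)
{
        /* device-specific commands go here */
}

static void my_invalidate_range_start(struct mmu_notifier *mn,
                                      struct mm_struct *mm,
                                      unsigned long start, unsigned long end)
{
        struct my_context *ctx = container_of(mn, struct my_context, mn);

        /*
         * The core MM is about to unmap/migrate/swap pages in [start, end);
         * the device must stop using them before this callback returns.
         */
        my_invalidate_device_tlb(ctx, start, end);
}

static const struct mmu_notifier_ops my_mn_ops = {
        .invalidate_range_start = my_invalidate_range_start,
};

int my_register_notifier(struct my_context *ctx)
{
        ctx->mn.ops = &my_mn_ops;
        return mmu_notifier_register(&ctx->mn, current->mm);
}
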
> I think one issue is that many MM developers are insufficiently aware
> of such developments; having a technology presentation would probably
> help there; but traditionally LSF/MM sessions are more interactive
> between developers who are already quite familiar with the technology.
> I think it would help if you could send in advance a detailed
> presentation of the problem and the proposed solutions (and then what
> they require of the MM layer) so people can be better prepared.

We hope to send out an RFC patch-set of the feature implementation for our hardware
soon, which might help to demonstrate a use case for the technology.

The current programming model for InfiniBand (and related network protocols - RoCE,
iWARP) relies on the user space program registering memory regions for use with the
hardware. Upon registration, the driver pins the memory area (get_user_pages), updates
a mapping table in the hardware, and provides the user application with a handle for
the mapping. The user space application then uses this handle to request that the
hardware access this area for network IO.
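
For reference, the registration flow in user space today looks roughly like the
following libibverbs sketch (simplified; it assumes an already-allocated protection
domain and omits error handling):

/*
 * Simplified sketch of the current registration model with libibverbs.
 * Assumes the application has already opened a device and allocated a
 * protection domain (pd); error handling is omitted.
 */
#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len, void **buf_out)
{
        void *buf = malloc(len);

        /*
         * The driver pins the pages (get_user_pages) and programs the HCA
         * translation table; the returned memory region carries the handle
         * (lkey/rkey) the application uses from now on.
         */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ);

        *buf_out = buf;
        return mr;
}

/* The handle is what goes into the scatter/gather entries of work requests: */
void fill_sge(struct ibv_sge *sge, struct ibv_mr *mr, void *buf, uint32_t len)
{
        sge->addr   = (uintptr_t)buf;
        sge->length = len;
        sge->lkey   = mr->lkey;
}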

While achieving unbeatable IO performance (round-trip latency of less than 2
microseconds for user space programs, and bandwidth of 56 Gbit/second), this model is
relatively hard to use:

- The need for explicit memory registration of each area makes the API rather
  complex to use. An ideal API would provide a single handle per process, allowing it
  to communicate with the hardware using the process's virtual addresses.

- After a part of the address space has been registered, the application must be
  careful not to move the pages around. For example, doing a fork results in all of
  the memory registrations pointing to the wrong pages (which is very hard to debug).
  This was partially addressed in [1], but the cure is nearly as bad as the disease - when
  MADV_DONTFORK is used on the heap, a simple call to malloc in the child process
  might crash it (a sketch of this workaround follows the list below).

- Memory which was registered is not swappable. As a result, one cannot write
  applications that overcommit physical memory while using this API. Similarly to
  what Jerome described about GPU applications, for network access the application
  might want to use only ~10% of its allocated memory space, but it is required to
  either pin all of the memory, use heuristics to predict which memory will be used,
  or perform expensive copying/pinning for every network transaction. None of these
  options is optimal.
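
As referenced above, here is roughly what the MADV_DONTFORK workaround looks like in
application code today (a simplified sketch; mmap() is used instead of malloc() only
to get a page-aligned range, and error handling is omitted):

/*
 * Simplified sketch of the MADV_DONTFORK workaround described above.
 * mmap() is used instead of malloc() only because madvise() needs a
 * page-aligned range; error handling is omitted.
 */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_nofork(struct ibv_pd *pd, size_t len)
{
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);

        /*
         * Keep the registered pages out of any child created by fork(), so
         * copy-on-write cannot silently leave the registration pointing at
         * the wrong physical pages.  The downside mentioned above: if this
         * range overlaps memory the child actually needs (e.g. the heap),
         * the child will fault when it touches it.
         */
        madvise(buf, len, MADV_DONTFORK);

        return mr;
}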

> And first I'd like to ask, aren't IOMMUs supposed to already largely
> solve this problem ? (probably a dumb question, but that just tells
> you how much you need to explain :)


IOMMU v1 doesn't solve this problem, as it gives you only one mapping table per
PCI function. If you want ~64 processes on your machine to be able to access the
network, this is not nearly enough. It helps in implementing PCI pass-through for
virtualized guests (with the hardware devices exposing several virtual PCI functions
for the guests), but that is still not enough for user space applications.

To some extent, IOMMU v1 might even be an obstacle to implementing such a feature,
as it prevents PCI devices from accessing parts of the memory, requiring driver
intervention for every page fault, even if the page is already in memory.

IOMMU v2 [2] is a step in the same direction that we are moving in, offering
PASID - a unique identifier attached to each transaction the device performs, allowing
the transaction to be associated with a specific process. However, the challenges there
are similar to those we encounter when using an address translation table on the
PCI device itself (NIC/GPU).
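
For completeness, the kind of kernel API this implies already exists for AMD's IOMMUv2
driver; a rough sketch of binding a PASID to a process follows. The function names are
those of the amd_iommu_v2 driver described in [2], but the exact signatures vary between
kernel versions, so treat this purely as an illustration, not as our implementation:

/*
 * Rough sketch of the PASID model using the amd_iommu_v2 kernel API
 * described in [2].  Exact signatures differ between kernel versions;
 * illustration only.
 */
#include <linux/pci.h>
#include <linux/sched.h>
#include <linux/amd-iommu.h>

#define MY_MAX_PASIDS   16      /* hypothetical per-device limit */

int my_attach_current_process(struct pci_dev *pdev, int pasid)
{
        /* Done once per device in real code: declare how many PASIDs
         * the device will use. */
        int ret = amd_iommu_init_device(pdev, MY_MAX_PASIDS);
        if (ret)
                return ret;

        /*
         * Bind the PASID to the current task.  DMA transactions tagged with
         * this PASID are translated through the process page tables, so a
         * device access to a non-present page becomes a normal page fault
         * instead of requiring the memory to be pinned up front.
         */
        return amd_iommu_bind_pasid(pdev, pasid, current);
}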

References:

1. MADV_DONTFORK - http://lwn.net/Articles/171956/
2. AMD IOMMU v2 - http://www.linux-kvm.org/wiki/images/b/b1/2011-forum-amd-iommuv2-kvm.pdf

