Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes

On 2/9/2013 8:05 AM, Michel Lespinasse wrote:
> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel <raindel@xxxxxxxxxxxx> wrote:
>> Hi,
>>
>> We would like to present a reference implementation for safely sharing
>> memory pages from user space with the hardware, without pinning.
>>
>> We will be happy to hear the community feedback on our prototype
>> implementation, and suggestions for future improvements.
>>
>> We would also like to discuss adding features to the core MM subsystem to
>> assist hardware access to user memory without pinning.
> This sounds kinda scary TBH; however I do understand the need for such
> technology.

The technological challenges here are actually rather similar to the ones faced by
hypervisors that want to allow swapping of virtual machine memory. As a result, we
benefit greatly from the MMU notifiers implemented for KVM. Reading the CPU page
tables directly will be another level of challenge.
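
To make this concrete, here is a simplified sketch (not our actual patch) of how a driver
can hook MMU notifiers to keep a device translation table in sync with the CPU page
tables. The callback signatures follow the ~3.x kernel API; struct my_context and
my_invalidate_device_tlb() are hypothetical placeholders for driver-specific state and
hardware operations.

/*
 * Simplified sketch of hooking MMU notifiers to keep a device translation
 * table in sync with the CPU page tables.  Callback signatures follow the
 * ~3.x kernel API; struct my_context and my_invalidate_device_tlb() are
 * hypothetical driver-specific placeholders.
 */
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/mmu_notifier.h>

struct my_context {
        struct mmu_notifier mn;
        /* ... driver translation-table state ... */
};

/* Hypothetical hardware operation: drop the device's cached translations
 * for [start, end) and fence outstanding DMA to those pages. */
static void my_invalidate_device_tlb(struct my_context *ctx,
                                     unsigned long start, unsigned long end)
{
        /* device-specific commands go here */
}

static void my_invalidate_range_start(struct mmu_notifier *mn,
                                      struct mm_struct *mm,
                                      unsigned long start, unsigned long end)
{
        struct my_context *ctx = container_of(mn, struct my_context, mn);

        /*
         * The core MM is about to unmap/migrate/swap pages in [start, end);
         * the device must stop using them before this callback returns.
         */
        my_invalidate_device_tlb(ctx, start, end);
}

static const struct mmu_notifier_ops my_mn_ops = {
        .invalidate_range_start = my_invalidate_range_start,
};

int my_register_notifier(struct my_context *ctx)
{
        ctx->mn.ops = &my_mn_ops;
        return mmu_notifier_register(&ctx->mn, current->mm);
}
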
> I think one issue is that many MM developers are insufficiently aware
> of such developments; having a technology presentation would probably
> help there; but traditionally LSF/MM sessions are more interactive
> between developers who are already quite familiar with the technology.
> I think it would help if you could send in advance a detailed
> presentation of the problem and the proposed solutions (and then what
> they require of the MM layer) so people can be better prepared.

We hope to send out an RFC patch-set of the feature implementation for our hardware
soon, which might help to demonstrate a use case for the technology.

The current programming model for InfiniBand (and related network protocols - RoCE,
iWARP) relies on the user space program registering memory regions for use with the
hardware. Upon registration, the driver pins the memory area (get_user_pages), updates
a mapping table in the hardware, and provides the user application with a handle for
the mapping. The user space application then uses this handle to request that the
hardware access this area for network IO.
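
For reference, the registration flow in user space today looks roughly like the
following libibverbs sketch (simplified; it assumes an already-allocated protection
domain and omits error handling):

/*
 * Simplified sketch of the current registration model with libibverbs.
 * Assumes the application has already opened a device and allocated a
 * protection domain (pd); error handling is omitted.
 */
#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len, void **buf_out)
{
        void *buf = malloc(len);

        /*
         * The driver pins the pages (get_user_pages) and programs the HCA
         * translation table; the returned memory region carries the handle
         * (lkey/rkey) the application uses from now on.
         */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ);

        *buf_out = buf;
        return mr;
}

/* The handle is what goes into the scatter/gather entries of work requests: */
void fill_sge(struct ibv_sge *sge, struct ibv_mr *mr, void *buf, uint32_t len)
{
        sge->addr   = (uintptr_t)buf;
        sge->length = len;
        sge->lkey   = mr->lkey;
}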

While achieving unbeatable IO performance (round-trip latency of less than 2
microseconds for user space programs, and bandwidth of 56 Gbit/second), this model is
relatively hard to use:

- The need for explicit memory registration of each area makes the API rather
  complex to use. An ideal API would provide a single handle per process, allowing it
  to communicate with the hardware using the process's virtual addresses.

- After a part of the address space has been registered, the application must be
  careful not to move the pages around. For example, doing a fork results in all of
  the memory registrations pointing to the wrong pages (which is very hard to debug).
  This was partially addressed in [1], but the cure is nearly as bad as the disease - when
  MADV_DONTFORK is used on the heap, a simple call to malloc in the child process
  might crash it (a sketch of this workaround follows the list below).

- Memory which was registered is not swappable. As a result, one cannot write
  applications that overcommit physical memory while using this API. Similarly to
  what Jerome described about GPU applications, for network access the application
  might want to use only ~10% of its allocated memory space, but it is required to
  either pin all of the memory, use heuristics to predict which memory will be used,
  or perform expensive copying/pinning for every network transaction. None of these
  options is optimal.
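
As referenced above, here is roughly what the MADV_DONTFORK workaround looks like in
application code today (a simplified sketch; mmap() is used instead of malloc() only
to get a page-aligned range, and error handling is omitted):

/*
 * Simplified sketch of the MADV_DONTFORK workaround described above.
 * mmap() is used instead of malloc() only because madvise() needs a
 * page-aligned range; error handling is omitted.
 */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_nofork(struct ibv_pd *pd, size_t len)
{
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);

        /*
         * Keep the registered pages out of any child created by fork(), so
         * copy-on-write cannot silently leave the registration pointing at
         * the wrong physical pages.  The downside mentioned above: if this
         * range overlaps memory the child actually needs (e.g. the heap),
         * the child will fault when it touches it.
         */
        madvise(buf, len, MADV_DONTFORK);

        return mr;
}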

> And first I'd like to ask, aren't IOMMUs supposed to already largely
> solve this problem ? (probably a dumb question, but that just tells
> you how much you need to explain :)


IOMMU v1 doesn't solve this problem, as it gives you only one mapping table per
PCI function. If you want ~64 processes on your machine to be able to access the
network, this is not nearly enough. It helps in implementing PCI pass-through for
virtualized guests (with the hardware devices exposing several virtual PCI functions
for the guests), but that is still not enough for user space applications.

To some extent, IOMMU v1 might even be an obstacle to implementing such a feature,
as it prevents PCI devices from accessing parts of the memory, requiring driver
intervention for every page fault, even if the page is already in memory.

IOMMU v2 [2] is a step in the same direction that we are moving in, offering
PASID - a unique identifier attached to each transaction the device performs, allowing
the transaction to be associated with a specific process. However, the challenges there
are similar to those we encounter when using an address translation table on the
PCI device itself (NIC/GPU).
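
For completeness, the kind of kernel API this implies already exists for AMD's IOMMUv2
driver; a rough sketch of binding a PASID to a process follows. The function names are
those of the amd_iommu_v2 driver described in [2], but the exact signatures vary between
kernel versions, so treat this purely as an illustration, not as our implementation:

/*
 * Rough sketch of the PASID model using the amd_iommu_v2 kernel API
 * described in [2].  Exact signatures differ between kernel versions;
 * illustration only.
 */
#include <linux/pci.h>
#include <linux/sched.h>
#include <linux/amd-iommu.h>

#define MY_MAX_PASIDS   16      /* hypothetical per-device limit */

int my_attach_current_process(struct pci_dev *pdev, int pasid)
{
        /* Done once per device in real code: declare how many PASIDs
         * the device will use. */
        int ret = amd_iommu_init_device(pdev, MY_MAX_PASIDS);
        if (ret)
                return ret;

        /*
         * Bind the PASID to the current task.  DMA transactions tagged with
         * this PASID are translated through the process page tables, so a
         * device access to a non-present page becomes a normal page fault
         * instead of requiring the memory to be pinned up front.
         */
        return amd_iommu_bind_pasid(pdev, pasid, current);
}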

References:

1. MADV_DONTFORK - http://lwn.net/Articles/171956/
2. AMD IOMMU v2 - http://www.linux-kvm.org/wiki/images/b/b1/2011-forum-amd-iommuv2-kvm.pdf

