On 2/9/2013 8:05 AM, Michel Lespinasse wrote:
> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel <raindel@xxxxxxxxxxxx> wrote:
>> Hi,
>>
>> We would like to present a reference implementation for safely sharing
>> memory pages from user space with the hardware, without pinning. We will
>> be happy to hear the community feedback on our prototype implementation,
>> and suggestions for future improvements. We would also like to discuss
>> adding features to the core MM subsystem to assist hardware access to
>> user memory without pinning.
>
> This sounds kinda scary TBH; however I do understand the need for such
> technology.

The technological challenges here are actually rather similar to the ones
experienced by hypervisors that want to allow swapping of virtual machines.
As a result, we benefit greatly from the mmu notifiers implemented for KVM.
Reading the page table directly will be another level of challenge.

> I think one issue is that many MM developers are insufficiently aware of
> such developments; having a technology presentation would probably help
> there; but traditionally LSF/MM sessions are more interactive between
> developers who are already quite familiar with the technology. I think it
> would help if you could send in advance a detailed presentation of the
> problem and the proposed solutions (and then what they require of the MM
> layer) so people can be better prepared.

We hope to send out an RFC patch-set of the feature implementation for our
hardware soon, which might help to demonstrate a use case for the
technology.

The current programming model for InfiniBand (and related network protocols
- RoCE, iWARP) relies on the user space program registering memory regions
for use with the hardware. Upon registration, the driver performs pinning
(get_user_pages) of the memory area, updates a mapping table in the
hardware, and provides the user application with a handle for the mapping.
The user space application then uses this handle to request that the
hardware access this area for network IO (a minimal sketch of this
registration step is included below).

While achieving unbeatable IO performance (round-trip latency, for user
space programs, of less than 2 microseconds, and bandwidth of 56
Gbit/second), this model is relatively hard to use:

- The need for explicit memory registration for each area makes the API
  rather complex to use. An ideal API would have a single handle per
  process, allowing it to communicate with the hardware using the
  process's virtual addresses.

- After a part of the address space has been registered, the application
  must be careful not to move the pages around. For example, doing a fork
  results in all of the memory registrations pointing to the wrong pages
  (which is very hard to debug). This was partially addressed at [1], but
  the cure is nearly as bad as the disease - when MADV_DONTFORK is used on
  the heap, a simple call to malloc in the child process might crash the
  process.

- Memory which was registered is not swappable. As a result, one cannot
  write applications that overcommit physical memory while using this API.
  Similarly to what Jerome described about GPU applications, for network
  access the application might want to use only ~10% of its allocated
  memory space, but it is required to either pin all of the memory, use
  heuristics to predict what memory will be used, or perform expensive
  copying/pinning for every network transaction. All of these are
  non-optimal.
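To make the registration step above concrete, here is a minimal sketch of
what every buffer currently goes through with libibverbs (device and
protection-domain setup and error reporting are omitted, and the helper
name is only for illustration):

  #include <stdlib.h>
  #include <infiniband/verbs.h>

  /* Illustrative helper: register a freshly allocated buffer with the HCA.
   * ibv_reg_mr() is where the kernel driver pins the pages today
   * (get_user_pages) and programs the device translation table. */
  static struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
  {
          void *buf = malloc(len);
          struct ibv_mr *mr;

          if (!buf)
                  return NULL;

          mr = ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
          if (!mr) {
                  free(buf);
                  return NULL;
          }

          /* mr->lkey / mr->rkey are the handles later placed in work
           * requests so the hardware can access this area. */
          return mr;
  }

With the approach we are proposing, the pinning behind this call would go
away, and ideally the explicit per-buffer registration as well (the single
handle per process mentioned above).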
> And first I'd like to ask, aren't IOMMUs supposed to already largely
> solve this problem ?
> (probably a dumb question, but that just tells you how much you need to
> explain :)

IOMMU v1 doesn't solve this problem, as it gives you only one mapping table
per PCI function. If you want ~64 processes on your machine to be able to
access the network, this is not nearly enough. It helps in implementing PCI
pass-through for virtualized guests (with the hardware devices exposing
several virtual PCI functions for the guests), but that is still not enough
for user space applications. To some extent, IOMMU v1 might even be an
obstacle to implementing such a feature, as it prevents PCI devices from
accessing parts of the memory, requiring driver intervention for every page
fault, even if the page is in memory.

IOMMU v2 [2] is a step in the same direction that we are moving towards,
offering PASID - a unique identifier for each transaction that the device
performs, allowing the transaction to be associated with a specific
process. However, the challenges there are similar to those we encounter
when using an address translation table on the PCI device itself (NIC/GPU).

References:
1. MADVISE_DONTFORK - http://lwn.net/Articles/171956/
2. AMD IOMMU v2 - http://www.linux-kvm.org/wiki/images/b/b1/2011-forum-amd-iommuv2-kvm.pdf
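P.S. Since the mmu notifiers came up above: the hook we build on is the
existing interface in <linux/mmu_notifier.h>. A rough sketch of how a
driver can register for invalidations looks like the following (the odp_*
names are made up for illustration, the callback set is reduced to the
essential one, and the actual patch-set may of course look different):

  #include <linux/kernel.h>
  #include <linux/mm.h>
  #include <linux/mmu_notifier.h>

  struct odp_context {
          struct mmu_notifier mn;
          /* per-process device MMU state would live here */
  };

  /* Called by the core MM before a range of PTEs is invalidated
   * (unmap, swap-out, migration, ...). */
  static void odp_invalidate_range_start(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long start,
                                         unsigned long end)
  {
          /* container_of(mn, struct odp_context, mn) gives the per-process
           * state; the driver would shoot down its device mappings covering
           * [start, end) and fence in-flight DMA, so that later device
           * access faults instead of hitting a stale page. */
  }

  static const struct mmu_notifier_ops odp_mmu_notifier_ops = {
          .invalidate_range_start = odp_invalidate_range_start,
  };

  static int odp_register(struct odp_context *ctx, struct mm_struct *mm)
  {
          ctx->mn.ops = &odp_mmu_notifier_ops;
          return mmu_notifier_register(&ctx->mn, mm);
  }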