Re: Need help with the mmu notifiers patches.

Thank you for sharing; that's enough material to last me a week of
research. Thanks again :-).

On Tue, Jul 29, 2008 at 11:40 PM, Cédric Augonnet
<cedric.augonnet@xxxxxxxxx> wrote:
> Ah, at least we got them included in Linus' tree :)
>
> Sorry for the delayed answer ...
>
> 2008/7/28 Peter Teoh <htmldeveloper@xxxxxxxxx>:
>> On Sun, Jul 27, 2008 at 9:24 PM, Cédric Augonnet
>> <cedric.augonnet@xxxxxxxxx> wrote:
> [...]
>>>
>>> Hi,
>>>
>>> High-performance networks are also a perfect target. Given the need
>>> to reduce CPU usage and to avoid useless, costly copies, high-speed
>>> network interface cards (NICs) DMA data directly from or to
>>> user-space memory (this is usually called a "zero-copy" protocol, and
>>> RDMA is one form of zero-copy protocol).
>>>
>>> Imagine that a process wants to send some data to a network card:
>>> from the application's perspective, this data is described, for
>>> instance, as a virtual address interval (start virtual address +
>>> length). Unfortunately, most NICs cannot manipulate virtual
>>> addresses, only addresses in the PCI space.
>>>
>>> So in order to send a message, the network library has to do
>>> the following:
>>> 1 - Do a system call into the NIC driver.
>>> 2 - For each page in the virtual address range, perform a
>>> get_user_pages(), which "pins" the memory so that it cannot be
>>> swapped out, for instance, and lets the driver translate the virtual
>>> address into a PCI address the NIC can manipulate. This is also
>>> called memory registration or memory pinning.
>>> 3 - Give the NIC the list of physical addresses (in PCI space) so
>>> that it may DMA all the data directly from the application address
>>> space.
>>> 4 - When the communication is finished, the driver puts all the
>>> pages back so that they can be reclaimed if needed.
>>>
>>> Now imagine that, for some reason, an application always reuses the
>>> same buffers (this is rather likely). Continuously registering and
>>> unregistering the very same buffer is a serious overhead we cannot
>>> afford in the context of high-performance networking.
>>>
>>> To avoid that, various people started using a "registration cache"
>>> to avoid useless memory (de)registration. The last n buffers are kept
>>> pinned, so when the NIC needs one of those buffers again, it already
>>> has a virtual-to-physical address translation, which greatly reduces
>>> the system overhead for that communication.
>>>
>>> Unfortunately, imagine the following scenario: process P maps file F1
>>> into buffer B, P sends B over the network (B is still pinned), P unmaps
>>
>> Sorry, I don't understand this part... pinning the buffer and
>> unmapping are done in sync, right?
>> In other words, pinning is done in the kernel - the VM layer (or your
>> driver) - and then each process will individually map and unmap
>> independently, in userspace.
>> So if you unmap the memory, the memory is immediately gone - no
>> longer in the process's VMA linked list (inside the kernel) - so when
>> the swapper goes through the VMA linked list, there is nothing for it
>> to consider swapping out, right? So what is the meaning of a buffer
>> in a "pinned" state here?
>>
>
> To make things simpler, let's consider a message that fits in a single
> page of the process address space. The pinning operation just does a
> get_user_pages() on that page: this increments the refcount on the
> page structure and lets us translate the page's virtual address into a
> physical address.
> As long as we hold such a reference, the page will not be swapped out
> under memory pressure. However, if the VMA is remapped by the
> application, that reference no longer matches the mapping: the new
> struct page backing the same virtual address will not be the one that
> was cached.
>
> So here, pinning means that we hold a reference to the page and that
> the virtual-to-physical address translation is kept stored in the
> user-space driver. Once a buffer is pinned, you just tell the NIC to
> DMA the corresponding pages at the physical addresses you
> "remembered".
>
>>> F1 and maps F2 into B. P sends B again.
>>> The NIC then assumes the physical addresses of B are still
>>> valid: it may therefore actually send the content of F1, or more
>>> likely junk... which you absolutely *never* want, of course.
>>>
>>> So, would you drop zero-copy protocols? That would be a real waste.
>>> Instead, we need a way to monitor changes in the virtual memory
>>> mapping so that the networking library can invalidate the buffers
>>> when needed. There are currently various approaches to that problem:
>>> For glibc users (that's already a restriction), just use the glibc
>>> hooks: when you call a remap, a free or anything that *may* change
>>> the mapping, you first call a hook that keeps the cache valid.
>>
>> OK... so this means that the unmapping operation is not the normal
>> unmapping but a modified one: the buffer (a userspace virtual address)
>> is unlinked from process P's VMA linked list (in the kernel), but the
>> content is still there, and nobody owns it. The next time process P
>> tries to map it again, it will reuse the same buffer, and because the
>> TLB has not been flushed, the translation will be the same, i.e. the
>> virtual address will access the same physical content again. Is that
>> correct?
>>
>> But the problem is: how does glibc's remap operation know which cached
>> buffer to reuse in its remapping operation? There could have been many
>> unmapping operations before, each of which would generate a cached
>> copy of the buffer without really doing any unmapping, right?
>
> This is actually at a much higher level than the TLB and so on; think
> of that cache as something much simpler: it is just a list of pinned
> buffers. When you send a buffer, you first scan the list to see whether
> the cache has an entry covering the given address range. If so, just
> send the corresponding pages without asking the driver to translate
> the addresses; if not, register the buffer in the cache and send it as
> well.
>
> So, here is a sketch of the _user-space_ function of the network
> library that would tell the NIC which pages to DMA:
>
> msg_send(addr, len)
> {
>   for each buffer in the list of pinned buffers {
>      if [addr, addr + len[ is included in the buffer {
>         /* use the cache */
>         send the list of physical pages to the NIC
>         return;
>      }
>   }
>
>   /* the buffer was not in the cache */
>   pin the buffer
>   add the buffer and its memory translation (the list of physical
>   page addresses) to the cache
>   send the list of physical pages to the NIC
> }
>
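> To make the sketch a little more concrete, here is a hedged C version
> of the same registration cache, with error handling omitted.
> drv_pin_and_translate() and nic_send_pages() are hypothetical stand-ins
> for the real driver and NIC interfaces (the pinning itself would be the
> get_user_pages() path shown earlier, on the kernel side):
>
> #include <stddef.h>
> #include <stdint.h>
> #include <stdlib.h>
>
> struct reg_entry {
>         char             *addr;      /* start of the pinned buffer */
>         size_t            len;
>         uint64_t         *phys;      /* cached physical page addresses */
>         int               nr_pages;
>         struct reg_entry *next;
> };
>
> static struct reg_entry *reg_cache;  /* the "pin-down" cache */
>
> /* Hypothetical helpers standing in for the driver/NIC interface. */
> int  drv_pin_and_translate(void *addr, size_t len,
>                            uint64_t **phys, int *nr_pages);
> void nic_send_pages(const uint64_t *phys, int nr_pages, size_t len);
>
> void msg_send(void *addr, size_t len)
> {
>         struct reg_entry *e;
>
>         /* Cache hit: reuse the stored virtual-to-physical translation. */
>         for (e = reg_cache; e; e = e->next) {
>                 if ((char *)addr >= e->addr &&
>                     (char *)addr + len <= e->addr + e->len) {
>                         nic_send_pages(e->phys, e->nr_pages, len);
>                         return;
>                 }
>         }
>
>         /* Cache miss: pin the buffer once, remember the translation. */
>         e = malloc(sizeof(*e));
>         e->addr = addr;
>         e->len  = len;
>         drv_pin_and_translate(addr, len, &e->phys, &e->nr_pages);
>         e->next = reg_cache;
>         reg_cache = e;
>
>         nic_send_pages(e->phys, e->nr_pages, len);
> }
>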
> The glibc hooks are just wrappers around the usual calls; for instance,
> here is what the munmap hook could look like:
>
> munmap_hook (addr, len, ...)
> {
>   for each buffer in the list of pinned buffers {
>      if [addr, addr + len[ intersects the buffer
>          invalidate the buffer (drop it from the list and unpin it)
>   }
>
>   munmap(addr, len, ...);
> }
>
> So the hook eventually calls the usual munmap function, but we first
> scan the cache (which is a simple list) to evict any buffer
> potentially invalidated by the munmap operation.
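>
> In practice, since glibc does not export a dedicated munmap hook, one
> hedged way to implement the wrapper above is plain symbol interposition
> (e.g. through LD_PRELOAD); cache_invalidate_range() is a hypothetical
> function of the network library that drops and unpins any cached buffer
> overlapping the range:
>
> #define _GNU_SOURCE
> #include <dlfcn.h>
> #include <sys/mman.h>
>
> /* Hypothetical: evict cached/pinned buffers intersecting the range. */
> extern void cache_invalidate_range(void *addr, size_t len);
>
> int munmap(void *addr, size_t len)
> {
>         static int (*real_munmap)(void *, size_t);
>
>         if (!real_munmap)
>                 real_munmap = (int (*)(void *, size_t))
>                         dlsym(RTLD_NEXT, "munmap");
>
>         /* Evict any pinned buffer that intersects [addr, addr + len[. */
>         cache_invalidate_range(addr, len);
>
>         return real_munmap(addr, len);
> }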
>
>>> This is often not possible, as some applications are already
>>> statically linked with the glibc, so you cannot use those hooks
>>> afterwards. Also, some applications already use such hooks
>>> themselves, so the network library cannot overwrite them... And
>>> imagine you have several network drivers with the same need for hooks
>>> (a process using both IB and myri10g, for instance).
>>>
>>> For all those reasons, we do need a facility to monitor changes in
>>> the virtual memory mapping in the context of high-performance
>>> networking, or even for GPGPUs or any device which heavily relies on
>>> DMA transfers between the device and user space.
>>>
>>> I was mentioning heterogeneous networks (e.g. myri10g + IB + ...),
>>> but in the GPGPU era you have GPGPU + NICs possibly used by the same
>>> process in the case of a GPU-enabled cluster.
>>>
>>> Sorry for such a lengthy mail; I hope it was not too unclear and that
>>> you are now convinced we DO want the MMU notifiers :)
>>>
>>
>> I appreciate the sharing....
>
> That must be the intent of such a mailing list, after all :)
>
>>
>>> Cheers,
>>> Cédric
>>>
>>
>> Are there any references I can read to learn more about this area?
>> I think I must be wrong somewhere, as all of this sounds very new to me.
>>
>
> Well, even if these are rather simple concepts after all (even though
> it may not look so given my explanations), that's a rather common
> problem, I guess... I've personally been spending some time on this
> amusing problem, but apart from networking people there must not be
> many users of that technique.
>
> Historically, the idea of maintaining a "pin-down" cache was first
> introduced by Tezuka et al.
> "Pin-down Cache: A Virtual Memory Management Technique for Zero-copy
> Communication" (1998)
> http://www.pccluster.org/score/papers/tezuka-ipps98.ps.gz
>
> Brice Goglin did some nice things about distributed file systems on
> Myrinet networks during his PhD.
> Since he was working in the kernel, where there is no such thing as
> glibc hooks, he wrote a "VMA spy" patch whose intent was very similar
> to the MMU notifiers: when a monitored VMA is modified, the cache is
> scanned to evict possibly invalid buffers.
> "An Efficient Network API for in-Kernel Applications in Clusters" (2005)
> http://hal.inria.fr/inria-00070445
>
> Personally, I tried to extend that work to the Myrinet eXpress (MX)
> library at the user level: the user process puts "hooks" on VMAs, and
> when a VMA is modified (for instance when a file is unmapped), the
> hook invalidates the cache at the driver level in the kernel, so that
> the user-level library can detect it later on.
> "Interval-based registration cache for zero-copy protocols" (2007)
> http://runtime.bordeaux.inria.fr/augonnet/myricom/Rapport-StageM1.pdf
>
> Of course, Myrinet people are not the only ones using that technique;
> there is similar work for other networks, see Wyckoff's paper on
> InfiniBand:
> "Memory registration caching correctness" (2005)
> http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf
>
> It is also interesting to note that Quadrics NICs maintain a copy of
> the process page table in the NIC memory; keeping that table
> consistent also required an MMU-notifiers-like patch, which was
> called "IOProc":
> http://lkml.org/lkml/2005/4/26/198
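>
> For reference, here is a hedged sketch of how such a driver could plug
> into the mmu_notifier API that has just been merged: register a
> notifier on the process mm and evict cached translations whenever part
> of the address space is invalidated. cache_evict_range() is
> hypothetical, and real code would embed the notifier in a per-mm
> driver structure:
>
> #include <linux/mmu_notifier.h>
> #include <linux/sched.h>
>
> /* Hypothetical: drop and unpin cached buffers in [start, end[. */
> extern void cache_evict_range(unsigned long start, unsigned long end);
>
> static void drv_invalidate_range_start(struct mmu_notifier *mn,
>                                        struct mm_struct *mm,
>                                        unsigned long start,
>                                        unsigned long end)
> {
>         cache_evict_range(start, end);
> }
>
> static const struct mmu_notifier_ops drv_mmu_ops = {
>         .invalidate_range_start = drv_invalidate_range_start,
> };
>
> static struct mmu_notifier drv_mn = { .ops = &drv_mmu_ops };
>
> /* Called once for the process to monitor, e.g. at device open time. */
> static int drv_watch_current_mm(void)
> {
>         return mmu_notifier_register(&drv_mn, current->mm);
> }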
>
> Who said there was nobody asking for an MMU-notifiers facility apart
> from virtualization? :) I'm sure there are many others, not only in
> networking!
>
> Cheers,
> Cédric
>



-- 
Regards,
Peter Teoh



