Re: Need help with the mmu notifiers patches.

Ahh, at least we got them included in Linus' tree :)

Sorry for the delayed answer ...

2008/7/28 Peter Teoh <htmldeveloper@xxxxxxxxx>:
> On Sun, Jul 27, 2008 at 9:24 PM, Cédric Augonnet
> <cedric.augonnet@xxxxxxxxx> wrote:
[...]
>>
>> Hi,
>>
>> High performance networks are also a perfect target. Given the need
>> to reduce CPU usage and to avoid useless, costly copies, high speed
>> network interface cards (NICs) DMA data directly from or to
>> user-space memory (these are usually called "zero-copy" protocols, and
>> RDMA is one form of zero-copy protocol).
>>
>> Imagine that a process wants to send some data to a network card:
>> from the application's perspective, this data is for instance described
>> as a virtual address interval (start virtual address + length).
>> Unfortunately, most NICs cannot manipulate virtual addresses, only
>> addresses in the PCI space.
>>
>> So in order to send a message, the network library would have to do
>> the following:
>> 1 - Do a system call into the NIC driver.
>> 2 - For each page in the virtual address range, perform a "get_user_page",
>> which translates the virtual address into a PCI address the NIC can
>> manipulate. get_user_page will also "pin" the memory so that it cannot
>> be swapped out, for instance. This is also called memory registration
>> or memory pinning.
>> 3 - The NIC is given the list of physical addresses (in PCI space)
>> so that it may DMA all the data directly from the application address
>> space.
>> 4 - When the communication is finished, the driver puts all the pages
>> back so that they can be reclaimed if needed.
>>
>> Now imagine that, for some reason, an application always reuses the
>> same buffers (this is rather likely). Continuously registering and
>> unregistering the very same buffer is a serious overhead we cannot
>> cope with in the context of high performance networking.
>>
>> To avoid that, various people started using a 'registration cache' to
>> avoid useless memory (de)registration. The last n buffers are kept
>> pinned, so when the NIC needs those buffers again, it already has a
>> virtual-to-physical address translation, which greatly reduces the
>> system overhead for that communication.
>>
>> Unfortunately, imagine the following scenario: process P maps file F1
>> in buffer B, P sends B over the network (B is still pinned), P unmaps
>
> Sorry... I don't understand this part... pinning the buffer and
> unmapping are done in sync, right?
> In other words, pinning is done in the kernel - VM layer (or your driver) -
> and then each process will individually map and unmap
> independently, in userspace.
> So if you unmap the memory, the memory is immediately gone - no
> longer in the process's VMA linked list (inside the kernel) - so when the
> swapper goes through the VMA linked list, there is nothing for it to
> consider whether to swap out or not, right? So what is the meaning of a
> buffer in "pinned" state here?
>

To make things simpler, let's consider a message that fits in a single
page of the process address space. The pinning operation just does
a get_user_page on that page: this increments the refcount on the
page structure and translates the page's virtual address into a
physical address.
As long as such a reference is held, the page will not be swapped out
under memory pressure. However, if the VMA is remapped by the
application, that virtual address no longer points to the pinned page:
the new struct page corresponding to the same virtual address will not
be the one that was cached.

So here, pinning means that we hold a reference to the page, and that
the virtual-to-physical address translation is kept stored in the
user-space driver. Once a buffer is pinned, you just tell the NIC to
DMA the corresponding pages using the physical addresses you
"remember".

>> F1 and maps F2 in B. P sends B again.
>> So there the NIC assumes the physical addresses of B are still
>> valid: it may therefore actually send the content of F1, or more
>> likely junk ... which you absolutely *never* want, of course.
>>
>> So would you drop zero-copy protocols? That would be a real waste. Instead,
>> we need a way to monitor changes in the virtual memory mapping so that
>> the networking lib may invalidate the buffers when needed. There are
>> currently various approaches to that problem:
>> For glibc users (that's already a restriction), just use the glibc
>> hooks: when you call a remap, a free or whatever that *may* change
>> the mapping, you first call a hook that keeps the cache valid.
>
> OK... so this means that the unmapping operation is not the normal
> unmapping, but a modified one... the buffer (which is a userspace
> virtual address) is delinked from process P's VMA linked list
> (in the kernel), but the content is still there, and
> nobody owns it.   The next time process P tries to map it again, it will
> use the same buffer, and because the TLB is not flushed, the
> translation - virtual address to physical content - will be the
> same again.   Is that correct?
>
> But the problem is: how does glibc's remap() operation know which cached
> entry to reuse in its remapping operation?   There could be many unmapping
> operations done before, all of which will generate a cached copy of the
> buffer without really doing any unmapping, right?

This is actually at a much higher level than the TLB and so on; think of
that cache as something much simpler: it is just a list of
pinned buffers. When you send a buffer, you first scan the list to see
whether the cache holds an entry that covers the given address
range. If so, just send the corresponding pages without
asking the driver to translate the addresses; if not, register the
buffer in the cache and send it as well.
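
Concretely, each entry of that list could look something like this
(purely illustrative user-space C, all names invented):

#include <stddef.h>
#include <stdint.h>

/* One entry of the registration ("pin-down") cache kept by the library */
struct reg_cache_entry {
	void *uaddr;                   /* start of the pinned buffer (virtual) */
	size_t len;                    /* length of the buffer in bytes */
	uint64_t *phys_pages;          /* physical address of each page, as */
	unsigned int nr_pages;         /* returned by the driver at pin time */
	struct reg_cache_entry *next;  /* simple singly-linked list */
};

/* Find an entry whose pinned range covers [addr, addr + len[ */
static struct reg_cache_entry *
reg_cache_lookup(struct reg_cache_entry *head, const char *addr, size_t len)
{
	struct reg_cache_entry *e;

	for (e = head; e != NULL; e = e->next)
		if (addr >= (const char *)e->uaddr &&
		    addr + len <= (const char *)e->uaddr + e->len)
			return e;
	return NULL;
}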

So, here would be a sketchy _user-space_ function of the network
library which tells the NIC which pages to DMA:

msg_send(addr, len)
{
   for each buffer in the list of pinned buffers {
      if [addr, addr + len[ is included in the buffer {
         /* cache hit: reuse the stored translation */
         send the list of physical pages to the NIC
         return;
      }
   }

   /* cache miss: the buffer was not in the cache */
   pin the buffer
   add the buffer and its memory translation (the list of physical
   page addresses) to the cache
   send the list of physical pages to the NIC
}

The glibc hooks are just wrappers around the usual calls; for instance,
here is what the munmap hook could look like:

munmap_hook(addr, len, ...)
{
   for each buffer in the list of pinned buffers {
      if [addr, addr + len[ intersects the buffer
         invalidate the buffer (drop it from the list and unpin it)
   }

   munmap(addr, len, ...);
}

So the hook eventually calls the usual munmap function, but we first
scan the cache (which is a simple list) to evict any buffer
potentially invalidated by the munmap operation.
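
For what it's worth, one common way to interpose such a hook without
patching glibc itself is symbol interposition through LD_PRELOAD: the
library provides its own munmap() which cleans the cache and then
forwards to the real one. A minimal sketch, where
reg_cache_invalidate_range() is a hypothetical helper of the network
library:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/mman.h>

/* Hypothetical helper: drop and unpin every cache entry
 * intersecting [addr, addr + len[ */
extern void reg_cache_invalidate_range(void *addr, size_t len);

int munmap(void *addr, size_t len)
{
	static int (*real_munmap)(void *, size_t);

	if (!real_munmap)
		real_munmap = (int (*)(void *, size_t))
			dlsym(RTLD_NEXT, "munmap");

	/* Evict any pinned buffer this unmap would invalidate... */
	reg_cache_invalidate_range(addr, len);

	/* ... then perform the real unmap */
	return real_munmap(addr, len);
}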

>> This is often not possible, as some applications are already statically
>> linked with the glibc, so you cannot use those hooks afterwards.
>> Some applications also use such hooks themselves, so the network
>> lib cannot overwrite them ... And imagine you have various network
>> drivers which all need to use the hooks (a process using both IB
>> and myri10g, for instance).
>>
>> For all those reasons, we do need a facility to monitor changes in the
>> virtual memory mapping in the context of high performance networking,
>> or even for GPGPUs or any device which heavily relies on DMA transfers
>> between the device and user space.
>>
>> I was mentioning heterogeneous networks (eg. myri10g + IB + ...), but
>> in the GPGPU era you have GPGPU + NICs possibly used by the same
>> process in the case of a GPU-enabled cluster.
>>
>> Sorry for such a lengthy mail, I hope it was not too unclear and that you
>> are now convinced we DO want the mmu notifiers :)
>>
>
> I like the sharing....

That must be the intent of such a mailing list after all :)

>
>> Cheers,
>> Cédric
>>
>
> Any references I can read for more knowledge in this
> area?   I think I must be wrong, as all this sounds very new to me.
>

Well, even if those are rather simple concepts after all (even though
it may not look so given my explanations), I guess that's something really
common... I've personally been spending some time on that
amusing problem, but apart from networking guys there must not be
many users of that technique.

Historically, the idea of maintaining a "pin-down" cache was first
introduced by Tezuka et al.
"Pin-down Cache: A Virtual Memory Management Technique for Zero-copy
Communication" (1998)
http://www.pccluster.org/score/papers/tezuka-ipps98.ps.gz

Brice Goglin did some nice things about distributed file systems on
Myrinet networks during his PhD.
Since he was working in the kernel, where there is no such thing as
glibc hooks, he wrote a "VMA spy" patch whose intent
was very similar to the MMU notifiers: when a monitored VMA is
modified, the cache is scanned to evict possibly invalid buffers.
"An Efficient Network API for in-Kernel Applications in Clusters" (2005)
http://hal.inria.fr/inria-00070445

Personally, I tried to extend that work for the Myrinet eXpress (MX)
library at the user level: the user process puts "hooks" on VMAs, and
when a VMA is modified (for instance when a file is unmapped), the
hook invalidates the cache at the driver level in the kernel, so that
the user level library can detect it later on.
"Interval-based registration cache for zero-copy protocols" (2007)
http://runtime.bordeaux.inria.fr/augonnet/myricom/Rapport-StageM1.pdf

Of course, Myrinet people are not the only ones using that technique;
there is similar work for other networks, see Wyckoff's work on
InfiniBand:
"Memory registration caching correctness" (2005)
http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf

Also, it is interesting to note that Quadrics NICs maintain a copy of
the process page table in the NIC memory; keeping this table
consistent also required an MMU-notifiers-like patch, which
was called "IOProc":
http://lkml.org/lkml/2005/4/26/198
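
Now that the mmu notifiers are merged, the kernel part of such drivers
can get the same kind of invalidation without any help from user space.
A minimal sketch, assuming the API as merged in 2.6.27 (all the
driver-side names are invented, and a real driver would keep one
notifier per mm rather than a single static one):

#include <linux/mmu_notifier.h>
#include <linux/mm.h>
#include <linux/sched.h>

/* Hypothetical driver-side helper: drop every pinned/cached translation
 * intersecting [start, end[ in this mm */
extern void drv_reg_cache_invalidate(struct mm_struct *mm,
				     unsigned long start, unsigned long end);

static void drv_invalidate_range_start(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end)
{
	/* Called before the kernel changes the mapping of [start, end[:
	 * munmap, mremap, page reclaim, ...  Evict the stale entries. */
	drv_reg_cache_invalidate(mm, start, end);
}

static const struct mmu_notifier_ops drv_mmu_ops = {
	.invalidate_range_start = drv_invalidate_range_start,
};

static struct mmu_notifier drv_mn = { .ops = &drv_mmu_ops };

/* Called for instance when the process opens the device */
static int drv_watch_current_mm(void)
{
	return mmu_notifier_register(&drv_mn, current->mm);
}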

Who said there was nobody apart from virtualization asking for an
MMU-notifiers facility? :) I'm sure there are many other users, not only
in networking!

Cheers,
Cédric
