Thank you for sharing....enough for me to last a week of research....thanks again :-).

On Tue, Jul 29, 2008 at 11:40 PM, Cédric Augonnet <cedric.augonnet@xxxxxxxxx> wrote:
> Ahh, at least we got them included in Linus' tree :)
>
> Sorry for the delayed answer ...
>
> 2008/7/28 Peter Teoh <htmldeveloper@xxxxxxxxx>:
>> On Sun, Jul 27, 2008 at 9:24 PM, Cédric Augonnet
>> <cedric.augonnet@xxxxxxxxx> wrote:
> [...]
>>>
>>> Hi,
>>>
>>> High performance networks are also a perfect target. Given the need
>>> to reduce CPU usage and to avoid useless, costly copies, high-speed
>>> network interface cards (NICs) DMA data directly from or to
>>> user-space memory (these are usually called "zero-copy" protocols, and
>>> RDMA is one form of zero-copy protocol).
>>>
>>> Imagine that a process wants to send some data to a network card:
>>> from the application's perspective, this data is for instance described
>>> as a virtual address interval (start virtual address + length).
>>> Unfortunately, most NICs cannot manipulate virtual addresses, only
>>> addresses in the PCI space.
>>>
>>> So in order to send a message, the network library would have to do
>>> the following:
>>> 1 - Do a system call into the NIC driver.
>>> 2 - For each page in the virtual address range, perform a "get_user_page",
>>> which translates the virtual address into a PCI address that the
>>> NIC can manipulate. get_user_page also "pins" the memory so that
>>> it cannot be swapped out, for instance. This is also called memory
>>> registration or memory pinning.
>>> 3 - The NIC is given the list of physical addresses (in PCI space)
>>> so that it may DMA all the data directly from the application's address
>>> space.
>>> 4 - When the communication is finished, the driver puts all the pages
>>> back so that they can be reclaimed if needed.
>>>
>>> Now imagine that, for some reason, an application always reuses the
>>> same buffers (this is rather likely). Continuously registering and
>>> unregistering the very same buffer is a serious overhead we cannot
>>> cope with in the context of high performance networking.
>>>
>>> To avoid that, various people started using a "registration cache" to
>>> avoid useless memory (de)registration. The last n buffers are kept
>>> pinned, so when the NIC needs those buffers again, it already has a
>>> virtual-to-physical address translation, which greatly reduces the
>>> system overhead for that communication.
>>>
>>> Unfortunately, imagine the following scenario: process P maps file F1
>>> in buffer B, P sends B over the network (B is still pinned), P unmaps
>>
>> Sorry... I don't understand this part.... Pinning the buffer and
>> unmapping are done in sync, right?
>> In other words, pinning is done in the kernel - in the VM layer (or your
>> driver) - and then each process will individually map and unmap
>> independently, in userspace.
>> So if you unmap the memory, the memory is immediately gone - no
>> longer in the process's VMA linked list (inside the kernel) - so when the
>> swapper goes through the VMA linked list, there is nothing for it to
>> consider whether to swap out or not, right? So what is the meaning of
>> a buffer being in the "pinned" state here?
>>
>
> To make things simpler, let's consider a message that fits in a single
> page in the process address space. The pinning operation just does
> a get_user_page on that page: this increments the refcount on the
> page structure and translates the page's virtual address to a
> physical address.
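
To make that pinning step concrete, here is a rough, untested sketch of what a
driver might do for one page, using the 2.6-era get_user_pages() and
dma_map_page() calls this thread is about. The helper names and the struct
device argument are made up for illustration, and the get_user_pages()
signature has changed in later kernels:

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/dma-mapping.h>

/* Pin one user page and obtain a bus address the NIC can DMA to/from. */
static int pin_one_page(struct device *dev, unsigned long uaddr,
                        struct page **pagep, dma_addr_t *bus_addr)
{
        struct page *page;
        int ret;

        down_read(&current->mm->mmap_sem);
        /* Takes a reference on the struct page so it cannot be reclaimed. */
        ret = get_user_pages(current, current->mm, uaddr & PAGE_MASK,
                             1, 1, 0, &page, NULL);
        up_read(&current->mm->mmap_sem);
        if (ret != 1)
                return -EFAULT;

        /* Translate to an address the device can use for DMA. */
        *bus_addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
        *pagep = page;
        return 0;
}

/* Step 4 above: once the communication is over, give the page back. */
static void unpin_one_page(struct device *dev, struct page *page,
                           dma_addr_t bus_addr)
{
        dma_unmap_page(dev, bus_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
        put_page(page);
}
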
> Having such a reference, the page will not be swapped out in case
> of memory pressure. However, if the VMA is remapped by the
> application, that reference will not be held anymore, and the new
> struct page corresponding to the same virtual address will not be the
> one that was cached.
>
> So here, pinning means that we hold a reference to the page, and that
> the virtual-to-physical address translation is kept stored in the
> user-space driver. If you have pinned some data, you just tell the NIC to
> DMA the corresponding pages, given the physical addresses which you
> "remember".
>
>>> F1 and maps F2 in B. P sends B again.
>>> So there the NIC assumes the physical addresses of B are still
>>> valid: it may therefore actually send the content of F1, or more
>>> likely junk ... which you absolutely *never* want, of course.
>>>
>>> So would you drop zero-copy protocols? That would be a real waste. Instead,
>>> we need a way to monitor changes in the virtual memory mapping so that
>>> the networking lib may invalidate the buffers when needed. There are
>>> currently various approaches to that problem:
>>> For glibc users (that's already a restriction), just use the glibc
>>> hooks: when you call a remap, a free or whatever which *may* change
>>> the mapping, you first call a hook that will keep the cache valid.
>>
>> OK.... So this means that the unmapping operation is not the normal
>> unmapping....but a modified one....where the buffer (which is a userspace
>> virtual address) is delinked from process P's VMA linked list of
>> memory buffers (in the kernel), but the content is still there, and
>> nobody owns it. The next time process P tries to map it again, it will
>> reuse the same buffer, and because the TLB is not flushed, the
>> translation will be the same, i.e. the virtual address used to access the
>> physical content will be the same again. Is that correct?
>>
>> But the problem is: how does glibc's remap() operation know which cache
>> to reuse in its remapping operation? There could be many unmapping
>> operations done before, all of which will generate a cached copy of the
>> buffer without really doing any unmapping, right?
>
> This is actually at a much higher level than the TLB and so on; think of
> that cache as something much simpler: it is just a list of
> pinned buffers. When you send a buffer, you first scan the list to see
> whether you have a corresponding entry that covers the given address
> range in your cache; if so, just send the corresponding pages without
> asking the driver to translate the addresses; if not, register the
> buffer in the cache and send it as well.
>
> So, here is a sketchy _user-space_ function of the network
> library which would tell the NIC which pages to DMA:
>
> msg_send(addr, len)
> {
>     for all buffers in the list of pinned buffers {
>         if [addr, addr + len[ is included in the buffer {
>             /* use the cache */
>             send the list of physical pages to the NIC
>             return;
>         }
>     }
>
>     /* the buffer was not in the cache */
>     pin the buffer
>     add the buffer and its memory translation (the list of physical
>     page addresses) to the cache
>     send the list of physical pages to the NIC
> }
>
> The glibc hooks are just wrappers around the usual calls; for instance,
> here is what the munmap hook could look like:
>
> munmap_hook(addr, len, ...)
> {
>     for all buffers in the list of pinned buffers {
>         if [addr, addr + len[ intersects the buffer
>             invalidate the buffer (drop it from the list and unpin it)
>     }
>
>     munmap(addr, len, ...);
> }
>
> So the hook eventually calls the usual munmap function, but we first
> scan the cache (which is a simple list) to evict any buffer
> potentially invalidated by the munmap operation.
>
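
To make this concrete, here is a rough, untested user-space sketch of such a
pin-down cache together with its munmap wrapper. Everything driver-specific is
hypothetical: driver_pin_buffer(), driver_unpin_buffer(), driver_post_send()
and the "cookie" handle stand in for whatever ioctl() interface a real NIC
driver exposes; only the caching logic itself mirrors the pseudo-code above:

#include <stdlib.h>
#include <sys/mman.h>

/* One entry of the registration ("pin-down") cache. */
struct reg_entry {
        char *addr;                   /* start of the pinned buffer        */
        size_t len;                   /* its length                        */
        unsigned long cookie;         /* driver handle for the translation */
        struct reg_entry *next;
};

/* Hypothetical driver entry points (in reality, ioctl()s on the NIC device). */
extern int  driver_pin_buffer(void *addr, size_t len, unsigned long *cookie);
extern void driver_unpin_buffer(unsigned long cookie);
extern void driver_post_send(unsigned long cookie, void *addr, size_t len);

static struct reg_entry *cache;       /* the "list of pinned buffers" */

/* msg_send(): reuse the cache when [addr, addr + len[ is already pinned. */
void msg_send(void *buf, size_t len)
{
        char *addr = buf;
        struct reg_entry *e;

        for (e = cache; e; e = e->next) {
                if (addr >= e->addr && addr + len <= e->addr + e->len) {
                        driver_post_send(e->cookie, buf, len);  /* cache hit */
                        return;
                }
        }

        /* Cache miss: pin the buffer and remember its translation. */
        e = calloc(1, sizeof(*e));
        if (!e)
                return;
        e->addr = addr;
        e->len = len;
        driver_pin_buffer(buf, len, &e->cookie);
        e->next = cache;
        cache = e;
        driver_post_send(e->cookie, buf, len);
}

/* munmap() wrapper: evict any cached buffer that this unmap invalidates. */
int munmap_hook(void *buf, size_t len)
{
        char *addr = buf;
        struct reg_entry **pe = &cache, *e;

        while ((e = *pe) != NULL) {
                if (addr < e->addr + e->len && e->addr < addr + len) {
                        *pe = e->next;            /* drop it from the list */
                        driver_unpin_buffer(e->cookie);
                        free(e);
                } else {
                        pe = &e->next;
                }
        }
        return munmap(buf, len);      /* finally perform the real unmap */
}

A real library would install such a wrapper through the glibc hooks Cédric
mentions (or by symbol interposition, e.g. LD_PRELOAD), which is precisely
what becomes fragile in the cases he lists next.
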
>>> This is often not possible, as some applications are already statically
>>> linked with glibc, so you cannot use those hooks afterwards.
>>> Also, some applications already use such hooks themselves, so the network
>>> lib cannot overwrite them ... Also, imagine you have various network
>>> drivers which all need to use the hooks (a process using both IB
>>> and myri10g, for instance).
>>>
>>> For all those reasons, we do need a facility to monitor changes in the
>>> virtual memory mapping in the context of high performance networking,
>>> or even for GPGPUs or any device which heavily relies on DMA transfers
>>> between the device and user-space.
>>>
>>> I was mentioning heterogeneous networks (e.g. myri10g + IB + ...), but
>>> in the GPGPU era you have GPGPU + NICs possibly used by the same
>>> process in the case of a GPU-enabled cluster.
>>>
>>> Sorry for such a lengthy mail, I hope it was not too unclear and that you
>>> are now convinced we DO want the mmu notifiers :)
>>>
>>
>> I like the sharing....
>
> That must be the intent of such a ML, after all :)
>
>>
>>> Cheers,
>>> Cédric
>>>
>>
>> Any references which I can read further for more knowledge in this
>> area? I think I must be wrong, as all this sounds very new to me.
>>
>
> Well, even if those are rather simple concepts after all (even though
> it may not look so given my explanations), that's something really
> common, I guess... I've personally been spending some time on that
> amusing problem, but apart from networking guys, there must not be
> many users of that technique.
>
> Historically, the idea of maintaining a "pin-down" cache was first
> introduced by Tezuka et al.:
> "Pin-down Cache: A Virtual Memory Management Technique for Zero-copy
> Communication" (1998)
> http://www.pccluster.org/score/papers/tezuka-ipps98.ps.gz
>
> Brice Goglin did some nice things about distributed file systems on
> Myrinet networks during his PhD.
> Since he was working in the kernel, where there is no such thing
> as glibc hooks, he did a "VMA spy" patch whose intent
> was very similar to the MMU notifiers: when a monitored VMA is
> modified, the cache is scanned to evict possibly invalid buffers.
> "An Efficient Network API for in-Kernel Applications in Clusters" (2005)
> http://hal.inria.fr/inria-00070445
>
> Personally, I tried to extend that work for the Myrinet eXpress (MX)
> library at the user level: the user process puts "hooks" on VMAs, and
> when a VMA is modified (for instance when a file is unmapped), the
> hook invalidates the cache at the driver level in the kernel, so that
> the user-level library can detect it later on.
> "Interval-based registration cache for zero-copy protocols" (2007)
> http://runtime.bordeaux.inria.fr/augonnet/myricom/Rapport-StageM1.pdf
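
Since the MMU notifiers are what this whole thread is about, here is a rough
sketch of how a driver might hook into them to get exactly that kind of
invalidation callback, in the spirit of the 2.6.27-era API that had just been
merged. The callback signatures have changed in later kernels, and
my_cache_evict() is a made-up placeholder for the driver's own eviction logic:

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/mmu_notifier.h>

/* Hypothetical: invalidate any cached translation overlapping [start, end[. */
static void my_cache_evict(unsigned long start, unsigned long end);

static void my_invalidate_range_start(struct mmu_notifier *mn,
                                      struct mm_struct *mm,
                                      unsigned long start, unsigned long end)
{
        /*
         * The mapping of [start, end[ is about to change (munmap, remap,
         * reclaim, ...): any virtual-to-physical translation cached for
         * that range can no longer be trusted, so mark it invalid here and
         * let the user-level library notice it on its next lookup.
         */
        my_cache_evict(start, end);
}

static const struct mmu_notifier_ops my_mn_ops = {
        .invalidate_range_start = my_invalidate_range_start,
};

static struct mmu_notifier my_mn = {
        .ops = &my_mn_ops,
};

/* Typically called once per address space we care about, e.g. at open() time. */
static int my_register_notifier(void)
{
        return mmu_notifier_register(&my_mn, current->mm);
}
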
> "Interval-based registration cache for zero-copy protocols" (2007) > http://runtime.bordeaux.inria.fr/augonnet/myricom/Rapport-StageM1.pdf > > Of course, not only myrinet people are using that technique, there are > similar works for others networks, see Wyckoff's work for this on > InfiniBand : > "Memory registration caching correctness" (2005) > http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf > > Also, it is interesting to note that Quadrics NICs maintain a copy of > the process page table in the NIC memory, maintaining this table > consistent also required the use of a MMU-notifiers like patch which > was called "IOProc" > http://lkml.org/lkml/2005/4/26/198 > > Who said there was nobody asking for some MMU-notifiers facility > appart virtualization ? :) I'm sure there are many others, not only in > networking ! > > Cheers, > Cédric > -- Regards, Peter Teoh -- To unsubscribe from this list: send an email with "unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx Please read the FAQ at http://kernelnewbies.org/FAQ