2008/7/27 Mulyadi Santosa <mulyadi.santosa@xxxxxxxxx>:
> Hi...
>
> On Sun, Jul 27, 2008 at 12:49 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>> Hi, Arn,
>>
>> So have you done anything so far? Frankly, I don't understand the
>> purpose of the MMU notification patch - can anyone explain?
>>
>> For example, one particular explanation is given here:
>>
>> http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-01/msg04810.html
>>
>> So my question is: is the "shadow pagetable" referring to KVM/Xen's
>> shadow pagetable? A Linux kernel without virtualization does not
>> have a "shadow pagetable", right?
>
> Yes, the shadow pagetable is a "fake" pgd/pmd/pte created by the
> hypervisor to mimic the work of the MMU on real hardware.
>
> So, by using the MMU notifier, the best I can conclude after quickly
> reading Andrea's message... is that it lets you notice when a certain
> memory region is freed from the shadow pagetables.
>
> regards,
>
> Mulyadi.

Hi,

High performance networks are also a perfect target. Given the need to reduce CPU usage and to avoid useless, costly copies, high speed network interface cards (NICs) DMA data directly from or to user-space memory (these are usually called "zero-copy" protocols, and RDMA is one form of zero-copy protocol).

Imagine that a process wants to send some data through a network card: from the application's perspective, this data is described, for instance, as a virtual address interval (start virtual address + length). Unfortunately, most NICs cannot manipulate virtual addresses, only addresses in the PCI space. So, in order to send a message, the network library has to do the following (a rough kernel-side sketch of these steps appears further below):

1 - Do a system call into the NIC driver.
2 - For each page in the virtual address interval, perform a get_user_pages(), which "pins" the page so that it cannot be swapped out and lets the driver translate the virtual address into a PCI address the NIC can manipulate. This is also called memory registration or memory pinning.
3 - The NIC is given the list of physical addresses (in PCI space) so that it may DMA all the data directly from the application's address space.
4 - When the communication is finished, the driver puts all the pages back so that they can be reclaimed if needed.

Now imagine that, for some reason, an application always reuses the same buffers (this is rather likely). Continuously registering and unregistering the very same buffer is a serious overhead we cannot cope with in the context of high performance networking. To avoid that, various people have started using a "registration cache" to avoid useless memory (de)registration: the last n buffers are kept pinned, so when the NIC needs one of those buffers again it already has a virtual-to-physical address translation, which greatly reduces the system overhead for that communication.

Unfortunately, imagine the following scenario: process P maps file F1 into buffer B, P sends B over the network (B is still pinned), P unmaps F1 and maps F2 into B, then P sends B again. The NIC assumes the physical addresses of B are still valid: it may therefore actually send the content of F1, or more likely junk... which you absolutely *never* want, of course.

So, would you drop zero-copy protocols? That would be a real waste. Instead, we need a way to monitor changes in the virtual memory mapping so that the networking library can invalidate its cached buffers when needed.
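To make this a little more concrete, here is a rough, hypothetical driver-side sketch of steps 1-4 above. Everything named nic_* is made up for illustration; get_user_pages_fast(), dma_map_page(), dma_unmap_page() and put_page() are real kernel interfaces, but their exact signatures have varied across kernel versions, so take this as a sketch rather than working code:

	#include <linux/mm.h>
	#include <linux/slab.h>
	#include <linux/dma-mapping.h>

	/* Hypothetical: hand the bus-address list to the NIC and wait for
	 * the send to complete (a real driver would do this asynchronously,
	 * splitting step 4 off into the completion handler). */
	extern void nic_post_send_and_wait(dma_addr_t *bus, int npages,
					   unsigned long offset, size_t len);

	/* Step 1 ends up here: called from the driver's ioctl/syscall path. */
	static int nic_register_and_send(struct device *dev,
					 unsigned long uaddr, size_t len)
	{
		int npages = (int)((((uaddr + len - 1) >> PAGE_SHIFT) -
				    (uaddr >> PAGE_SHIFT)) + 1);
		struct page **pages;
		dma_addr_t *bus;
		int i, pinned, ret = -ENOMEM;

		pages = kcalloc(npages, sizeof(*pages), GFP_KERNEL);
		bus = kcalloc(npages, sizeof(*bus), GFP_KERNEL);
		if (!pages || !bus)
			goto out;

		/* Step 2: pin the user pages (memory registration). */
		pinned = get_user_pages_fast(uaddr & PAGE_MASK, npages,
					     FOLL_WRITE, pages);
		if (pinned != npages) {
			for (i = 0; i < pinned; i++)
				put_page(pages[i]);
			ret = -EFAULT;
			goto out;
		}

		/* Step 3: translate to PCI/bus addresses, let the NIC DMA. */
		for (i = 0; i < npages; i++)
			bus[i] = dma_map_page(dev, pages[i], 0, PAGE_SIZE,
					      DMA_TO_DEVICE);
		nic_post_send_and_wait(bus, npages, uaddr & ~PAGE_MASK, len);

		/* Step 4: communication finished, release the pages. */
		for (i = 0; i < npages; i++) {
			dma_unmap_page(dev, bus[i], PAGE_SIZE, DMA_TO_DEVICE);
			put_page(pages[i]);
		}
		ret = 0;
	out:
		kfree(bus);
		kfree(pages);
		return ret;
	}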
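And this is roughly what the consumer side of the mmu notifiers discussed in this thread could look like for keeping such a registration cache coherent. reg_cache_invalidate() is a hypothetical helper that drops any pinned translation overlapping the range; the callback signature shown follows the original notifier patches and has changed in later kernels:

	#include <linux/mm.h>
	#include <linux/slab.h>
	#include <linux/mmu_notifier.h>

	/* Hypothetical: forget any cached/pinned translation overlapping
	 * [start, end) for this mm. */
	extern void reg_cache_invalidate(struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end);

	static void nic_invalidate_range_start(struct mmu_notifier *mn,
					       struct mm_struct *mm,
					       unsigned long start,
					       unsigned long end)
	{
		/* The mapping for [start, end) is about to change (munmap,
		 * mremap, swap-out, ...): any cached registration for that
		 * range, like buffer B after F1 is unmapped, is now stale. */
		reg_cache_invalidate(mm, start, end);
	}

	static const struct mmu_notifier_ops nic_mn_ops = {
		.invalidate_range_start = nic_invalidate_range_start,
	};

	/* Called once per process using the NIC, e.g. at open() time. */
	static int nic_watch_mm(struct mm_struct *mm)
	{
		struct mmu_notifier *mn;
		int ret;

		mn = kzalloc(sizeof(*mn), GFP_KERNEL);
		if (!mn)
			return -ENOMEM;
		mn->ops = &nic_mn_ops;

		/* From now on the kernel calls us back whenever mappings in
		 * this mm change, so the cache can never silently go stale. */
		ret = mmu_notifier_register(mn, mm);
		if (ret)
			kfree(mn);
		return ret;
	}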
There are currently various approaches to that problem.

For glibc users (that's already a restriction), just use the glibc hooks: whenever mremap, free, or anything else that *may* change the mapping is called, a hook runs first and keeps the cache valid (a minimal user-space sketch of this idea is appended at the end of this mail). This is often not possible, as some applications are already statically linked against glibc, so you cannot install those hooks afterwards. Also, some applications already use such hooks themselves, so the network library cannot override them... And imagine you have several network drivers with the same need for hooks (a process using both IB and myri10g, for instance).

For all those reasons, we do need a facility to monitor changes in the virtual memory mapping in the context of high performance networking, or even for GPGPUs or any device which relies heavily on DMA transfers between the device and user space. I was mentioning heterogeneous networks (e.g. myri10g + IB + ...), but in the GPGPU era you can have a GPGPU + NICs used by the same process, in the case of a GPU-enabled cluster.

Sorry for such a lengthy mail, I hope it was not too unclear and that you are now convinced we DO want the mmu notifiers :)

Cheers,
Cédric

--
To unsubscribe from this list: send an email with "unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ
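For completeness, here is a minimal user-space sketch of the hook/interposition workaround mentioned above, wrapping munmap() via dlsym(RTLD_NEXT, ...). reg_cache_flush() is a hypothetical library function; the interposition technique itself is standard, and its limits (static linking, two libraries fighting over the same hook) are exactly the problems described in the mail:

	#define _GNU_SOURCE
	#include <dlfcn.h>
	#include <sys/mman.h>

	/* Hypothetical: drop any pinned translation covering [addr, addr+len). */
	void reg_cache_flush(void *addr, size_t len);

	/* The network library's munmap() shadows the libc one, so the cache
	 * can be flushed before the mapping really goes away. */
	int munmap(void *addr, size_t len)
	{
		static int (*real_munmap)(void *, size_t);

		if (!real_munmap)
			real_munmap = (int (*)(void *, size_t))
					dlsym(RTLD_NEXT, "munmap");

		reg_cache_flush(addr, len);

		return real_munmap(addr, len);
	}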