2008/7/27 Mulyadi Santosa <mulyadi.santosa@xxxxxxxxx>:
> Hi...
>
> On Sun, Jul 27, 2008 at 12:49 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>> Hi, Arn,
>>
>> So have you done anything so far? Frankly, I don't understand the
>> purpose of the MMU notification patch - can anyone explain?
>>
>> For example, one particular explanation is given here:
>>
>> http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-01/msg04810.html
>>
>> So my question is: is the "shadow pagetable" referring to KVM/Xen's
>> shadow pagetable? A Linux kernel without virtualization does not
>> have a "shadow pagetable", right?
>
> Yes, the shadow pagetable is a "fake" pgd/pmd/pte created by the
> hypervisor to mimic the work of the MMU on real hardware.
>
> So, by using the MMU notifier, the best I can conclude after quickly
> reading Andrea's message... is that it lets you notice when a certain
> memory region is freed from the shadow pagetables.
>
> regards,
>
> Mulyadi.

Hi,

High performance networks are also a perfect target. Given the need to reduce CPU usage and to avoid useless, costly copies, high speed network interface cards (NICs) DMA data directly from or to user-space memory (these are usually called "zero-copy" protocols, and RDMA is one form of zero-copy protocol).

Imagine that a process wants to send some data through a network card: from the application's perspective, this data is described, for instance, as a virtual address interval (start virtual address + length). Unfortunately, most NICs cannot manipulate virtual addresses, only addresses in the PCI space. So, in order to send a message, the network library has to do the following (a rough kernel-side sketch of these steps appears further below):

1 - Do a system call into the NIC driver.
2 - For each page in the virtual address interval, perform a get_user_pages(), which "pins" the page so that it cannot be swapped out and lets the driver translate the virtual address into a PCI address the NIC can manipulate. This is also called memory registration or memory pinning.
3 - The NIC is given the list of physical addresses (in PCI space) so that it may DMA all the data directly from the application's address space.
4 - When the communication is finished, the driver puts all the pages back so that they can be reclaimed if needed.

Now imagine that, for some reason, an application always reuses the same buffers (this is rather likely). Continuously registering and unregistering the very same buffer is a serious overhead we cannot cope with in the context of high performance networking. To avoid that, various people have started using a "registration cache" to avoid useless memory (de)registration: the last n buffers are kept pinned, so when the NIC needs one of those buffers again it already has a virtual-to-physical address translation, which greatly reduces the system overhead for that communication.

Unfortunately, imagine the following scenario: process P maps file F1 into buffer B, P sends B over the network (B is still pinned), P unmaps F1 and maps F2 into B, then P sends B again. The NIC assumes the physical addresses of B are still valid: it may therefore actually send the content of F1, or more likely junk... which you absolutely *never* want, of course.

So, would you drop zero-copy protocols? That would be a real waste. Instead, we need a way to monitor changes in the virtual memory mapping so that the networking library can invalidate its cached buffers when needed.
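To make this a little more concrete, here is a rough, hypothetical driver-side sketch of steps 1-4 above. Everything named nic_* is made up for illustration; get_user_pages_fast(), dma_map_page(), dma_unmap_page() and put_page() are real kernel interfaces, but their exact signatures have varied across kernel versions, so take this as a sketch rather than working code:

	#include <linux/mm.h>
	#include <linux/slab.h>
	#include <linux/dma-mapping.h>

	/* Hypothetical: hand the bus-address list to the NIC and wait for
	 * the send to complete (a real driver would do this asynchronously,
	 * splitting step 4 off into the completion handler). */
	extern void nic_post_send_and_wait(dma_addr_t *bus, int npages,
					   unsigned long offset, size_t len);

	/* Step 1 ends up here: called from the driver's ioctl/syscall path. */
	static int nic_register_and_send(struct device *dev,
					 unsigned long uaddr, size_t len)
	{
		int npages = (int)((((uaddr + len - 1) >> PAGE_SHIFT) -
				    (uaddr >> PAGE_SHIFT)) + 1);
		struct page **pages;
		dma_addr_t *bus;
		int i, pinned, ret = -ENOMEM;

		pages = kcalloc(npages, sizeof(*pages), GFP_KERNEL);
		bus = kcalloc(npages, sizeof(*bus), GFP_KERNEL);
		if (!pages || !bus)
			goto out;

		/* Step 2: pin the user pages (memory registration). */
		pinned = get_user_pages_fast(uaddr & PAGE_MASK, npages,
					     FOLL_WRITE, pages);
		if (pinned != npages) {
			for (i = 0; i < pinned; i++)
				put_page(pages[i]);
			ret = -EFAULT;
			goto out;
		}

		/* Step 3: translate to PCI/bus addresses, let the NIC DMA. */
		for (i = 0; i < npages; i++)
			bus[i] = dma_map_page(dev, pages[i], 0, PAGE_SIZE,
					      DMA_TO_DEVICE);
		nic_post_send_and_wait(bus, npages, uaddr & ~PAGE_MASK, len);

		/* Step 4: communication finished, release the pages. */
		for (i = 0; i < npages; i++) {
			dma_unmap_page(dev, bus[i], PAGE_SIZE, DMA_TO_DEVICE);
			put_page(pages[i]);
		}
		ret = 0;
	out:
		kfree(bus);
		kfree(pages);
		return ret;
	}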
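And this is roughly what the consumer side of the mmu notifiers discussed in this thread could look like for keeping such a registration cache coherent. reg_cache_invalidate() is a hypothetical helper that drops any pinned translation overlapping the range; the callback signature shown follows the original notifier patches and has changed in later kernels:

	#include <linux/mm.h>
	#include <linux/slab.h>
	#include <linux/mmu_notifier.h>

	/* Hypothetical: forget any cached/pinned translation overlapping
	 * [start, end) for this mm. */
	extern void reg_cache_invalidate(struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end);

	static void nic_invalidate_range_start(struct mmu_notifier *mn,
					       struct mm_struct *mm,
					       unsigned long start,
					       unsigned long end)
	{
		/* The mapping for [start, end) is about to change (munmap,
		 * mremap, swap-out, ...): any cached registration for that
		 * range, like buffer B after F1 is unmapped, is now stale. */
		reg_cache_invalidate(mm, start, end);
	}

	static const struct mmu_notifier_ops nic_mn_ops = {
		.invalidate_range_start = nic_invalidate_range_start,
	};

	/* Called once per process using the NIC, e.g. at open() time. */
	static int nic_watch_mm(struct mm_struct *mm)
	{
		struct mmu_notifier *mn;
		int ret;

		mn = kzalloc(sizeof(*mn), GFP_KERNEL);
		if (!mn)
			return -ENOMEM;
		mn->ops = &nic_mn_ops;

		/* From now on the kernel calls us back whenever mappings in
		 * this mm change, so the cache can never silently go stale. */
		ret = mmu_notifier_register(mn, mm);
		if (ret)
			kfree(mn);
		return ret;
	}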
There are currently various approaches to that problem.

For glibc users (that's already a restriction), just use the glibc hooks: whenever mremap, free, or anything else that *may* change the mapping is called, a hook runs first and keeps the cache valid (a minimal user-space sketch of this idea is appended at the end of this mail). This is often not possible, as some applications are already statically linked against glibc, so you cannot install those hooks afterwards. Also, some applications already use such hooks themselves, so the network library cannot override them... And imagine you have several network drivers with the same need for hooks (a process using both IB and myri10g, for instance).

For all those reasons, we do need a facility to monitor changes in the virtual memory mapping in the context of high performance networking, or even for GPGPUs or any device which relies heavily on DMA transfers between the device and user space. I was mentioning heterogeneous networks (e.g. myri10g + IB + ...), but in the GPGPU era you can have a GPGPU + NICs used by the same process, in the case of a GPU-enabled cluster.

Sorry for such a lengthy mail, I hope it was not too unclear and that you are now convinced we DO want the mmu notifiers :)

Cheers,
Cédric

--
To unsubscribe from this list: send an email with "unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ
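For completeness, here is a minimal user-space sketch of the hook/interposition workaround mentioned above, wrapping munmap() via dlsym(RTLD_NEXT, ...). reg_cache_flush() is a hypothetical library function; the interposition technique itself is standard, and its limits (static linking, two libraries fighting over the same hook) are exactly the problems described in the mail:

	#define _GNU_SOURCE
	#include <dlfcn.h>
	#include <sys/mman.h>

	/* Hypothetical: drop any pinned translation covering [addr, addr+len). */
	void reg_cache_flush(void *addr, size_t len);

	/* The network library's munmap() shadows the libc one, so the cache
	 * can be flushed before the mapping really goes away. */
	int munmap(void *addr, size_t len)
	{
		static int (*real_munmap)(void *, size_t);

		if (!real_munmap)
			real_munmap = (int (*)(void *, size_t))
					dlsym(RTLD_NEXT, "munmap");

		reg_cache_flush(addr, len);

		return real_munmap(addr, len);
	}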