Re: Need help with the mmu notifiers patches.

On Sun, Jul 27, 2008 at 9:24 PM, Cédric Augonnet
<cedric.augonnet@xxxxxxxxx> wrote:
> 2008/7/27 Mulyadi Santosa <mulyadi.santosa@xxxxxxxxx>:
>> Hi...
>>
>> On Sun, Jul 27, 2008 at 12:49 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>> Hi, Arn,
>>>
>>> so have you done anything so far?   Frankly, I don't understand the
>>> purpose of the MMU notifier patch - can anyone explain?
>>>
>>> For example, one particular explanation is given here:
>>>
>>> http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-01/msg04810.html
>>>
>>> So my question is: is the "shadow pagetable" referring to the
>>> KVM/Xen shadow pagetable?   The Linux kernel without virtualization
>>> does not have a "shadow pagetable", right?
>>
>> Yes, the shadow pagetable is a "fake pgd/pmd/pte" created by the
>> hypervisor to mimic the work of the MMU on real hardware.
>>
>> So, by using the MMU notifier, the best I can conclude after quickly
>> reading Andrea's message is that it lets you notice when a certain
>> memory region is freed from the shadow pagetables.
>>
>> regards,
>>
>> Mulyadi.
>>
>
> Hi,
>
> High performance networks are also a perfect target. Given the need
> to reduce CPU usage and to avoid useless, costly copies, high speed
> network interface cards (NICs) DMA data directly from or to
> user-space memory (these are usually called "zero-copy" protocols,
> and RDMA is a form of zero-copy protocol).
>
> Imagine that a process wants to send some data to a network card:
> from the application's perspective, this data is for instance
> described as a virtual address interval (start virtual address +
> length). Unfortunately, most NICs cannot manipulate virtual
> addresses, only addresses in the PCI space.
>
> So in order to send a message, the network library would have to do
> the following:
> 1 - Do a system call into the NIC driver.
> 2 - For each page in the virtual address range, perform a
> "get_user_pages" which will translate the virtual address into a PCI
> address the NIC can manipulate. get_user_pages will also "pin" the
> memory so that it cannot be swapped out, for instance. This is also
> called memory registration or memory pinning.
> 3 - The NIC is given the list of physical addresses (in PCI space)
> so that it may DMA all the data directly from the application's
> address space.
> 4 - When the communication is finished, the driver puts all the pages
> back so that they can be reclaimed if needed.
>
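
Trying to make sure I follow steps 2 and 3: if I understand correctly,
the kernel side of the registration looks roughly like the sketch
below. This is just my untested understanding - my_nic_register_buffer()
is a name I made up, and the exact get_user_pages()/dma_map_page()
signatures differ between kernel versions:

/* Hedged sketch of a driver's buffer-registration path.
 * my_nic_register_buffer() is a made-up name; get_user_pages(),
 * dma_map_page(), set_page_dirty_lock() and put_page() are real
 * kernel APIs, but their signatures vary across kernel versions. */
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/pagemap.h>
#include <linux/dma-mapping.h>

static int my_nic_register_buffer(struct device *dev,
                                  unsigned long uaddr, size_t len,
                                  struct page **pages,
                                  dma_addr_t *dma_addrs)
{
        int nr_pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
        int i, got;

        /* Step 2: pin the user pages so they cannot be swapped out
         * or migrated while the NIC is DMAing to/from them. */
        down_read(&current->mm->mmap_sem);
        got = get_user_pages(current, current->mm, uaddr & PAGE_MASK,
                             nr_pages, 1 /* write */, 0 /* force */,
                             pages, NULL);
        up_read(&current->mm->mmap_sem);
        if (got < nr_pages)
                goto release;

        /* Step 3: produce bus (PCI) addresses the NIC can use. */
        for (i = 0; i < nr_pages; i++)
                dma_addrs[i] = dma_map_page(dev, pages[i], 0, PAGE_SIZE,
                                            DMA_BIDIRECTIONAL);
        return nr_pages;

release:
        /* Step 4 (error path here): unpin whatever we managed to get. */
        for (i = 0; i < got; i++) {
                set_page_dirty_lock(pages[i]);
                put_page(pages[i]);
        }
        return -EFAULT;
}
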
> Now imagine that for some reason, an application always reuses the
> same buffers (this is rather likely). Continuously registering and
> unregistering the very same buffer is a serious overhead we cannot
> cope with in the context of high performance networking.
>
> To avoid that, various people have started using a 'registration
> cache' to avoid useless memory (de)registration. The last n buffers
> are kept pinned, so when the NIC needs those buffers again, it
> already has a virtual-to-physical address translation, which greatly
> reduces the system overhead for that communication.
>
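
So, if I follow, the registration cache on the library side is
basically a lookup keyed by (address, length) that only falls back to
the driver on a miss. Something like this rough user-space sketch,
where driver_register() and the struct are names I invented:

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wrapper around the driver's registration ioctl:
 * pins [addr, addr+len) and returns a handle the NIC understands. */
extern uint64_t driver_register(void *addr, size_t len);

struct reg_cache_entry {
        void                   *addr;        /* start of pinned buffer  */
        size_t                  len;         /* length of pinned buffer */
        uint64_t                nic_handle;  /* handle from the driver  */
        struct reg_cache_entry *next;
};

static struct reg_cache_entry *cache_head;

/* Return a cached registration if this buffer is already pinned;
 * otherwise register (pin) it through the driver and cache it. */
static struct reg_cache_entry *get_registration(void *addr, size_t len)
{
        struct reg_cache_entry *e;

        for (e = cache_head; e; e = e->next)
                if (e->addr == addr && e->len == len)
                        return e;  /* cache hit: no syscall, no re-pinning */

        e = malloc(sizeof(*e));
        e->addr = addr;
        e->len = len;
        e->nic_handle = driver_register(addr, len);
        e->next = cache_head;
        cache_head = e;
        return e;
}
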
> Unfortunately, imagine the following scenario: process P maps file F1
> in buffer B, P sends B over the network (B is still pinned), P unmaps

Sorry... I don't understand this part. Pinning the buffer and
unmapping are done in sync, right? In other words, pinning is done in
the kernel - the VM layer (or your driver) - and then each process
maps and unmaps independently, in userspace.

So if you unmap the memory, the memory is immediately gone - no
longer in the process's VMA linked list (inside the kernel) - so when
the swapper goes through the VMA linked list, there is nothing for it
to consider swapping out, right? So what is the meaning of the buffer
being in a "pinned" state here?

> F1 and maps F2 in B. P sends B again.
> At that point the NIC assumes the physical addresses of B are still
> valid: it may therefore actually send the content of F1, or more
> likely junk ... which you absolutely *never* want, of course.
>
> So would you drop zero-copy protocols? That would be a real waste.
> Instead, we need a way to monitor changes in the virtual memory
> mapping so that the networking lib may invalidate the buffers when
> needed. There are currently various approaches to that problem:
> For glibc users (that's already a restriction), just use the glibc
> hooks: when you call a remap, a free or whatever which *may* change
> the mapping, you first call a hook that will keep the cache valid.

OK... so this means the unmapping operation is not the normal
unmapping but a modified one: the buffer (which is a userspace
virtual address range) is delinked from process P's VMA linked list
(in the kernel), but the content is still there, and nobody owns it.
The next time process P tries to map it again, it will reuse the same
buffer, and because the TLB has not been flushed, the translation
will be the same, i.e. the virtual address used to access the
physical content will be the same again. Is that correct?

But the problem is: how does glibc's remap() operation know which
cached buffer to reuse in its remapping operation? There could be
many unmapping operations done before, all of which would generate a
cached copy of the buffer without really doing any unmapping, right?

> This is often not possible, as some applications are already
> statically linked with glibc, so you cannot use those hooks
> afterwards. Also, some applications are themselves using such hooks,
> so the network lib cannot overwrite them... And imagine you have
> several network drivers which all need to use the hooks (a process
> using both IB and myri10g, for instance).
>
> For all those reasons, we do need a facility to monitor changes in
> the virtual memory mapping in the context of high performance
> networking, or even for GPGPUs or any device which relies heavily on
> DMA transfers between the device and user-space.
>
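
Just to check my understanding of what the mmu notifier patches would
give such a driver: you register a set of callbacks on the process's
mm, and the VM calls you back before it changes a range, so the driver
can unpin the pages and drop its cached translations. A rough,
untested sketch - my_nic_drop_registrations() is made up, and the
exact callback signatures may differ between versions of the patch
set:

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* Made-up helper: stop DMA, unpin and forget every cached
 * registration overlapping [start, end). */
static void my_nic_drop_registrations(unsigned long start,
                                      unsigned long end);

static void my_nic_invalidate_range_start(struct mmu_notifier *mn,
                                          struct mm_struct *mm,
                                          unsigned long start,
                                          unsigned long end)
{
        /* The VM is about to change [start, end) in this mm: our
         * cached translations for that range are no longer valid. */
        my_nic_drop_registrations(start, end);
}

static const struct mmu_notifier_ops my_nic_mmu_ops = {
        .invalidate_range_start = my_nic_invalidate_range_start,
        /* .invalidate_range_end, .invalidate_page, .release, ... */
};

static struct mmu_notifier my_nic_mn = {
        .ops = &my_nic_mmu_ops,
};

static int my_nic_watch_mm(struct mm_struct *mm)
{
        /* Register once per mm we pin pages from; from then on the
         * VM calls us back whenever that address space changes. */
        return mmu_notifier_register(&my_nic_mn, mm);
}
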
> I was mentioning heterogeneous networks (e.g. myri10g + IB + ...),
> but in the GPGPU era you have GPGPUs + NICs possibly used by the same
> process in the case of a GPU-enabled cluster.
>
> Sorry for such a lengthy mail; I hope it was not too unclear and that
> you are now convinced we DO want the mmu notifiers :)
>

Thanks for sharing this....

> Cheers,
> Cédric
>

Are there any references I can read for more knowledge in this
area?   I think I must be wrong somewhere, as all this sounds very new to me.

-- 
Regards,
Peter Teoh

--
To unsubscribe from this list: send an email with
"unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ
