On 14/03/18 10:20, Miklos Szeredi wrote:
> On Tue, Mar 13, 2018 at 7:56 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>> On Tue, Mar 13, 2018 at 07:15:46PM +0200, Boaz Harrosh wrote:
>>> On a call to mmap, an mmap provider (such as an FS) can put
>>> this flag on vma->vm_flags.
>>>
>>> It tells the kernel that the vma will only ever be used from a
>>> single core, so invalidating its PTE(s) does not need a
>>> cross-CPU TLB shootdown.
>>>
>>> The motivation for this flag is the ZUFS project, where we want
>>> to optimally map user-application buffers into a user-mode server,
>>> execute the operation, and efficiently unmap.
>>
>> I've been looking at something similar, and I prefer my approach,
>> although I'm not nearly as far along with my implementation as you are.
>>
>> My approach is also to add a vm_flags bit, tentatively called VM_NOTLB.
>> The page fault handler refuses to insert any TLB entries into the process
>> address space. But follow_page_mask() will return the appropriate struct
>> page for it. This should be enough for O_DIRECT accesses to work as
>> you'll get the appropriate scatterlists built.
>>
>> I suspect Boaz has already done a lot of thinking about this and doesn't
>> need the explanation, but here's how it looks for anyone following along
>> at home:
>>
>> Process A calls read().
>> Kernel allocates a page cache page for it and calls the filesystem through
>> ->readpages (or ->readpage).
>> Filesystem calls the managing process to get the data for that page.
>> Managing process draws a pentagram and summons Beelzebub (or runs Perl;
>> whichever you find more scary).
>> Managing process notifies the filesystem that the page is now full of data.
>> Filesystem marks the page as being Uptodate and unlocks it.
>> Process was waiting on the page lock, wakes up and copies the data from the
>> page cache into userspace. read() is complete.
>>
>> What we're concerned about here is what to do after the managing process
>> tells the kernel that the read is complete. Clearly allowing the managing
>> process continued access to the page is Bad as the page may be freed by the
>> page cache and then reused for something else. Doing a TLB shootdown is
>> expensive. So Boaz's approach is to have the process promise that it won't
>> have any other thread look at it. My approach is to never allow the page
>> to have load/store access from userspace; it can only be passed to other
>> system calls.
>

Hi Matthew, hi Miklos,

Thank you for looking at this. I'm answering both Matthew's and Miklos's
mails in one thread, by trying to explain something you might not have
completely wrapped your heads around yet.

Matthew first:

Please note that in the ZUFS system there are no page faults involved at
all. (God, no - a fault is a minimum of +40us, and I'm fighting to shave
off 13us.)

In ZUF-to-ZUS communication, a command comes in:
A1. We punch the pages into the per-core VMA before they are used.
A2. We return to user space and access these pages once
    (without any page faults).
A3. We return to the kernel and punch a drain page into that spot.

A new command comes in:
B1. We punch the new pages into the same per-core VMA before they are used.
B2. We return to user space and access these new pages once.
B3. We return to the kernel and punch a drain page into that spot.

Actually I could skip A3/B3 altogether, but in my testing (after this
patch) it costs nothing, so I like the extra easiness. (Otherwise there is
a dance I need to do when the app or server crashes and files start to
close: I need to scan the VMAs and zap them.)

The current mm mapping code (at insert_pfn) will fail at B1 above, because
it wants to see a zeroed, empty spot before inserting a new pte. What the
mm code wants is that I first do A3 - call zap_vma_ptes(vma). That is
because if the spot was not zero, there was a previous mapping there, and
some other core might have cached that entry in its TLB.
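To make the A1-A3/B1-B3 cycle above concrete, here is a rough sketch of
the kernel-side dispatch it implies. The names (zuf_dispatch_sketch,
zt->vma, zt->map_addr, zt->drain_page, run_zus_command) are illustrative,
not the actual ZUF code; it uses the vm_insert_pfn(vma, addr, pfn)
interface this thread refers to:

```c
/* Sketch only - illustrative names, not the actual ZUF code. */
static void zuf_dispatch_sketch(struct zuf_thread *zt,
				struct page **pages, int n)
{
	int i;

	/* A1/B1: punch the app's pages into the per-core VMA.
	 * With VM_LOCAL_CPU this may overwrite a non-empty pte
	 * without a preceding zap_vma_ptes(). */
	for (i = 0; i < n; ++i)
		vm_insert_pfn(zt->vma, zt->map_addr + i * PAGE_SIZE,
			      page_to_pfn(pages[i]));

	run_zus_command(zt);	/* A2/B2: server touches the pages once */

	/* A3/B3: punch the drain page into the same spots. */
	for (i = 0; i < n; ++i)
		vm_insert_pfn(zt->vma, zt->map_addr + i * PAGE_SIZE,
			      page_to_pfn(zt->drain_page));
}
```

With stock insert_pfn, the B1 punch would fail (the pte is not empty)
unless zap_vma_ptes() had been called first; relaxing that check for
VM_LOCAL_CPU vmas is what the one-liner under discussion does.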
When I punch in the new value, that other core could then access the old
page while this core is accessing the new page. (TLB invalidation is a
single-core operation, which is why zap_vma_ptes() needs to schedule work
on all cores so that each one invalidates its own TLB.)

And that is the whole difference between the two tests above: with the new
(one-liner) code I do not zap_vma_ptes().

Please note that the VM_LOCAL_CPU flag is not set by the application (the
zus server) but by the kernel driver, telling the kernel: "I have enforced
an API such that this VMA is accessed from a single core only, so please
allow me B1 - I know what I'm doing." (Also, we do put some trust in zus,
because it holds our filesystem data and because we wrote it ;-))

I understand your approach, which says: "The PTE table is just a global
communicator of pages, but is not really mapped into any process, i.e.
never faulted into any core's local TLB." (The kernel accesses that memory
through a kernel address, via another TLB entry.) And that is why you too
can get away without zap_vma_ptes(vma).

So is this not the same thing? Your flag says "no TLB has cached this
PTE"; my flag says "only this core has cached this PTE". We both ask:
"So please skip the zap_vma_ptes(vma) stage for me."

I think you might be able to use my flag for your system. It is only a
small part of what you need, next to all the "get the page from the PTE"
machinery and so on. But the "please skip zap_vma_ptes(vma)" part is this
patch here, no?

BTW, I did not understand at all what your project is trying to solve.
Please send me some notes about it; I want to see if the two might fit
after all.

> This all seems to revolve around the fact that userspace fs server
> process needs to copy something into userspace client's buffer, right?
>
> Instead of playing with memory mappings, why not just tell the kernel
> *what* to copy?
>
> While in theory not as generic, I don't see any real limitations (you
> don't actually need the current contents of the buffer in the read
> case and vice versa in the write case).
>

This is not so easy, for many reasons. It was actually my first approach,
which I pursued for a while, but I dropped it for the easier to implement
and more general approach here.

Note that we actually do exactly that in the implementation of mmap: there
is a ZUS_OP_GET_BLOCK, which returns a dpp_t of a page to map into the
application's VM. We could just copy it at that point.

But we have app buffers arriving with pointers local to one VM (the
app's), and then we want to copy them into another app's buffers. How do
you do that? You need to get_user_pages() so they can be accessed from
the kernel, switch to the second VM, and then receive pointers there.
These need to be either dpp_t-like games such as the ones I play, or a
copy_user_to_page() in the app's context.

But that API was not enough for me, because it is only good with pmem.
What if I actually want the data from disk or network? With my API you
can do that easily, still without any copy or caching. It is not in this
RFC, but there is a plan (my very next todo) for an ASYNC operation mode
alongside the sync one: zus tells ZUF "ASYNC please - the data you wanted
is on slow media and I need to sleep". The request is put on hold and
completed in the background; an async thread later calls in to complete
the command. Note that in that case we will do zap_vma_ptes(vma), and we
are back to square one - but there the cost of zap_vma_ptes(vma) is
surely acceptable.

There was also a very big locking problem with the OP_GET_BLOCK approach:
while a copy is being made, the FS needs to lock access to that same page
in many kinds of scenarios. Just a few examples:

1. COW write - a concurrent reader should see the old data.
2. Unwritten-buffer write - a concurrent reader should see zeros, which
   means I need to write zeros first, before letting reads in. (Grrr,
   this is the current DAX code. I know how to do better.)
3. Tier-down - I want to write a page to slow media and reuse it, and
   must not allow this while the page is being accessed.

And many more.
So in all these cases the API would need to be OP_GET_BLOCK /
OP_PUT_BLOCK, which is two round trips. Very slow.

And especially in the network or from-device case, the zus server would
then have all this buffer-cache management and lifetime hell, because it
needs to read the data into memory somewhere before it can present the
page back to the kernel - and there is a copy for you. With my API you
can network directly into the app's buffers; they are right there, why
not use them? (Did I say zero copy? ;-))

Also, for pseudo-FS application servers, say something like MySQL-5,
OP_GET_BLOCK would create a big memory management problem, whereas now we
can just write directly into the app's buffers - again with zero copy.

Please note that with this API it will be very easy to also support the
page cache, for FSs that want it, like the network FSs you mentioned.
Such an FS would set a bit in the fs_register call to say that it would
rather use the page cache, and would run on a different kind of BDI which
says "yes, page cache please". All the IO entry vectors would point to
the generic_iter API, and instead we would implement read/write_pages().
At read/write_pages() we do the exact same OP_READ/OP_WRITE as today: map
the cache pages into the zus VM, dispatch, return, release the page lock,
and all is happy. Anyone wanting to contribute this is very welcome.

In that first approach I did have plans to keep a cache of OP_GET_BLOCKs
on the radix tree, and have the server recall those blocks when needed.
But that called for a lot of locking on the hot path, and was much, much
more complicated, bigger code. Here we have completely lockless code,
with zero synchronization between cores. With the one-liner of this
patch, even the whole vma mapping is lockless. And it is so very simple,
with a huge gain and no loss.

Because... you said above: "Instead of playing with memory mappings". But
if you look at the amount of code, even compared to a pipe or splice, you
will see that this "playing with memory mappings" is very easy and
simple.
It might be a new approach that is hard to grasp, but it is harder as a
new concept than as actual code complexity. All I actually do is:

1. Allocate a vma per core.
2. Call vm_insert_pfn().
   ... do something ...
3. Call vm_insert_pfn(NULL) (before this patch: zap_vma_ptes()).

It is all very simple really. For me it is the opposite: why mess around
with dual-port pointers, caching, and copy lifetime rules, when you can
just call vm_insert_pfn()?

> And we already have an interface for this: splice(2). What am I
> missing? What's the killer argument in favor of the above messing
> with tlb caches etc, instead of just letting the kernel do the dirty
> work.
>

You answered it yourself: we are the kernel, and we are doing the
(simple) work. If you look at all this from afar, the zus-core with its
Z-thread array is just a fancy pipe really - a zero-copy pipe.

Being a splice API gives us nothing; it would have the same problems as
above. splice basically says: party A, show me your buffers; party B,
show me yours; and I can copy between them in the kernel. Usually one of
A or B is kernel buffers or a DMA target. So this case is very much like
OP_GET_BLOCK: you have lifetime problems. And if you use the directly
mmapped pipe like you discussed with Matthew, then you are back to this
exact problem, and with the current API you can avoid neither the
zap_vma_ptes() nor the actual page faults after the mmap. So you are
looking at 60us minimum, while I have the whole round trip at 4.6us -
and I believe I can cut it down to 3.5us by fixing that Relay object.

I have researched this for a while. I do not believe there is a more
robust way, and this one-liner is certainly not complexity either.

> Thanks,
> Miklos
>

I hope this sheds some light on the matter.

Thank you,
Boaz