On 14/03/18 10:20, Miklos Szeredi wrote:
> On Tue, Mar 13, 2018 at 7:56 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>> On Tue, Mar 13, 2018 at 07:15:46PM +0200, Boaz Harrosh wrote:
>>> On a call to mmap, an mmap provider (such as an FS) can put
>>> this flag on vma->vm_flags.
>>>
>>> It tells the kernel that the vma will only ever be used from a
>>> single core, so invalidating its PTE(s) does not need a
>>> cross-CPU TLB shootdown.
>>>
>>> The motivation for this flag is the ZUFS project, where we want
>>> to optimally map user-application buffers into a user-mode server,
>>> execute the operation, and efficiently unmap.
>>
>> I've been looking at something similar, and I prefer my approach,
>> although I'm not nearly as far along with my implementation as you are.
>>
>> My approach is also to add a vm_flags bit, tentatively called VM_NOTLB.
>> The page fault handler refuses to insert any TLB entries into the process
>> address space. But follow_page_mask() will return the appropriate struct
>> page for it. This should be enough for O_DIRECT accesses to work as
>> you'll get the appropriate scatterlists built.
>>
>> I suspect Boaz has already done a lot of thinking about this and doesn't
>> need the explanation, but here's how it looks for anyone following along
>> at home:
>>
>> Process A calls read().
>> Kernel allocates a page cache page for it and calls the filesystem through
>> ->readpages (or ->readpage).
>> Filesystem calls the managing process to get the data for that page.
>> Managing process draws a pentagram and summons Beelzebub (or runs Perl;
>> whichever you find more scary).
>> Managing process notifies the filesystem that the page is now full of data.
>> Filesystem marks the page as being Uptodate and unlocks it.
>> Process was waiting on the page lock, wakes up and copies the data from the
>> page cache into userspace. read() is complete.
>>
>> What we're concerned about here is what to do after the managing process
>> tells the kernel that the read is complete. Clearly allowing the managing
>> process continued access to the page is Bad as the page may be freed by the
>> page cache and then reused for something else. Doing a TLB shootdown is
>> expensive. So Boaz's approach is to have the process promise that it won't
>> have any other thread look at it. My approach is to never allow the page
>> to have load/store access from userspace; it can only be passed to other
>> system calls.
>

Hi Matthew, hi Miklos,

Thank you for looking at this. I'm answering both Matthew's and Miklos's
mails in one thread, by trying to explain something you might not have
completely wrapped your heads around yet.

Matthew first:

Please note that in the ZUFS system there are no page faults involved at
all. (God, no - a fault is a minimum of +40us, and I'm fighting to shave
off 13us.)

In ZUF-to-ZUS communication, a command comes in:
A1. We punch the pages into the per-core VMA before they are used.
A2. We return to user space and access these pages once
    (without any page faults).
A3. We return to the kernel and punch a drain page into that spot.

A new command comes in:
B1. We punch the new pages into the same per-core VMA before they are used.
B2. We return to user space and access these new pages once.
B3. We return to the kernel and punch a drain page into that spot.

Actually I could skip A3/B3 altogether, but in my testing (after this
patch) it costs nothing, so I like the extra easiness. (Otherwise there is
a dance I need to do when the app or server crashes and files start to
close: I need to scan the VMAs and zap them.)

The current mm mapping code (at insert_pfn) will fail at B1 above, because
it wants to see a zeroed, empty spot before inserting a new pte. What the
mm code wants is that I first do A3 - call zap_vma_ptes(vma). That is
because if the spot was not zero, there was a previous mapping there, and
some other core might have cached that entry in its TLB.
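To make the A1-A3/B1-B3 cycle above concrete, here is a rough sketch of
the kernel-side dispatch it implies. The names (zuf_dispatch_sketch,
zt->vma, zt->map_addr, zt->drain_page, run_zus_command) are illustrative,
not the actual ZUF code; it uses the vm_insert_pfn(vma, addr, pfn)
interface this thread refers to:

```c
/* Sketch only - illustrative names, not the actual ZUF code. */
static void zuf_dispatch_sketch(struct zuf_thread *zt,
				struct page **pages, int n)
{
	int i;

	/* A1/B1: punch the app's pages into the per-core VMA.
	 * With VM_LOCAL_CPU this may overwrite a non-empty pte
	 * without a preceding zap_vma_ptes(). */
	for (i = 0; i < n; ++i)
		vm_insert_pfn(zt->vma, zt->map_addr + i * PAGE_SIZE,
			      page_to_pfn(pages[i]));

	run_zus_command(zt);	/* A2/B2: server touches the pages once */

	/* A3/B3: punch the drain page into the same spots. */
	for (i = 0; i < n; ++i)
		vm_insert_pfn(zt->vma, zt->map_addr + i * PAGE_SIZE,
			      page_to_pfn(zt->drain_page));
}
```

With stock insert_pfn, the B1 punch would fail (the pte is not empty)
unless zap_vma_ptes() had been called first; relaxing that check for
VM_LOCAL_CPU vmas is what the one-liner under discussion does.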
When I punch in the new value, that other core could then access the old
page while this core is accessing the new page. (TLB invalidation is a
single-core operation, which is why zap_vma_ptes() needs to schedule work
on all cores so that each one invalidates its own TLB.)

And that is the whole difference between the two tests above: with the new
(one-liner) code I do not zap_vma_ptes().

Please note that the VM_LOCAL_CPU flag is not set by the application (the
zus server) but by the kernel driver, telling the kernel: "I have enforced
an API such that this VMA is accessed from a single core only, so please
allow me B1 - I know what I'm doing." (Also, we do put some trust in zus,
because it holds our filesystem data and because we wrote it ;-))

I understand your approach, which says: "The PTE table is just a global
communicator of pages, but is not really mapped into any process, i.e.
never faulted into any core's local TLB." (The kernel accesses that memory
through a kernel address, via another TLB entry.) And that is why you too
can get away without zap_vma_ptes(vma).

So is this not the same thing? Your flag says "no TLB has cached this
PTE"; my flag says "only this core has cached this PTE". We both ask:
"So please skip the zap_vma_ptes(vma) stage for me."

I think you might be able to use my flag for your system. It is only a
small part of what you need, next to all the "get the page from the PTE"
machinery and so on. But the "please skip zap_vma_ptes(vma)" part is this
patch here, no?

BTW, I did not understand at all what your project is trying to solve.
Please send me some notes about it; I want to see if the two might fit
after all.

> This all seems to revolve around the fact that userspace fs server
> process needs to copy something into userspace client's buffer, right?
>
> Instead of playing with memory mappings, why not just tell the kernel
> *what* to copy?
>
> While in theory not as generic, I don't see any real limitations (you
> don't actually need the current contents of the buffer in the read
> case and vice versa in the write case).
>

This is not so easy, for many reasons. It was actually my first approach,
which I pursued for a while, but I dropped it for the easier to implement
and more general approach here.

Note that we actually do exactly that in the implementation of mmap: there
is a ZUS_OP_GET_BLOCK, which returns a dpp_t of a page to map into the
application's VM. We could just copy it at that point.

But we have app buffers arriving with pointers local to one VM (the
app's), and then we want to copy them into another app's buffers. How do
you do that? You need to get_user_pages() so they can be accessed from
the kernel, switch to the second VM, and then receive pointers there.
These need to be either dpp_t-like games such as the ones I play, or a
copy_user_to_page() in the app's context.

But that API was not enough for me, because it is only good with pmem.
What if I actually want the data from disk or network? With my API you
can do that easily, still without any copy or caching. It is not in this
RFC, but there is a plan (my very next todo) for an ASYNC operation mode
alongside the sync one: zus tells ZUF "ASYNC please - the data you wanted
is on slow media and I need to sleep". The request is put on hold and
completed in the background; an async thread later calls in to complete
the command. Note that in that case we will do zap_vma_ptes(vma), and we
are back to square one - but there the cost of zap_vma_ptes(vma) is
surely acceptable.

There was also a very big locking problem with the OP_GET_BLOCK approach:
while a copy is being made, the FS needs to lock access to that same page
in many kinds of scenarios. Just a few examples:

1. COW write - a concurrent reader should see the old data.
2. Unwritten-buffer write - a concurrent reader should see zeros, which
   means I need to write zeros first, before letting reads in. (Grrr,
   this is the current DAX code. I know how to do better.)
3. Tier-down - I want to write a page to slow media and reuse it, and
   must not allow this while the page is being accessed.

And many more.
So in all these cases the API would need to be OP_GET_BLOCK /
OP_PUT_BLOCK, which is two round trips. Very slow.

And especially in the network or from-device case, the zus server would
then have all this buffer-cache management and lifetime hell, because it
needs to read the data into memory somewhere before it can present the
page back to the kernel - and there is a copy for you. With my API you
can network directly into the app's buffers; they are right there, why
not use them? (Did I say zero copy? ;-))

Also, for pseudo-FS application servers, say something like MySQL-5,
OP_GET_BLOCK would create a big memory management problem, whereas now we
can just write directly into the app's buffers - again with zero copy.

Please note that with this API it will be very easy to also support the
page cache, for FSs that want it, like the network FSs you mentioned.
Such an FS would set a bit in the fs_register call to say that it would
rather use the page cache, and would run on a different kind of BDI which
says "yes, page cache please". All the IO entry vectors would point to
the generic_iter API, and instead we would implement read/write_pages().
At read/write_pages() we do the exact same OP_READ/OP_WRITE as today: map
the cache pages into the zus VM, dispatch, return, release the page lock,
and all is happy. Anyone wanting to contribute this is very welcome.

In that first approach I did have plans to keep a cache of OP_GET_BLOCKs
on the radix tree, and have the server recall those blocks when needed.
But that called for a lot of locking on the hot path, and was much, much
more complicated, bigger code. Here we have completely lockless code,
with zero synchronization between cores. With the one-liner of this
patch, even the whole vma mapping is lockless. And it is so very simple,
with a huge gain and no loss.

Because... you said above: "Instead of playing with memory mappings". But
if you look at the amount of code, even compared to a pipe or splice, you
will see that this "playing with memory mappings" is very easy and
simple.
It might be a new approach that is hard to grasp, but it is harder as a
new concept than as actual code complexity. All I actually do is:

1. Allocate a vma per core.
2. Call vm_insert_pfn().
   ... do something ...
3. Call vm_insert_pfn(NULL) (before this patch: zap_vma_ptes()).

It is all very simple really. For me it is the opposite: why mess around
with dual-port pointers, caching, and copy lifetime rules, when you can
just call vm_insert_pfn()?

> And we already have an interface for this: splice(2). What am I
> missing? What's the killer argument in favor of the above messing
> with tlb caches etc, instead of just letting the kernel do the dirty
> work.
>

You answered it yourself: we are the kernel, and we are doing the
(simple) work. If you look at all this from afar, the zus-core with its
Z-thread array is just a fancy pipe really - a zero-copy pipe.

Being a splice API gives us nothing; it would have the same problems as
above. splice basically says: party A, show me your buffers; party B,
show me yours; and I can copy between them in the kernel. Usually one of
A or B is kernel buffers or a DMA target. So this case is very much like
OP_GET_BLOCK: you have lifetime problems. And if you use the directly
mmapped pipe like you discussed with Matthew, then you are back to this
exact problem, and with the current API you can avoid neither the
zap_vma_ptes() nor the actual page faults after the mmap. So you are
looking at 60us minimum, while I have the whole round trip at 4.6us -
and I believe I can cut it down to 3.5us by fixing that Relay object.

I have researched this for a while. I do not believe there is a more
robust way, and this one-liner is certainly not complexity either.

> Thanks,
> Miklos
>

I hope this sheds some light on the matter.

Thank you,
Boaz