On Wed, Mar 14, 2018 at 12:17 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> On Wed, Mar 14, 2018 at 09:20:57AM +0100, Miklos Szeredi wrote:
>> On Tue, Mar 13, 2018 at 7:56 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>> > On Tue, Mar 13, 2018 at 07:15:46PM +0200, Boaz Harrosh wrote:
>> >> On a call to mmap, an mmap provider (such as an FS) can set
>> >> this flag in vma->vm_flags.
>> >>
>> >> This tells the kernel that the vma will be used from a single
>> >> core only, so invalidation of its PTE(s) does not require a
>> >> system-wide TLB shootdown.
>> >>
>> >> The motivation for this flag is the ZUFS project, where we want
>> >> to optimally map user-application buffers into a user-mode server,
>> >> execute the operation, and efficiently unmap.
>> >
>> > I've been looking at something similar, and I prefer my approach,
>> > although I'm not nearly as far along with my implementation as you are.
>> >
>> > My approach is also to add a vm_flags bit, tentatively called VM_NOTLB.
>> > The page fault handler refuses to insert any TLB entries into the process
>> > address space, but follow_page_mask() will return the appropriate struct
>> > page for it. This should be enough for O_DIRECT accesses to work, as
>> > you'll get the appropriate scatterlists built.
>> >
>> > I suspect Boaz has already done a lot of thinking about this and doesn't
>> > need the explanation, but here's how it looks for anyone following along
>> > at home:
>> >
>> > Process A calls read().
>> > Kernel allocates a page cache page for it and calls the filesystem through
>> > ->readpages (or ->readpage).
>> > Filesystem calls the managing process to get the data for that page.
>> > Managing process draws a pentagram and summons Beelzebub (or runs Perl;
>> > whichever you find more scary).
>> > Managing process notifies the filesystem that the page is now full of data.
>> > Filesystem marks the page as being Uptodate and unlocks it.
>> > Process was waiting on the page lock, wakes up and copies the data from
>> > the page cache into userspace. read() is complete.
>> >
>> > What we're concerned about here is what to do after the managing process
>> > tells the kernel that the read is complete. Clearly, allowing the managing
>> > process continued access to the page is Bad, as the page may be freed by
>> > the page cache and then reused for something else. Doing a TLB shootdown
>> > is expensive. So Boaz's approach is to have the process promise that it
>> > won't have any other thread look at it. My approach is to never allow the
>> > page to have load/store access from userspace; it can only be passed to
>> > other system calls.
>>
>> This all seems to revolve around the fact that the userspace fs server
>> process needs to copy something into the userspace client's buffer, right?
>>
>> Instead of playing with memory mappings, why not just tell the kernel
>> *what* to copy?
>>
>> While in theory not as generic, I don't see any real limitations (you
>> don't actually need the current contents of the buffer in the read
>> case, and vice versa in the write case).
>>
>> And we already have an interface for this: splice(2). What am I
>> missing? What's the killer argument in favor of the above messing
>> with TLB caches etc., instead of just letting the kernel do the dirty
>> work?
>
> Great question. You're completely right that the question is how to tell
> the kernel what to copy. The problem is that splice() can only write to
> the first page of a pipe. So you need one pipe per outstanding request,
> which can easily turn into thousands of file descriptors. If we enhanced
> splice() so it could write to any page in a pipe, then I think splice()
> would be the perfect interface.

I don't know your use case, but AFAICT zufs will have one queue per CPU.
Having one pipe per CPU doesn't sound too bad.

But yeah, there's plenty of room for improvement in the splice interface.
Just needs a killer app like this :)

Thanks,
Miklos