On Wed, Mar 14, 2018 at 09:20:57AM +0100, Miklos Szeredi wrote:
> On Tue, Mar 13, 2018 at 7:56 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > On Tue, Mar 13, 2018 at 07:15:46PM +0200, Boaz Harrosh wrote:
> >> On a call to mmap an mmap provider (like an FS) can put
> >> this flag on vma->vm_flags.
> >>
> >> This tells the Kernel that the vma will be used from a single
> >> core only and therefore invalidation of PTE(s) need not a
> >> wide CPU scheduling
> >>
> >> The motivation of this flag is the ZUFS project where we want
> >> to optimally map user-application buffers into a user-mode-server
> >> execute the operation and efficiently unmap.
> >
> > I've been looking at something similar, and I prefer my approach,
> > although I'm not nearly as far along with my implementation as you are.
> >
> > My approach is also to add a vm_flags bit, tentatively called VM_NOTLB.
> > The page fault handler refuses to insert any TLB entries into the process
> > address space. But follow_page_mask() will return the appropriate struct
> > page for it. This should be enough for O_DIRECT accesses to work as
> > you'll get the appropriate scatterlists built.
> >
> > I suspect Boaz has already done a lot of thinking about this and doesn't
> > need the explanation, but here's how it looks for anyone following along
> > at home:
> >
> > Process A calls read().
> > Kernel allocates a page cache page for it and calls the filesystem through
> > ->readpages (or ->readpage).
> > Filesystem calls the managing process to get the data for that page.
> > Managing process draws a pentagram and summons Beelzebub (or runs Perl;
> > whichever you find more scary).
> > Managing process notifies the filesystem that the page is now full of data.
> > Filesystem marks the page as being Uptodate and unlocks it.
> > Process was waiting on the page lock, wakes up and copies the data from the
> > page cache into userspace. read() is complete.
> >
> > What we're concerned about here is what to do after the managing process
> > tells the kernel that the read is complete. Clearly allowing the managing
> > process continued access to the page is Bad as the page may be freed by the
> > page cache and then reused for something else. Doing a TLB shootdown is
> > expensive. So Boaz's approach is to have the process promise that it won't
> > have any other thread look at it. My approach is to never allow the page
> > to have load/store access from userspace; it can only be passed to other
> > system calls.
>
> This all seems to revolve around the fact that userspace fs server
> process needs to copy something into userspace client's buffer, right?
>
> Instead of playing with memory mappings, why not just tell the kernel
> *what* to copy?
>
> While in theory not as generic, I don't see any real limitations (you
> don't actually need the current contents of the buffer in the read
> case and vice versa in the write case).
>
> And we already have an interface for this: splice(2). What am I
> missing? What's the killer argument in favor of the above messing
> with tlb caches etc, instead of just letting the kernel do the dirty
> work.

Great question. You're completely right that the question is how to tell
the kernel what to copy. The problem is that splice() can only write to
the first page of a pipe. So you need one pipe per outstanding request,
which can easily turn into thousands of file descriptors. If we enhanced
splice() so it could write to any page in a pipe, then I think splice()
would be the perfect interface.
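
For anyone following along at home, here is a minimal sketch of what
"telling the kernel what to copy" looks like with splice(2) as it exists
today. It is an illustration only, not code from ZUFS or FUSE: the helper
name splice_copy() is hypothetical, and it assumes in_fd and out_fd are
ordinary file descriptors that support splice and that out_fd was not
opened with O_APPEND. The helper creates a private pipe on every call,
which is the per-request pipe cost described above.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from any existing project): move up to len
 * bytes from in_fd to out_fd through a private pipe, so the data never
 * passes through a userspace buffer owned by the caller.
 * Returns the number of bytes moved, or -1 on error.
 */
ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
	int p[2];
	ssize_t total = 0;

	/* One pipe per call: two extra file descriptors per request. */
	if (pipe(p) < 0)
		return -1;

	while ((size_t)total < len) {
		/* Source -> pipe. SPLICE_F_MOVE is only a hint; the
		 * kernel may still copy the pages. */
		ssize_t n = splice(in_fd, NULL, p[1], NULL,
				   len - (size_t)total, SPLICE_F_MOVE);
		if (n < 0) {
			total = -1;
			break;
		}
		if (n == 0)
			break;	/* end of input */

		/* Pipe -> destination; drain everything just queued. */
		ssize_t left = n;
		while (left > 0) {
			ssize_t m = splice(p[0], NULL, out_fd, NULL,
					   (size_t)left, SPLICE_F_MOVE);
			if (m <= 0) {
				total = -1;
				goto out;
			}
			left -= m;
		}
		total += n;
	}
out:
	close(p[0]);
	close(p[1]);
	return total;
}
```

A server would call something like splice_copy(data_fd, client_fd, len)
once per request. With thousands of requests in flight, each holding its
own pipe, the file descriptor count grows accordingly.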