On Fri, Apr 3, 2015 at 11:42 PM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Mon, 30 Mar 2015 13:26:25 -0700 Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
>> d) fincore() is more expensive
>
> Actually, I kinda take that back. fincore() will be faster than
> preadv2() in the case of a pagecache miss, and slower in the case of a
> pagecache hit.
>
> The breakpoint appears to be a hit rate of 30% - if fewer than 30% of
> queries find the page in pagecache, fincore() will be faster than
> preadv2().

In my application (the motivation for this patch), in web-serving
applications (familiar to me), and in Samba, I'm going to guess that the
majority of requests will be cached and only some small percentage (say
20%) will be uncached. I'll add to that: a small percentage, but of a
large number of requests. A lot of IO falls into a Zipfian / sequential
pattern, and that makes sense to me: a small number of frequently
accessed files, plus large streaming data (with readahead).

> This is because for a pagecache miss, fincore() will be about twice as
> fast as preadv2(). For a pagecache hit, fincore()+pread() is 55%
> slower than preadv2(). If there are lots of misses, fincore() is
> faster overall.
>
> Minimal fincore() implementation is below. It doesn't implement the
> page_map!=NULL mode at all and will be slow for large areas - it needs
> to be taught about radix_tree_for_each_*(). But it's good enough for
> testing.

I'm glad you took the time to do this. It's simple, but your
implementation is much cleaner than the last round of fincore() patches
from 3 years back.

> On a slow machine, in nanoseconds:
>
> null syscall:         528
> fincore (miss):       674
> fincore (hit):        729
> single byte pread:   1026
> single byte preadv:  1134

I'm not surprised; fincore() doesn't have to go through all the vfs / fs
machinery that pread or preadv do. By chance, if you compare pread /
preadv with a larger read (say 4k), is the difference negligible?

> pread() is a bit faster than preadv() and samba uses pread(), so the
> implementations are:
>
> 	if (fincore(fd, NULL, offset, len) == len)
> 		pread();
> 	else
> 		punt();
>
> 	if (preadv2(fd, ..., offset, len) == len)
> 		...
> 	else
> 		punt();
>
> fincore+pread, pagecache-hit:   1755ns
> fincore+pread, pagecache-miss:   674ns
> preadv():                       1134ns (preadv2() will be a little faster for misses)
>
> Now, a pagecache hit rate of 30% sounds high so one would think that
> fincore+pread is clearly ahead. But the pagecache hit rate in this
> code will actually be quite high, because of readahead.
>
> For a large linear read of a file which is perfectly laid out on disk
> and is fully *uncached*, the hit rates will be as good as 99.8%,
> because readahead is bringing in data in 2MB blobs.
>
> In practice I expect that fincore()+pread() will be slower for linear
> reads of medium to large files and faster for small files and seeky
> accesses.
>
> How much does all this matter? Not much. On a fast machine a
> single-byte pread() takes 240ns. So if your server thread is handling
> 25000 requests/sec, we're only talking 0.6% overhead.
>
> Note that we can trivially monitor the hit rate with either preadv2()
> or fincore()+pread(): just count how many times all the data is there
> versus how many times it isn't.
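For what it's worth, here's roughly how I'd wire that counting into a
fincore()+pread() fast path on our side. This is an untested sketch with
the assumptions spelled out: __NR_fincore (325) is copied from the
x86-64 syscall table hunk in your patch below and only exists with that
patch applied, and punt() is just a stand-in for handing the request off
to a slow-path threadpool.

	#include <unistd.h>
	#include <sys/types.h>
	#include <sys/syscall.h>

	#define __NR_fincore 325	/* from the syscall_64.tbl hunk below */

	static unsigned long cache_hits, cache_misses;

	/* Thin wrapper; fincore() only exists with the patch applied. */
	static long fincore(int fd, unsigned char *page_map, off_t offset,
			    size_t len)
	{
		return syscall(__NR_fincore, fd, page_map, offset, len);
	}

	/* Stand-in: queue the request on the slow-path threadpool. */
	static void punt(int fd, char *buf, size_t len, off_t offset)
	{
	}

	static ssize_t serve_read(int fd, char *buf, size_t len, off_t offset)
	{
		if (fincore(fd, NULL, offset, len) == (long)len) {
			cache_hits++;	/* all of it is in pagecache */
			return pread(fd, buf, len, offset);
		}
		cache_misses++;		/* this read would block on disk */
		punt(fd, buf, len, offset);
		return -1;		/* completion is delivered later */
	}

The hit/miss counters then tell us directly whether the extra fincore()
call is paying for itself.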
>
> Also, note that we can use *both* fincore() and preadv2() to detect the
> problematic page-just-disappeared race:
>
> 	if (fincore(fd, NULL, offset, len) == len) {
> 		if (preadv2(fd, offset, len) != len)
> 			printf("race just happened");
>
> It would be great if someone could apply the below, modify the
> preadv2() callsite as above and determine under what conditions (if
> any) the page-stealing race occurs.

Let me see what I can do. (I'll append a quick sketch of such a test at
the end of this mail.)

>  arch/x86/syscalls/syscall_64.tbl |    1 
>  include/linux/syscalls.h         |    2 
>  mm/Makefile                      |    2 
>  mm/fincore.c                     |   65 +++++++++++++++++++++++++++++
>  4 files changed, 69 insertions(+), 1 deletion(-)
>
> diff -puN arch/x86/syscalls/syscall_64.tbl~fincore arch/x86/syscalls/syscall_64.tbl
> --- a/arch/x86/syscalls/syscall_64.tbl~fincore
> +++ a/arch/x86/syscalls/syscall_64.tbl
> @@ -331,6 +331,7 @@
>  322	64	execveat		stub_execveat
>  323	64	preadv2			sys_preadv2
>  324	64	pwritev2		sys_pwritev2
> +325	common	fincore			sys_fincore
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff -puN include/linux/syscalls.h~fincore include/linux/syscalls.h
> --- a/include/linux/syscalls.h~fincore
> +++ a/include/linux/syscalls.h
> @@ -880,6 +880,8 @@ asmlinkage long sys_process_vm_writev(pi
>  asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
>  			 unsigned long idx1, unsigned long idx2);
>  asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
> +asmlinkage long sys_fincore(int fd, unsigned char __user *page_map,
> +			    loff_t offset, size_t len);
>  asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>  			    const char __user *uargs);
>  asmlinkage long sys_getrandom(char __user *buf, size_t count,
> diff -puN mm/Makefile~fincore mm/Makefile
> --- a/mm/Makefile~fincore
> +++ a/mm/Makefile
> @@ -19,7 +19,7 @@ obj-y			:= filemap.o mempool.o oom_kill.
>  			   readahead.o swap.o truncate.o vmscan.o shmem.o \
>  			   util.o mmzone.o vmstat.o backing-dev.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
> -			   compaction.o vmacache.o \
> +			   compaction.o vmacache.o fincore.o \
>  			   interval_tree.o list_lru.o workingset.o \
>  			   debug.o $(mmu-y)
>  
> diff -puN /dev/null mm/fincore.c
> --- /dev/null
> +++ a/mm/fincore.c
> @@ -0,0 +1,65 @@
> +#include <linux/syscalls.h>
> +#include <linux/pagemap.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +
> +SYSCALL_DEFINE4(fincore, int, fd, unsigned char __user *, page_map,
> +		loff_t, offset, size_t, len)
> +{
> +	struct fd f;
> +	struct address_space *mapping;
> +	loff_t cur_off;
> +	loff_t end;
> +	pgoff_t pgoff;
> +	long ret = 0;
> +
> +	if (offset < 0 || (ssize_t)len <= 0)
> +		return -EINVAL;
> +
> +	f = fdget(fd);
> +
> +	if (!f.file)
> +		return -EBADF;
> +
> +	if (is_file_hugepages(f.file)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (!S_ISREG(file_inode(f.file)->i_mode)) {
> +		ret = -EBADF;
> +		goto out;
> +	}
> +
> +	end = min_t(loff_t, offset + len, i_size_read(file_inode(f.file)));
> +	pgoff = offset >> PAGE_CACHE_SHIFT;
> +	mapping = f.file->f_mapping;
> +
> +	/*
> +	 * We probably need to do something here to reduce the chance of the
> +	 * pages being reclaimed between fincore() and read().  eg,
> +	 * SetPageReferenced(page) or mark_page_accessed(page) or
> +	 * activate_page(page).
> +	 */
> +	for (cur_off = offset; cur_off < end; ) {
> +		struct page *page;
> +		loff_t end_of_coverage;
> +
> +		page = find_get_page(mapping, pgoff);
> +		if (!page || !PageUptodate(page))
> +			break;
> +		page_cache_release(page);
> +
> +		pgoff++;
> +		end_of_coverage = min_t(loff_t, pgoff << PAGE_CACHE_SHIFT, end);
> +		ret += end_of_coverage - cur_off;
> +		cur_off = (cur_off + PAGE_CACHE_SIZE) & PAGE_CACHE_MASK;
> +	}
> +
> +out:
> +	fdput(f);
> +	return ret;
> +}
> _

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx
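Here's the quick-and-dirty race test sketch I promised above. Again
untested, with the assumptions spelled out: __NR_fincore (325) and
__NR_preadv2 (323) come from the x86-64 table in this thread, the raw
preadv2() argument order (iovec, count, pos split into low/high words,
flags last) is my reading of that series, and RWF_NONBLOCK's value is a
placeholder. The idea is to hammer a cached range while something else
applies memory pressure, and report whenever fincore() says the data is
resident but a non-blocking preadv2() comes up short.

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <sys/uio.h>

	#define __NR_fincore	325	/* from the syscall_64.tbl hunk above */
	#define __NR_preadv2	323	/* ditto */
	#define RWF_NONBLOCK	0x1	/* placeholder; take the real value
					   from the preadv2() series */

	int main(int argc, char **argv)
	{
		char buf[4096];
		struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
		unsigned long races = 0, tries = 0;
		int fd;

		if (argc < 2) {
			fprintf(stderr, "usage: %s <file>\n", argv[0]);
			return 1;
		}
		fd = open(argv[1], O_RDONLY);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* Run this while something else applies memory pressure. */
		for (;;) {
			tries++;
			if (syscall(__NR_fincore, fd, NULL, (off_t)0,
				    sizeof(buf)) != (long)sizeof(buf))
				continue;	/* not fully cached; no race to see */

			/* fincore() says resident - can the pages vanish
			   before the read? */
			if (syscall(__NR_preadv2, fd, &iov, 1, 0L, 0L,
				    RWF_NONBLOCK) != (long)sizeof(buf))
				printf("race just happened (%lu of %lu tries)\n",
				       ++races, tries);
		}
	}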