This patchset introduces two new syscalls, preadv2 and pwritev2. They are the
same syscalls as preadv and pwritev but with an additional flags argument.
Additionally, preadv2 implements a new RWF_NONBLOCK flag.

The RWF_NONBLOCK flag in preadv2 makes it possible to perform a non-blocking
read from regular files in buffered IO mode. It only works on filesystems that
keep file data in the page cache, and the read is only satisfied from data that
is already cached.

We discussed these changes at this year's LSF/MM summit in Boston. More details
on the Samba use case, the numbers, and the presentation are available at this
link:
https://lists.samba.org/archive/samba-technical/2015-March/106290.html

Please stay tuned for the man page and xfstests patches. They will be sent in
reply to this message.

Latest changes highlight:
- Drops RWF_DSYNC from pwritev2, per Christoph and Andrew
- Updated man pages
- Added tests for this functionality to xfstests, per Dave Chinner
- Based on top of 4.1-rc3
- Tests / numbers using samba and a CIFS client FIO engine

Forward looking:

Christoph committed to sending a separate patch series for the RWF_DSYNC
implementation in pwritev2 so it can be evaluated independently. This helps
with implementing userspace file servers for protocols that have a
per-operation sync flag (CIFS).

Additionally, Christoph committed to implementing RWF_NONBLOCK for the write
case as well (in pwritev2) at a later date.

Background:

Using a threadpool to emulate non-blocking operations on regular buffered
files is a common pattern today (samba, libuv, etc.). Applications split the
work between network-bound threads (epoll) and an IO threadpool. Not every
application can use the sendfile syscall (TLS / post-processing).

This common pattern leads to increased request latency. The latency can be due
to additional synchronization between the threads, or to fast (cached data)
requests getting stuck behind slow (large / uncached data) requests.

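To make the pattern concrete, here is a minimal userspace sketch of a read path
that tries a non-blocking read first and only falls back to the slow path on a
miss. Several things in it are assumptions rather than part of this series:
RWF_NONBLOCK exists only in these patches, so the sketch substitutes
RWF_NOWAIT, the similar flag found in later mainline headers; it relies on the
glibc preadv2() wrapper; and the names read_fast_or_slow() and
slow_path_read() are hypothetical, with the "threadpool" reduced to an inline
blocking pread().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumption: RWF_NOWAIT stands in for the RWF_NONBLOCK flag proposed here.
 * Define it ourselves if the installed headers do not provide it. */
#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008
#endif

/* Hypothetical stand-in for queueing the request on the IO threadpool:
 * it simply performs the blocking read inline. */
ssize_t slow_path_read(int fd, void *buf, size_t len, off_t off)
{
	return pread(fd, buf, len, off);
}

/* Try to satisfy the read from the page cache without blocking; fall back
 * to the slow path when the data is not immediately available or when the
 * kernel / filesystem does not support the flag.  *fast_hit reports which
 * path served the request.  A short fast read is returned as-is and the
 * caller is expected to resubmit the remainder. */
ssize_t read_fast_or_slow(int fd, void *buf, size_t len, off_t off,
			  bool *fast_hit)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t n;

	assert(buf != NULL && fast_hit != NULL);

	n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
	if (n >= 0) {
		*fast_hit = true;
		return n;
	}

	*fast_hit = false;
	/* EAGAIN: data not cached.  The other codes mean "flag or syscall
	 * unsupported"; a genuine EINVAL is reported again by pread(). */
	if (errno == EAGAIN || errno == EOPNOTSUPP ||
	    errno == ENOSYS || errno == EINVAL)
		return slow_path_read(fd, buf, len, off);

	return -1;
}
```

A network thread would call read_fast_or_slow() directly from its event loop;
in a real server the fallback branch would enqueue the request instead of
blocking, and the fast_hit result could feed per-file hit/miss accounting.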
The preadv2 syscall with RWF_NONBLOCK lets userspace applications bypass
enqueuing an operation in the threadpool if the data is already available in
the page cache.

Performance numbers (newer Samba):

https://drive.google.com/file/d/0B3maCn0jCvYncndGbXJKbGlhejQ/view?usp=sharing
https://docs.google.com/spreadsheets/d/1GGTivi-MfZU0doMzomG4XUo9ioWtRvOGQ5FId042L6s/edit?usp=sharing

Performance numbers (older):

Some perf data generated using fio, comparing the posix aio engine to a version
of the posix aio engine that attempts to perform "fast" reads before
submitting the operations to the queue. This workload runs on an ext4 partition
on raid0 (test / build rig). It simulates our database access pattern using
16KB read accesses. Our database uses a home-spun posix-aio-like queue (samba
does the same thing).

f1: ~73% rand read over mostly cached data (zipf med-size dataset)
f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
f3: ~9% seq-read over large dataset

before:

f1:
    bw (KB /s): min= 11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
    lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
    lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
f2:
    bw (KB /s): min= 2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
    lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
    lat (msec) : >=2000=4.33%
f3:
    bw (KB /s): min= 0, max=265568, per=99.95%, avg=174575.10, stdev=34526.89
    lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
    lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
    lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
total:
    READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s, mint=600001msec, maxt=600113msec

after (with fast read using preadv2 before submit):

f1:
    bw (KB /s): min= 3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
    lat (usec) : 2=70.63%, 4=0.01%
    lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
f2:
    bw (KB /s): min= 2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
    lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
    lat (msec) : >=2000=9.99%
f3:
    bw (KB /s): min= 1, max=245448, per=100.00%, avg=177366.50, stdev=35995.60
    lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
    lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
    lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%
total:
    READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s, mint=600020msec, maxt=600178msec

Interpreting the results, you can see that total bandwidth stays the same but
overall request latency is decreased in the f1 (random, mostly cached) and f3
(sequential) workloads. There is a slight bump in latency for f2, since it's
random data that's unlikely to be cached but we're always trying a "fast read".

In our application we have started keeping track of "fast read" hits/misses,
and for files / requests that have a low hit ratio we don't do "fast reads",
mostly getting rid of the extra latency in the uncached cases. In our real
world workload we were able to reduce average response time by 20 to 30%
(depending on the amount of IO done by the request).

I've performed other benchmarks and I have not observed any perf regressions in
any of the normal (old) code paths.

Full change log:

Version 7 highlight:
- Drops RWF_DSYNC from pwritev2, per Christoph and Andrew
- Updated man pages
- Added tests for this functionality to xfstests, per Dave Chinner
- Based on top of 4.1-rc3
- Tests / numbers using samba and a CIFS client FIO engine

Version 6 highlight:
- Compat syscall flag checks, per Jeff.
- Minor stylistic suggestions.

Version 5 highlight:
- XFS support for RWF_NONBLOCK, from Christoph.
- RWF_DSYNC flag and support for pwritev2, from Christoph.
- Implemented compat syscalls, per Jeff.
- Missing nfs, ceph changes from older patchset.

Version 4 highlight:
- Updated for 3.18-rc1.
- Performance data from our application.
- First stab at man page with Jeff's help. Patch is sent as a reply.

RFC Version 3 highlights:
- Down to 2 syscalls from 4; can use fp or argument position.
- RWF_NONBLOCK flag value is not the same as O_NONBLOCK, per Jeff.

RFC Version 2 highlights:
- Put the flags argument into kiocb (less noise), per Al Viro
- O_DIRECT checking early in the process, per Jeff Moyer
- Resolved duplicate (c&p) code in syscall code, per Jeff
- Included perf data in thread cover letter, per Jeff
- Created a new flag (not O_NONBLOCK) for readv2, per Jeff

I have co-developed these changes with Christoph Hellwig.

Christoph Hellwig (1):
  xfs: add RWF_NONBLOCK support

Milosz Tanski (4):
  vfs: Prepare for adding a new preadv/pwritev with user flags.
  vfs: Define new syscalls preadv2,pwritev2
  x86: wire up preadv2 and pwritev2
  vfs: RWF_NONBLOCK flag for preadv2

 arch/x86/syscalls/syscall_32.tbl  |   2 +
 arch/x86/syscalls/syscall_64.tbl  |   2 +
 drivers/target/target_core_file.c |   6 +-
 fs/ceph/file.c                    |   2 +
 fs/cifs/file.c                    |   6 +
 fs/nfs/file.c                     |   5 +-
 fs/nfsd/vfs.c                     |   4 +-
 fs/ocfs2/file.c                   |   6 +
 fs/pipe.c                         |   3 +-
 fs/read_write.c                   | 229 +++++++++++++++++++++++++++++---------
 fs/splice.c                       |   2 +-
 fs/xfs/xfs_file.c                 |  28 ++++-
 include/linux/aio.h               |   2 +
 include/linux/compat.h            |   6 +
 include/linux/fs.h                |   6 +-
 include/linux/syscalls.h          |   6 +
 include/uapi/asm-generic/unistd.h |   6 +-
 mm/filemap.c                      |  23 +++-
 mm/shmem.c                        |   4 +
 19 files changed, 279 insertions(+), 69 deletions(-)

--
1.9.1