Hi Linus and others,

as written in a private mail before, I'm currently trying to make use of
IORING_OP_SPLICE in order to get zero copy support in Samba.

The most important use cases are 8 MByte reads and writes to files,
where "memcpy" (at the lowest end copy_user_enhanced_fast_string())
is the obvious performance killer.

I have a prototype that offers really great performance by avoiding
"memcpy" and using splice() instead (as IORING_OP_SPLICE, in order to
get it async).

So we have two cases:

1. network -> socket -> splice -> pipe -> splice -> file -> storage

2. storage -> file -> splice -> pipe -> splice -> socket -> network

With 1. I guess everything can work reliably: once the pages are
created/filled in the socket receive buffer they are used exclusively
and they won't be shared on the way to the file. Which means we can be
sure that the data arrives unmodified in the file(system).

But with 2. there's a problem, as the pages from the file, which are
spliced into the pipe, are still shared with the file(system) without
copy on write. It means writes to the file after the first splice
modify the content of the spliced pages! So the content may change in
an uncontrolled way before it reaches the network!

I created a little example that demonstrates the problem (see below),
it gives the following output:

open(O_TMPFILE) => ffd[3]
pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0x1f sret[4096]
pipe() => ret[0]
splice(count=PIPE_BUF*2,ofs=0) sret[8192]
pwrite(count=PIPE_BUF,ofs=0) 0xf0 sret[4096]
pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0xf0 sret[4096]
read(from_pipe, count=PIPE_BUF) sret[4096]
memcmp() at ofs=0, expecting 0x00 => ret[240]
memcmp() at ofs=0, checking for 0xf0 => ret[0]
read(from_pipe, count=PIPE_BUF) sret[4096]
memcmp() at ofs=PIPE_BUF, expecting 0x1f => ret[209]
memcmp() at ofs=PIPE_BUF, checking for 0xf0 => ret[0]

After reading from the pipe we get the values we have written to
the file after the splice, instead of the values the file had at the
time of the splice.

For things like web servers, which mostly serve static content, this
isn't a problem, but it is for Samba, where reads and writes may happen
within microseconds of each other, before the content is pushed to the
network.

I'm wondering if there's a possible way out of this, maybe triggered by
a new flag passed to splice.

I looked through the code and noticed the existence of IOMAP_F_SHARED.
Maybe the splice from the page cache to the pipe could set
IOMAP_F_SHARED while incrementing the refcount, and the network driver
could remove it again when the refcount reaches 1 again.

Is there any other way we could achieve something like this?

In addition and/or as an alternative I was thinking about a flag to
preadv2() (and IORING_OP_READV) to indicate the use of something like
async_memcpy(), instead of doing the copy via the cpu. That, in
combination with IORING_OP_SENDMSG_ZC, would avoid "memcpy" on the cpu.

Any hints, remarks and prototype patches are highly welcome.

Thanks!
metze

#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <limits.h>
#include <sys/stat.h>

int main(void)
{
	int ffd;
	int pfds[2];
	char buf [PIPE_BUF] = {0, };
	char buf2 [PIPE_BUF] = {0, };
	ssize_t sret;
	int ret;
	off_t ofs;

	/* The first PIPE_BUF bytes stay a hole (0x00), the second get 0x1f */
	memset(buf, 0x1f, PIPE_BUF);
	ffd = open("/tmp/", O_RDWR | O_TMPFILE, S_IRUSR | S_IWUSR);
	printf("open(O_TMPFILE) => ffd[%d]\n", ffd);
	sret = pwrite(ffd, buf, PIPE_BUF, PIPE_BUF);
	printf("pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0x1f sret[%zd]\n", sret);

	/* Splice both blocks from the file into the pipe */
	ret = pipe(pfds);
	printf("pipe() => ret[%d]\n", ret);
	ofs = 0;
	sret = splice(ffd, &ofs, pfds[1], NULL, PIPE_BUF*2, 0);
	printf("splice(count=PIPE_BUF*2,ofs=0) sret[%zd]\n", sret);

	/* Overwrite both blocks in the file with 0xf0 after the splice */
	memset(buf, 0xf0, PIPE_BUF);
	sret = pwrite(ffd, buf, PIPE_BUF, 0);
	printf("pwrite(count=PIPE_BUF,ofs=0) 0xf0 sret[%zd]\n", sret);
	sret = pwrite(ffd, buf, PIPE_BUF, PIPE_BUF);
	printf("pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0xf0 sret[%zd]\n", sret);

	/* First block from the pipe: was 0x00 at the time of the splice */
	sret = read(pfds[0], buf, PIPE_BUF);
	printf("read(from_pipe, count=PIPE_BUF) sret[%zd]\n", sret);
	memset(buf2, 0x00, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=0, expecting 0x00 => ret[%d]\n", ret);
	memset(buf2, 0xf0, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=0, checking for 0xf0 => ret[%d]\n", ret);

	/* Second block from the pipe: was 0x1f at the time of the splice */
	sret = read(pfds[0], buf, PIPE_BUF);
	printf("read(from_pipe, count=PIPE_BUF) sret[%zd]\n", sret);
	memset(buf2, 0x1f, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=PIPE_BUF, expecting 0x1f => ret[%d]\n", ret);
	memset(buf2, 0xf0, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=PIPE_BUF, checking for 0xf0 => ret[%d]\n", ret);

	return 0;
}
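
For contrast, here is a minimal sketch of the copying path that the
splice() variant is meant to replace: the data goes from the file into
a user buffer via pread() and from there into the pipe via write(). The
helper name copied_data_is_stable() and the mkstemp() template path are
assumptions made for this illustration and are not part of the program
above (which uses O_TMPFILE). Since the pipe holds its own copy here, a
later pwrite() to the file is not expected to change what is read back
from the pipe, which is exactly the guarantee that gets lost in case 2.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if the data read from the pipe still has the value the file
 * had before it was overwritten, 0 if it changed, -1 on a syscall error.
 */
int copied_data_is_stable(void)
{
	char path[] = "/tmp/copy-vs-splice-XXXXXX"; /* assumed location */
	char buf[PIPE_BUF];
	char expect[PIPE_BUF];
	int pfds[2];
	int ffd;
	int result = -1;

	/* mkstemp() + unlink() stands in for O_TMPFILE here. */
	ffd = mkstemp(path);
	if (ffd < 0)
		return -1;
	unlink(path);
	if (pipe(pfds) != 0) {
		close(ffd);
		return -1;
	}

	/* File content at the time of the "send": 0x1f. */
	memset(buf, 0x1f, sizeof(buf));
	if (pwrite(ffd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		goto out;

	/* Copying path: file -> user buffer -> pipe (copies on the cpu). */
	memset(buf, 0x00, sizeof(buf));
	if (pread(ffd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		goto out;
	if (write(pfds[1], buf, sizeof(buf)) != (ssize_t)sizeof(buf))
		goto out;

	/* Overwrite the file after the data was queued in the pipe. */
	memset(buf, 0xf0, sizeof(buf));
	if (pwrite(ffd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		goto out;

	/* Compare what comes out of the pipe with the old content. */
	if (read(pfds[0], buf, sizeof(buf)) != (ssize_t)sizeof(buf))
		goto out;
	memset(expect, 0x1f, sizeof(expect));
	result = (memcmp(buf, expect, sizeof(buf)) == 0);
out:
	close(pfds[0]);
	close(pfds[1]);
	close(ffd);
	return result;
}
```

Calling copied_data_is_stable() from a small main() should give 1 with
this copying path. The price is the cpu copy the prototype tries to
avoid.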