Hi Jens, Christoph, I need some help tracking down a bug in the patches that make the bio using page pinning or no pinning using iov_iter_extract_pages(). The bug causes seemingly random memory corruption once the "block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages" patch is applied. The bug was detected by a syzbot special: https://lore.kernel.org/r/000000000000b0b3c005f3a09383@xxxxxxxxxx/ The basic test body creates/opens a file, truncates it, opens it again O_DIRECT and then uses sendfile to copy from the file to itself, causing the file to extend as it goes. I've added a reduced testcase below. Note that the problem only seems to occur if several instances of the test are run in parallel. After a few iterations, random memory corruption start showing up and I see things like: syz-direct-send[6095]: segfault at 0 ip 0000000000000000 sp 00007ffc81488b28 error 14 in syz-direct-sendfile[400000+1000] likely on CPU 0 (core 0, socket 0) Code: Unable to access opcode bytes at 0xffffffffffffffd6. BUG: Bad rss-counter state mm:00000000d5d20a39 type:MM_FILEPAGES val:2 BUG: Bad rss-counter state mm:00000000d5d20a39 type:MM_ANONPAGES val:2 BUG: non-zero pgtables_bytes on freeing mm: 8192 The bug goes away if the file is not truncated, O_DIRECT is not used or two different files are used. I've investigated the splice and iov_iter code and looked at what sendfile() is doing in this case: (1) sendfile creates buffer pages and adds them into a pipe, does an, in this case, DIO read into those pages, then calls the fs write_iter to write the data to the file. (2) iov_iter_extract_pages() does not get refs/pins on the pages extracted from an ITER_PIPE iterator - but it shouldn't need to as the pipe holds the refs. These pages are passed to DIO read - this op is synchronous, so any bios associated with it should be complete. (3) I enabled the page_ref tracepoints and added a page flag to limit it to pages allocated by append_pipe(). This shows the buffer pipe pages being added and I made it dump the list of them in __bio_release_pages() (which I made non-optional in bio_release_pages()). (4) I added some extra page_ref_set tracepoints with weird "val" parameters to add markers into the log. (5) I added a tracepoint to trace the lifetime of a bio struct and a flag to turn on the tracing, set when the pageflag added in (3) is seen. Most of the time I can see the bio being destroyed in the correct order with regard to the splice code, though occasionally there's a bit missing. (6) Substituting a fixed preallocated page for the page coming out of the pipe in iter_file_splice_write() doesn't get rid of the bug: - array[n].bv_page = buf->page; + array[n].bv_page = splice_tmp; (7) Getting an extra ref on the buffer pipe page and deliberately leaking it gets rid of the problem. (8) Substituting a fixed preallocated page for the page sent to the DIO read in iov_iter_extract_pipe_pages() gets rid of the problem. The pages going through the pipe seem to passed to write_iter with no issues. (9) I've tried instrumenting kmap() and co. to catch debug-marked pages being accessed after they've been released, but didn't see anything. This might not catch if DMA is doing the corrupting. (10) On the notion that DMA might do the corrupting, I've tried adding a permanent ref on the pages, adding them to a list and scanning them occasionally - but that doesn't catch anything. (11) KASAN doesn't spot anything interesting - which might also suggest DMA-based corruption. But since we're dealing with the contents of pages, not the page structs themselves (I think), I'm not sure kasan would see spot anything. I'm wondering if the apparent interaction with sendfile/splice is actually a red herring and that the page turnover that that induces is having an effect. One thing I don't see is how commenting out ftruncate() should cause the problem to go away if it's something to do with the splice buffer pipe - though I guess ftruncate() would release a bunch of pages. Here's an excerpt from a trace of something I'd expect to see: page_ref_set: pfn=0x10e38c flags=debug_mark count=1 mapcount=0 mapping=0 mt=0 val=777 page_ref_set: pfn=0x10e38c flags=debug_mark count=1 mapcount=0 mapping=0 mt=0 val=666 bio: bio=00038d84 ADD-PG I=10e38c bio: bio=00038d84 END-IO I=0 page_ref_set: pfn=0x10e38c flags=debug_mark count=1 mapcount=0 mapping=0 mt=0 val=623 bio: bio=00038d84 UNINIT I=0 bio_endio: bio=00038d84 iomap_dio_bio_end_io+0x0/0xec bio: bio=00038d84 REL-PG I=0 page_ref_set: pfn=0x10e38c flags=debug_mark count=1 mapcount=0 mapping=0 mt=0 val=980 bio: bio=00038d84 FREE I=0 bio: bio=00038d84 UNINIT I=0 page_ref_set: pfn=0x10e38c flags=debug_mark count=1 mapcount=0 mapping=0 mt=0 val=888 page_ref_mod_and_test: pfn=0x10e38c flags=debug_mark count=0 mapcount=0 mapping=0 mt=0 val=-1 ret=1 The weird val=N codes on page_ref_set lines are: 777 - The page iov_iter_extract_pipe_pages() got from append_pipe() 666 - __bio_add_page() adding a page 623 - bio_endio() logging a page 98n - __bio_release_pages() logging the nth page 888 - iter_file_splice_write() adding page to array[] But occasionally I'm seeing something like: page_ref_set: pfn=0x1102df flags=debug_mark count=1 mapcount=0 mapping=0000000000000000 mt=0 val=777 page_ref_set: pfn=0x1102df flags=debug_mark count=1 mapcount=0 mapping=0000000000000000 mt=0 val=666 bio: bio=0000e514 ADD-PG I=1102df page_ref_set: pfn=0x1102df flags=debug_mark count=1 mapcount=0 mapping=0000000000000000 mt=0 val=888 page_ref_mod_and_test: pfn=0x1102df flags=debug_mark count=0 mapcount=0 mapping=0000000000000000 mt=0 val=-1 ret=1 though I'm not sure why. Could it be an attempt to read beyond the EOF? I don't see the bio being torn down, but the page is passed to iter_file_splice_write() and released, despite for all I know still with outstanding I/O pending. Another possibility is that the bio flag got cleared. David --- #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <sys/sendfile.h> #include <sys/wait.h> #define file_size 0x800 #define send_size 0x1dd00 #define repeat_count 1000 static char buffer[send_size]; int main(int argc, char *argv[]) { int in, out, i, wt; if (argc != 2 || !argv[1][0]) { fprintf(stderr, "Usage: %s <file>\n", argv[0]); exit(2); } for (i = 0; i < repeat_count; i++) { switch (fork()) { case -1: perror("fork"); exit(1); case 0: out = creat(argv[1], 0666); if (out < 0) { perror(argv[1]); exit(1); } if (ftruncate(out, file_size) < 0) { perror("ftruncate"); exit(1); } if (lseek(out, 0x200, SEEK_SET) < 0) { perror("lseek"); exit(1); } in = open(argv[1], O_RDONLY | O_DIRECT | O_NOFOLLOW); if (in < 0) { perror("open"); exit(1); } if (sendfile(out, in, NULL, send_size) < 0) { perror("sendfile"); exit(1); } exit(0); default: if (wait(&wt) < 0) { perror("wait"); exit(1); } break; } } exit(0); }