Need help tracking down a bug in the bio-FOLL_PIN patches

Hi Jens, Christoph,

I need some help tracking down a bug in the patches that make the bio code
use page pinning, or no pinning at all, via iov_iter_extract_pages().  The
bug causes seemingly random memory corruption once the "block: Convert
bio_iov_iter_get_pages to use iov_iter_extract_pages" patch is applied.

The bug was detected by a syzbot special:

	https://lore.kernel.org/r/000000000000b0b3c005f3a09383@xxxxxxxxxx/

The basic test body creates/opens a file, truncates it, opens it again
O_DIRECT and then uses sendfile to copy from the file to itself, causing the
file to extend as it goes.  I've added a reduced testcase below.  Note that
the problem only seems to occur if several instances of the test are run in
parallel.  After a few iterations, random memory corruption starts showing up
and I see things like:

syz-direct-send[6095]: segfault at 0 ip 0000000000000000 sp 00007ffc81488b28 error 14 in syz-direct-sendfile[400000+1000] likely on CPU 0 (core 0, socket 0)
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
BUG: Bad rss-counter state mm:00000000d5d20a39 type:MM_FILEPAGES val:2
BUG: Bad rss-counter state mm:00000000d5d20a39 type:MM_ANONPAGES val:2
BUG: non-zero pgtables_bytes on freeing mm: 8192

The bug goes away if the file is not truncated, O_DIRECT is not used or two
different files are used.
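
Since the corruption only seems to show up when several instances run at the
same time, a small launcher can help.  The following is a minimal sketch and
not part of the original testcase: run_parallel() is a hypothetical helper
that starts n copies of a command at once and counts how many did not exit
cleanly.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (not from the original report): start n copies of
 * the command in argv[] at once and wait for all of them.
 * Returns the number of copies that did not exit with status 0 (non-zero
 * exit or killed by a signal), or -1 if a fork() or wait() call failed.
 */
int run_parallel(int n, char *const argv[])
{
	int i, started = 0, failed = 0, wt;

	assert(n > 0 && argv && argv[0]);

	for (i = 0; i < n; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			break;
		}
		if (pid == 0) {
			execvp(argv[0], argv);
			perror(argv[0]);	/* only reached if exec failed */
			_exit(127);
		}
		started++;
	}

	/* Reap every child that was started, even after a fork failure. */
	for (i = 0; i < started; i++) {
		if (wait(&wt) < 0) {
			perror("wait");
			return -1;
		}
		if (!WIFEXITED(wt) || WEXITSTATUS(wt) != 0)
			failed++;
	}

	return started == n ? failed : -1;
}
```

A main() wrapping this could pass its own argument vector through, e.g.
run_parallel(atoi(argv[1]), &argv[2]).  Whether the instances should share
one file or each get their own is a choice left to the caller.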

I've investigated the splice and iov_iter code and looked at what sendfile()
is doing in this case:

 (1) sendfile creates buffer pages and adds them into a pipe, does a read
     (in this case a DIO read) into those pages, then calls the fs
     write_iter to write the data to the file.

 (2) iov_iter_extract_pages() does not get refs/pins on the pages extracted
     from an ITER_PIPE iterator - but it shouldn't need to as the pipe holds
     the refs.  These pages are passed to DIO read - this op is synchronous,
     so any bios associated with it should be complete.

 (3) I enabled the page_ref tracepoints and added a page flag to limit it to
     pages allocated by append_pipe().  This shows the buffer pipe pages being
     added and I made it dump the list of them in __bio_release_pages() (which
     I made non-optional in bio_release_pages()).

 (4) I added some extra page_ref_set tracepoints with weird "val" parameters
     to add markers into the log.

 (5) I added a tracepoint to trace the lifetime of a bio struct and a flag to
     turn on the tracing, set when the pageflag added in (3) is seen.  Most of
     the time I can see the bio being destroyed in the correct order with
     regard to the splice code, though occasionally there's a bit missing.

 (6) Substituting a fixed preallocated page for the page coming out of the
     pipe in iter_file_splice_write() doesn't get rid of the bug:

	-	array[n].bv_page = buf->page;
	+	array[n].bv_page = splice_tmp;

 (7) Getting an extra ref on the buffer pipe page and deliberately leaking it
     gets rid of the problem.

 (8) Substituting a fixed preallocated page for the page sent to the DIO read
     in iov_iter_extract_pipe_pages() gets rid of the problem.  The pages
     going through the pipe seem to be passed to write_iter with no issues.

 (9) I've tried instrumenting kmap() and co. to catch debug-marked pages being
     accessed after they've been released, but didn't see anything.  This
     might not catch the problem if DMA is doing the corrupting.

(10) On the notion that DMA might do the corrupting, I've tried adding a
     permanent ref on the pages, adding them to a list and scanning them
     occasionally - but that doesn't catch anything.

(11) KASAN doesn't spot anything interesting - which might also suggest
     DMA-based corruption.  But since we're dealing with the contents of
     pages, not the page structs themselves (I think), I'm not sure KASAN
     would spot anything.
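
Points (2) and (7) can be pictured with a userspace toy model.  This is only
a sketch of the suspicion, not a confirmed diagnosis and not kernel code:
struct toy_page, toy_put() and toy_scenario() are all made-up names.  The
model assumes the pipe holds the only ref, the bio takes none, and the pipe
buffer is released once the write side is done.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page: just a refcount and an "I/O pending" flag. */
struct toy_page {
	int refcount;
	bool io_in_flight;
	bool freed_with_io;	/* set if the last ref went away during I/O */
};

static void toy_put(struct toy_page *pg)
{
	assert(pg->refcount > 0);
	if (--pg->refcount == 0 && pg->io_in_flight)
		pg->freed_with_io = true;
}

/*
 * Returns true if the page would have been freed while I/O was still
 * in flight.
 *   extra_ref              - take and leak one extra ref, as in (7)
 *   io_done_before_release - the read really completes before the pipe
 *                            buffer is released, as (2) assumes
 */
bool toy_scenario(bool extra_ref, bool io_done_before_release)
{
	struct toy_page pg = { .refcount = 1 };	/* the pipe's ref */

	if (extra_ref)
		pg.refcount++;

	pg.io_in_flight = true;		/* bio submitted, no ref taken */
	if (io_done_before_release)
		pg.io_in_flight = false;

	toy_put(&pg);			/* pipe buffer released */
	return pg.freed_with_io;
}
```

In this model only the combination "no extra ref, I/O not yet complete"
frees the page under the I/O, which would be consistent with (7) hiding the
problem.  It says nothing about why the I/O would still be outstanding.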

I'm wondering if the apparent interaction with sendfile/splice is actually a
red herring and it's really the page turnover it induces that is having an
effect.

One thing I don't see is how commenting out ftruncate() should cause the
problem to go away if it's something to do with the splice buffer pipe -
though I guess ftruncate() would release a bunch of pages.

Here's an excerpt from a trace of something I'd expect to see:

   page_ref_set: pfn=0x10e38c flags=debug_mark count=1 mapcount=0 mapping=0 mt=0 val=777
   page_ref_set: pfn=0x10e38c flags=debug_mark count=1 mapcount=0 mapping=0 mt=0 val=666
   bio: bio=00038d84 ADD-PG I=10e38c
   bio: bio=00038d84 END-IO I=0
   page_ref_set: pfn=0x10e38c flags=debug_mark count=1 mapcount=0 mapping=0 mt=0 val=623
   bio: bio=00038d84 UNINIT I=0
   bio_endio: bio=00038d84 iomap_dio_bio_end_io+0x0/0xec
   bio: bio=00038d84 REL-PG I=0
   page_ref_set: pfn=0x10e38c flags=debug_mark count=1 mapcount=0 mapping=0 mt=0 val=980
   bio: bio=00038d84 FREE   I=0
   bio: bio=00038d84 UNINIT I=0
   page_ref_set: pfn=0x10e38c flags=debug_mark count=1 mapcount=0 mapping=0 mt=0 val=888
   page_ref_mod_and_test: pfn=0x10e38c flags=debug_mark count=0 mapcount=0 mapping=0 mt=0 val=-1 ret=1

The weird val=N codes on page_ref_set lines are:
	777 - The page iov_iter_extract_pipe_pages() got from append_pipe()
	666 - __bio_add_page() adding a page
	623 - bio_endio() logging a page
	98n - __bio_release_pages() logging the nth page
	888 - iter_file_splice_write() adding page to array[]
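
For anyone post-processing the trace, the marker codes above can be put in a
small lookup.  This is a sketch with made-up names (the MARK_* constants and
marker_name() are not in the patches), and it assumes the "98n" form only
covers n = 0..9.

```c
#include <assert.h>
#include <stddef.h>

/* Marker values passed as "val" to the extra page_ref_set tracepoints. */
enum {
	MARK_PIPE_PAGE    = 777,
	MARK_BIO_ADD      = 666,
	MARK_BIO_ENDIO    = 623,
	MARK_BIO_REL_BASE = 980,	/* 98n: nth page, assumed n = 0..9 */
	MARK_SPLICE_WRITE = 888,
};

/*
 * Map a "val" marker to a description.  *index is set to n for the 98n
 * markers and to -1 otherwise.  Returns NULL for values that are not
 * markers (e.g. the real -1 delta on page_ref_mod_and_test lines).
 */
const char *marker_name(int val, int *index)
{
	assert(index);
	*index = -1;

	if (val >= MARK_BIO_REL_BASE && val <= MARK_BIO_REL_BASE + 9) {
		*index = val - MARK_BIO_REL_BASE;
		return "__bio_release_pages() logging a page";
	}

	switch (val) {
	case MARK_PIPE_PAGE:
		return "iov_iter_extract_pipe_pages() got page from append_pipe()";
	case MARK_BIO_ADD:
		return "__bio_add_page() adding a page";
	case MARK_BIO_ENDIO:
		return "bio_endio() logging a page";
	case MARK_SPLICE_WRITE:
		return "iter_file_splice_write() adding page to array[]";
	default:
		return NULL;
	}
}
```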

But occasionally I'm seeing something like:

   page_ref_set: pfn=0x1102df flags=debug_mark count=1 mapcount=0 mapping=0000000000000000 mt=0 val=777
   page_ref_set: pfn=0x1102df flags=debug_mark count=1 mapcount=0 mapping=0000000000000000 mt=0 val=666
   bio: bio=0000e514 ADD-PG I=1102df
   page_ref_set: pfn=0x1102df flags=debug_mark count=1 mapcount=0 mapping=0000000000000000 mt=0 val=888
   page_ref_mod_and_test: pfn=0x1102df flags=debug_mark count=0 mapcount=0 mapping=0000000000000000 mt=0 val=-1 ret=1

though I'm not sure why.  Could it be an attempt to read beyond the EOF?  I
don't see the bio being torn down, but the page is passed to
iter_file_splice_write() and released even though, for all I know, it still
has I/O outstanding.  Another possibility is that the bio flag got cleared,
so the teardown simply wasn't traced.
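
To pick such cases out of a long trace, something like the following could
be run over each line.  It is a rough sketch that assumes the line formats
shown in the excerpts; check_trace_line() and its small fixed-size table are
made up for illustration.  It flags a page that reaches the 888 marker when
the bio it was added to has not logged REL-PG.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TRACKED 64

struct tracked {
	unsigned long pfn, bio;
	int released, in_use;
};
static struct tracked table[MAX_TRACKED];

static struct tracked *find_slot(unsigned long pfn, int want_free)
{
	for (int i = 0; i < MAX_TRACKED; i++) {
		if (want_free ? !table[i].in_use
			      : (table[i].in_use && table[i].pfn == pfn))
			return &table[i];
	}
	return NULL;
}

/* Returns 1 if this line is a val=888 marker for a page whose bio was
 * seen in ADD-PG but never in REL-PG, otherwise 0. */
int check_trace_line(const char *line)
{
	unsigned long bio, pfn, arg;
	char op[16];
	const char *p, *v;
	struct tracked *t;
	int val;

	assert(line);

	p = strstr(line, "bio: bio=");
	if (p && sscanf(p, "bio: bio=%lx %15s I=%lx", &bio, op, &arg) == 3) {
		if (strcmp(op, "ADD-PG") == 0) {
			t = find_slot(arg, 0);
			if (!t)
				t = find_slot(0, 1);
			if (t) {	/* silently dropped if table is full */
				t->pfn = arg;
				t->bio = bio;
				t->released = 0;
				t->in_use = 1;
			}
		} else if (strcmp(op, "REL-PG") == 0) {
			for (int i = 0; i < MAX_TRACKED; i++)
				if (table[i].in_use && table[i].bio == bio)
					table[i].released = 1;
		}
		return 0;
	}

	p = strstr(line, "page_ref_set: pfn=");
	v = strstr(line, " val=");
	if (p && v && sscanf(p, "page_ref_set: pfn=%lx", &pfn) == 1 &&
	    sscanf(v, " val=%d", &val) == 1 && val == 888) {
		t = find_slot(pfn, 0);
		if (t) {
			int suspicious = !t->released;

			t->in_use = 0;	/* page has left the bio's hands */
			return suspicious;
		}
	}
	return 0;
}
```

Fed the two excerpts above, this stays quiet for the first and flags the 888
line of the second.  It cannot tell a missing teardown from one that merely
went untraced.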

David
---
#define _GNU_SOURCE 
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/wait.h>

#define file_size 0x800
#define send_size 0x1dd00
#define repeat_count 1000

static char buffer[send_size];

int main(int argc, char *argv[])
{
	int in, out, i, wt;

	if (argc != 2 || !argv[1][0]) {
		fprintf(stderr, "Usage: %s <file>\n", argv[0]);
		exit(2);
	}

	for (i = 0; i < repeat_count; i++) {
		switch (fork()) {
		case -1:
			perror("fork");
			exit(1);
		case 0:
			out = creat(argv[1], 0666);
			if (out < 0) {
				perror(argv[1]);
				exit(1);
			}

			/* Extend the empty file; removing this hides the bug. */
			if (ftruncate(out, file_size) < 0) {
				perror("ftruncate");
				exit(1);
			}

			if (lseek(out, 0x200, SEEK_SET) < 0) {
				perror("lseek");
				exit(1);
			}

			/* Reopen the same file for O_DIRECT reads. */
			in = open(argv[1], O_RDONLY | O_DIRECT | O_NOFOLLOW);
			if (in < 0) {
				perror("open");
				exit(1);
			}

			/* Copy the file onto itself, extending it as it goes. */
			if (sendfile(out, in, NULL, send_size) < 0) {
				perror("sendfile");
				exit(1);
			}
			exit(0);

		default:
			if (wait(&wt) < 0) {
				perror("wait");
				exit(1);
			}
			break;
		}
	}

	exit(0);
}




