On Thu, Jun 04, 2020 at 04:50:50PM -0700, Matthew Wilcox wrote:
> On Fri, Jun 05, 2020 at 09:30:53AM +1000, Dave Chinner wrote:
> > On Thu, Jun 04, 2020 at 04:05:19PM -0700, Matthew Wilcox wrote:
> > > On Fri, Jun 05, 2020 at 08:57:26AM +1000, Dave Chinner wrote:
> > > > On Thu, Jun 04, 2020 at 01:23:40PM -0700, Matthew Wilcox wrote:
> > > > > From: "Matthew Wilcox (Oracle)" <willy@xxxxxxxxxxxxx>
> > > > >
> > > > > Test generic/019 often results in:
> > > > >
> > > > > WARNING: at fs/iomap/buffered-io.c:1069 iomap_page_mkwrite_actor+0x57/0x70
> > > > >
> > > > > Since this can happen due to a storage error, we should not WARN for it.
> > > > > Just return -EIO, which will be converted to a SIGBUS for the hapless
> > > > > task attempting to write to the page that we can't read.
> > > >
> > > > Why didn't the "read" part of the fault which had the EIO error fail
> > > > the page fault? i.e. why are we waiting until deep inside the write
> > > > fault path to error out on a failed page read?
> > >
> > > I have a hypothesis that I don't know how to verify.
> > >
> > > First the task does a load from the page and we put a read-only PTE in
> > > the page tables. Then it writes to the page using write(). The page
> > > gets written back, but hits an error in iomap_writepage_map()
> > > which calls ClearPageUptodate(). Then the task with it mapped attempts
> > > to store to it.
> >
> > Sure, but that's not really what I was asking: why isn't this
> > !uptodate state caught before the page fault code calls
> > ->page_mkwrite? The page fault code has a reference to the page,
> > after all, and in a couple of paths it even has the page locked.
>
> If there's already a PTE present, then the page fault code doesn't
> check the uptodate bit. Here's the path I'm looking at:
>
> do_wp_page()
> -> vm_normal_page()
> -> wp_page_shared()
> -> do_page_mkwrite()
>
> I don't see anything in there that checked Uptodate.

Yup, exactly the code I was looking at when I asked this question.
The kernel has invalidated the contents of a page, yet we still have
it mapped into userspace as containing valid contents, and we don't
check it at all when userspace generates a protection fault on the
page?

> > What I'm trying to understand is why this needs to be fixed inside
> > ->page_mkwrite, because AFAICT if we have to fix this in the iomap
> > code, we have the same "we got handed a bad page by the page fault"
> > problem in every single ->page_mkwrite implementation....
> I think the iomap code is the only filesystem which clears PageUptodate
> on errors.

I don't think you looked very hard. A quick scan shows at least
btrfs, f2fs, hostfs, jffs2, reiserfs, vboxfs and anything using the
iomap path will call ClearPageUptodate() on a write IO error. Some
of them also call SetPageError(), too. The above fault path doesn't
check for that, either.

AFAICT, it doesn't check for any adverse page state at all that has
occurred since the last time the PTE changed state, and it doesn't
check the mapping for error state, either. It just hands
->page_mkwrite() a completely unchecked page.

Which, again, is kinda what I'm getting at here: why should all the
->page_mkwrite implementations be forced to check that the page
handed to them is in a valid state?

> I know we've argued about whether that's appropriate or not
> in the past.

Sure, but that's largely irrelevant here. The fact is that we are
taking a fault on a user page marked by the kernel with invalid
contents, and it's not being caught by the common page fault code
path.
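To be concrete about what I mean by "caught by the common page fault
code", here's a rough sketch of the kind of generic check I'm
suggesting - illustrative only, not a patch. The real do_page_mkwrite()
in mm/memory.c does more than this (fault flags, locking, etc.), and
whether SIGBUS is the right response is a separate question:

	static vm_fault_t do_page_mkwrite(struct vm_fault *vmf)
	{
		struct page *page = vmf->page;

		/*
		 * Sketch: if the page contents have been invalidated
		 * (e.g. ClearPageUptodate() after a failed writeback)
		 * or the page has an error flagged on it, fail the
		 * write fault here rather than handing the filesystem
		 * a page in a bad state.
		 */
		if (!PageUptodate(page) || PageError(page))
			return VM_FAULT_SIGBUS;

		return vmf->vma->vm_ops->page_mkwrite(vmf);
	}

Do a check like that once in the generic code and every
->page_mkwrite implementation at least gets handed the same,
already-checked page state.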
Ignore where the page contents are marked invalid, and instead ask
"what should a write fault do on a page with invalid contents?".
e.g. maybe it should actually read the contents of the page from
disk again, rather than have page_mkwrite fail and end up killing
the application?

Really, all I want is for all filesystems to have the same behaviour
in response to the same page state error conditions. Requiring every
filesystem to detect and handle the error deep in its IO path is not
going to result in consistent behaviour....

> > > I haven't dug through what generic/019 does, so I don't know how plausible
> > > this is.
> >
> > # Run fsstress and fio(dio/aio and mmap) and simulate disk failure
> > # check filesystem consistency at the end.
> >
> > I don't think it is mixing buffered writes and mmap writes on the
> > same file via fio. Maybe fsstress is triggering that, but the
> > fsstress workload is single threaded so, again, I'm not sure it will
> > do this.
>
> Maybe that's not how we end up with a read-only PTE in the process's
> page tables. Perhaps it starts out with a store, then on an fsync()
> we mark it read-only, then try to do another store.

True. Normal background writeback will mark pages clean, so it
probably doesn't need fsync at all.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx