On Fri, Nov 02, 2018 at 08:37:40AM +1100, Dave Chinner wrote: > On Thu, Nov 01, 2018 at 10:17:26AM -0400, Brian Foster wrote: > > On Thu, Nov 01, 2018 at 10:02:25AM +1100, Dave Chinner wrote: > > > On Wed, Oct 31, 2018 at 10:01:55AM -0400, Brian Foster wrote: > > > > xfs_do_writepage() currently returns errors directly regardless of > > > > whether it is called via ->writepages() or ->writepage(). In the > > > > case of ->writepages(), an xfs_do_writepage() error return breaks > > > > the current writeback sequence in write_cache_pages(). This means > > > > that an integrity writeback (i.e., sync), for example, returns > > > > before all associated pages have been processed. > > > > > > That sounds like a bug in write_cache_pages(). It sends pages > > > one at a time to the writepage context, and it is supposed to > > > iterate the entire range on a data sync operation. If you look at > > > the code, is clearly stopping at the first error. > > > > > > IOWs, every filesystem that uses write_cache_pages() for data > > > integrity writes is broken in the same way. > > > > > > And a quick look as fs specific writepages implementations indicates > > > that ext4_writepages() and ibtrfs's extent_write_cache_pages() have > > > the same issue. > > > > > > > I veered away from this approach initially because I didn't want to risk > > breaking behavior of other filesystems. I can revisit it if more > > appropriate and is broadly safe. > > We should fix problems in the generic code if that's where the > problem lies. You can ignore filesystems with their own custom code, > though, because you are more likely to break them that generic > code... > > > > > This can be problematic in cases like unmount. If the writeback > > > > doesn't process all delalloc pages before unmounting, we end up > > > > reclaiming inodes with non-zero delalloc block counts. In turn, this > > > > breaks block accounting and leaves the fs inconsistent. > > > > > > XFS is probably the only filesystem that leaves detectable state > > > around and then detects and reports it.... > > > > > > > XFS explicitly discards delalloc blocks on such writepage failures > > > > to avoid this problem. This isn't terribly useful if we allow an > > > > integrity writeback to complete (and thus a filesystem to unmount) > > > > without addressing the entire set of dirty pages on an inode. > > > > Therefore, change ->writepage[s]() to track high level error state > > > > in the xfs_writepage_ctx structure and return it from the higher > > > > level operation callout rather than xfs_do_writepage(). This ensures > > > > that write_cache_pages() does not exit prematurely when called via > > > > ->writepages(), but both ->writepage() and ->writepages() still > > > > ultimately return an error for the higher level operation. > > > > > > > > This patch introduces a subtle change in the behavior of background > > > > writeback in the event of persistent errors. The current behavior of > > > > returning an error preempts the background writeback. Writeback > > > > eventually comes around again and repeats the process for a few more > > > > pages (in practice) before it once again fails. This repeats over > > > > and over until the entire set of dirty pages is cleaned. This > > > > behavior results in a somewhat slower stream of "page discard" > > > > errors in the system log and dictates that many repeated fsync calls > > > > may be required before the entire data set is processed and mapping > > > > error consumed. With this change in place, background writeback > > > > executes on as many pages as necessary as if each page writeback > > > > were successful. The pages are cleaned immediately and significantly > > > > more page discard errors can be observed at once. > > > > > > Yeah, this is a good change in behaviour, but I think the > > > implementation is wrong. write_cache_pages() needs to continue > > > iterating the range if WB_SYNC_ALL is set even when errors occur. > > > > > > > The other thing that comes to mind is that any time we have this kind of > > iterative/callback interface, it's usually appropriate to have some kind > > of means for the callback to signal the caller to stop processing. You > > could argue that right now write_cache_pages() works as expected, > > returning an error is that stop mechanism and XFS has the bug as it is > > the only filesystem that depends on completely processing integrity wbs > > for fs correctness (or we're all just using it wrong ;P, I haven't > > investigated enough yet to say one way or the other). > > The comment on wcp error handling is enlightening: > > ret = (*writepage)(page, wbc, data); > if (unlikely(ret)) { > if (ret == AOP_WRITEPAGE_ACTIVATE) { > unlock_page(page); > ret = 0; > } else { > /* > * done_index is set past this page, > * so media errors will not choke > * background writeout for the entire > * file. This has consequences for > * range_cyclic semantics (ie. it may > * not be suitable for data integrity > * writeout). > */ > done_index = page->index + 1; > done = 1; > break; > } > } > > "ie it may not be suitable for data integrity writeout" > The way I read that comment is that the done_index bump may not be suitable for integrity writeback (re: the tradeoff being to not "choke" bg writeback), but the wording is not totally clear. > In a nutshell. The current error handling behaviour is not suitable > for data integrity writeout, as your proposed patch has demonstrated. > > > > OTOH, we could maintain the ability to interrupt the scan with a magic > > error return or some such, that just may require a bit more care and > > attention across the set of existing callers. Thoughts? > > I'm not sure why we'd want "stop on error" for data integrity > writeback given the problems you've pointed out. For WB_SYNC_NONE > writeback, the current behaviour (stop on error) is fine, so i don't > see a big need to make unnecessary API changes here. > Neither am I, for XFS at least, I just wasn't sure if other filesystems depended on or expected this behavior in any way. I'll take a more thorough look, run some broader testing and send a write_cache_pages() patch to a wider audience. BTW, it looks like we already have this AOP_WRITEPAGE_ positive_aop_returns enum for special return codes the callback can use to control the caller. I suppose we could add a new one if some fs does have a particular need to forcibly interrupt a writeback. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx