On Wed, Oct 31, 2018 at 10:01:55AM -0400, Brian Foster wrote: > xfs_do_writepage() currently returns errors directly regardless of > whether it is called via ->writepages() or ->writepage(). In the > case of ->writepages(), an xfs_do_writepage() error return breaks > the current writeback sequence in write_cache_pages(). This means > that an integrity writeback (i.e., sync), for example, returns > before all associated pages have been processed. That sounds like a bug in write_cache_pages(). It sends pages one at a time to the writepage context, and it is supposed to iterate the entire range on a data sync operation. If you look at the code, is clearly stopping at the first error. IOWs, every filesystem that uses write_cache_pages() for data integrity writes is broken in the same way. And a quick look as fs specific writepages implementations indicates that ext4_writepages() and ibtrfs's extent_write_cache_pages() have the same issue. > This can be problematic in cases like unmount. If the writeback > doesn't process all delalloc pages before unmounting, we end up > reclaiming inodes with non-zero delalloc block counts. In turn, this > breaks block accounting and leaves the fs inconsistent. XFS is probably the only filesystem that leaves detectable state around and then detects and reports it.... > XFS explicitly discards delalloc blocks on such writepage failures > to avoid this problem. This isn't terribly useful if we allow an > integrity writeback to complete (and thus a filesystem to unmount) > without addressing the entire set of dirty pages on an inode. > Therefore, change ->writepage[s]() to track high level error state > in the xfs_writepage_ctx structure and return it from the higher > level operation callout rather than xfs_do_writepage(). This ensures > that write_cache_pages() does not exit prematurely when called via > ->writepages(), but both ->writepage() and ->writepages() still > ultimately return an error for the higher level operation. > > This patch introduces a subtle change in the behavior of background > writeback in the event of persistent errors. The current behavior of > returning an error preempts the background writeback. Writeback > eventually comes around again and repeats the process for a few more > pages (in practice) before it once again fails. This repeats over > and over until the entire set of dirty pages is cleaned. This > behavior results in a somewhat slower stream of "page discard" > errors in the system log and dictates that many repeated fsync calls > may be required before the entire data set is processed and mapping > error consumed. With this change in place, background writeback > executes on as many pages as necessary as if each page writeback > were successful. The pages are cleaned immediately and significantly > more page discard errors can be observed at once. Yeah, this is a good change in behaviour, but I think the implementation is wrong. write_cache_pages() needs to continue iterating the range if WB_SYNC_ALL is set even when errors occur. i.e. the error state should be maintained by write_cache_pages and returned on completion, not require the filesystem to hide errors from wcp in it's own specific writepage structure... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx