On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote: > On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote: > > 1. If we get an error while wbc->for_background is true, we should not clear > > uptodate on the page, rather SetPageError and SetPageDirty. > > So you're saying we should treat it as a transient error rather than > a permanent error. Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else. > > 2. Background writebacks should skip pages which are PageError. > > That seems decidedly dodgy in the case where there is a transient > error - it requires a user to specifically run sync to get the data > to disk after the transient error has occurred. Say they don't > notice the problem because it's fleeting and doesn't cause any > obvious problems? That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to). > e.g. XFS gets to enospc, runs out of reserve pool blocks so can't > allocate space to write back the page, then space is freed up a few > seconds later and so the next write will work just fine. > > This is a recipe for "I lost data that I wrote /days/ before the > system crashed" bug reports. So ... exponential backoff on retries? > > 3. for_sync writebacks should attempt one last write. Maybe it'll > > succeed this time. If it does, just ClearPageError. If not, we have > > somebody to report this writeback error to, and ClearPageUptodate. > > Which may well be unmount. Are we really going to wait until unmount > to report fatal errors? Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.