IO error semantics

Nick Piggin <npiggin@xxxxxxx> · Mon, 18 Jan 2010 17:05:18 +1100

On Mon, Jan 18, 2010 at 04:18:47PM +1100, Nick Piggin wrote:
> We also need to remove some ClearPageUptodate calls I think (similar
> issues), so keep those in mind too. Unfortunately it looks like there
> are also a lot of filesystem specific tests of PageUptodate... but you
> could also move those under the new compatibility s_flag.
> 
> I don't know of a really good way to inject and test filesystem errors.
> Make request failures causes most fs to quickly go readonly or have
> bigger problems. If you're careful like try to only fail read IOs for
> data, or only fail write IOs not involved in integrity or journal
> operations, then test programs just tend to abort pretty quickly. Does
> anyone know of anything more systematic?

This might be a good time to bring up IO error behaviour again. I got
into some debates I think on Andi's hwpoison thread a while back, but
probably not appropriate thread to find a real solution to this.

The problem we have now is that IO error semantics are not well defined.
It is hard to even enumerate all the issues.

read IOs
  how to retry? appropriate defaults should happen at the block layer I
  think. Should retry behaviour be tunable by the mm/fs, or should that
  be coded explicitly as submission retry loops? Either way does imply
  there is either similar defaults for all types (or maybe classes) of
  drivers, or some way to query/set this.

  It would be nice to be able to set fs/driver behaviour from userspace
  too, in a generic (not driver or fs specific way). But defaults should
  be reasonable and similar between all, I guess.

write IOs
  This is more interesting. How to handle write IO errors. In my opinion
  we must not invalidate the data before an IO error is returned to
  somebody (whether it be fsync or a synchronous write syscall). Any
  earlier and the app just gets RAW consistency randomly violated. And I
  think it is important to treat IO errors as transparently as possible
  until the error can be detected.

  I happen to think that actually we should go further and not
  invalidate the data at all. This makes implementation simpler, and
  also allows us to retry writes like we can retry reads. It's also
  problematic to throw out errors at that point because *sync syscalls
  coming from elsewhere could result in loss of error reporting (think,
  sys_sync).

  If we go this way, we probably need another syscall and fs helper call
  to invalidate the dirty data when we give up on retries. truncate_range
  probably not appropriate because it is much harder to implement and
  maybe we want to try to get at the most recent data that is on disk.

  Also do we need to think about O_SYNC or -o sync type of writes that
  are implemented via writeback cache? We could invalidate the dirtied
  cache ASAP, which would leave a window where a concurrent read can see
  first new, then old data. It would also kind of break the above scheme
  in case the pagecache was already dirty via a descriptor without
  O_SYNC. It might just make sense to leave the pagecache dirty. Either
  way it should be documented I think.

Do we even care enough to bother thinking about this now? (serious question)

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html