[LSF/MM TOPIC] I/O error handling and fsync()

Hi all,

when I mentioned to Christoph Hellwig the I/O error handling problem we
have in QEMU, especially with fsync(), he thought it would be a great
topic for LSF/MM, so here I am. This came up a few months ago on
qemu-devel [1] and we managed to ignore it for a while, but it's a real
and potentially serious problem, so I agree with Christoph that it
makes sense to get it discussed at LSF/MM.


At the heart of it is the semantics of fsync(). A few years ago, fsync()
was fixed to actually flush data to the disk, so we now have a defined
and useful meaning of fsync() as long as all your fsync() calls return
success.

However, as soon as one fsync() call fails, even if the root problem is
solved later (network connection restored, some space freed for thin
provisioned storage, etc.), the state we're in is mostly undefined. As
Ric Wheeler told me back in the qemu-devel discussion, when a writeout
fails, you get an fsync() error returned (once), but the kernel page
cache simply marks the respective page as clean and consequently won't
ever retry the writeout. Instead, it can evict it from the cache even
though it isn't actually consistent with the state on disk, which means
throwing away data that was written by some process.

So if you do another fsync() and it returns success, this currently
doesn't mean that all of the data you wrote is on disk; at best, it
covers only the data you wrote after the failed fsync(). This isn't
very helpful, to say the least, because you called fsync() in order to
get a consistent state on disk, and you still don't have one.

Essentially this means that once you get an fsync() failure, there is
no hope of recovery for the application, and it has to stop using the
file.


To give some context about my perspective as the maintainer for the QEMU
block subsystem: QEMU has a mode (which is usually enabled in
production) where I/O failure isn't communicated to the guest, which
would probably offline the filesystem, thinking its hard disk has died,
but instead QEMU pauses the VM and allows the administrator to resume
when the problem has been fixed. Often the problem is only temporary,
e.g. a network hiccup when a disk image is stored on NFS, so this is
quite a helpful approach.

When QEMU is told to resume the VM, the request is just resubmitted.
This works fine for read/write, but not so much for fsync, because after
the first failure all bets are off even if a subsequent fsync()
succeeds.

So this is the aspect that directly affects me, even though the problem
is much broader and by far doesn't only affect QEMU.


This leads to a few individual points to be discussed:

1. Fix the data corruption problem that follows from the current
   behaviour. Imagine the following scenario:

   Process A writes to some file, calls fsync() and gets a failure. The
   data it wrote is marked clean in the page cache even though it's
   inconsistent with the disk. Process A knows that fsync() failed, so
   maybe it can deal with it, at least by not using the file any more.

   Now process B opens the same file, reads the updated data that
   process A wrote, makes some additional changes based on that and
   calls fsync() again. Now fsync() returns success. The data written by
   B is on disk, but the data written by A isn't. Oops, this is data
   corruption, and process B doesn't even know about it because all its
   operations succeeded.

2. Define fsync() semantics that include the state after a failure (this
   probably goes a long way towards fixing 1.).

   The semantics that QEMU uses internally (and which it needs to map)
   is that after a successful flush, all writes to the disk image that
   have successfully completed before the flush was issued are stable on
   disk (no matter whether a previous flush failed).

   A possible adaptation to Linux, which considers that unlike QEMU
   images, files can be opened more than once, might be that a
   succeeding fsync() on a file descriptor means that all data that has
   been read or written through this file descriptor is consistent
   between the page cache and the disk (the read part is for avoiding
   the scenario from 1.; it means that fsync flushes data written on a
   different file descriptor if it has been seen by this one; hence, the
   page cache can't contain non-dirty pages which aren't consistent with
   the disk).

3. Actually make fsync() failure recoverable.

   You can implement 2. by making sure that a file descriptor for which
   pages have been thrown away always returns an error and never goes
   back to succeeding (it can't succeed according to the definition of 2.
   because the data that would have to be written out is gone). This is
   already a much better interface, but it doesn't really solve the
   actual problem we have.

   We also need to make sure that after a failed fsync() there is a
   chance to recover. This means that the pages shouldn't be thrown away
   immediately; but at the same time, you probably also don't want to
   keep pages indefinitely when there is a permanent writeout error.
   However, if we can make sure that these pages are only evicted in
   case of actual memory pressure, and only if there are no actually
   clean pages to evict, I think a lot would already be won.

   In the common case, you could then recover from a temporary failure,
   but if this state isn't maintainable, at least we get a consistent
   fsync() failure telling us that the data is gone.


I think I've summarised most aspects here, but if something is unclear
or you'd like to see some more context, please refer to the qemu-devel
discussion [1] that I mentioned, or feel free to just ask.

Thanks,
Kevin

[1] https://lists.gnu.org/archive/html/qemu-block/2016-04/msg00576.html