Re: fsync() errors is unsafe and risks data loss

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 12 Apr 2018 10:09:16 +1000

On Wed, Apr 11, 2018 at 03:52:44PM -0600, Andreas Dilger wrote:
> On Apr 10, 2018, at 4:07 PM, Andres Freund <andres@xxxxxxxxxxx> wrote:
> > 2018-04-10 18:43:56 Ted wrote:
> >> So for better or for worse, there has not been as much investment in
> >> buffered I/O and data robustness in the face of exception handling of
> >> storage devices.
> > 
> > That's a bit of a cop out. It's not just databases that care. Even more
> > basic tools like SCM, package managers and editors care whether they can
> > get proper responses back from fsync that imply things actually were synced.
> 
> Sure, but it is mostly PG that is doing (IMHO) crazy things like writing
> to thousands(?) of files, closing the file descriptors, then expecting
> fsync() on a newly-opened fd to return a historical error. 

Yeah, this seems like a recipe for disaster, especially on
cross-platform code where every OS platform behaves differently and
almost never to expectation.

And speaking of "behaving differently to expectations", nobody has
mentioned that close() can also return write errors. Hence if you do
write - close - open - fsync, the write error might get reported
on close, not fsync.  IOWs, the assumption that "async writeback
errors will persist across close to open" is fundamentally broken to
begin with. It's even documented as a silent data loss vector in
the close(2) man page:

$ man 2 close
.....
   Dealing with error returns from close()

	  A careful programmer will check the return value of
	  close(), since it is quite possible that errors on a
	  previous write(2) operation are reported only on the
	  final close() that releases the open file description.
	  Failing to check the return value when closing a file may
	  lead to silent loss of data.  This can especially be
	  observed with NFS and with disk quota.

Yeah, ensuring data integrity in the face of IO errors is a really
hard problem. :/
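
As a minimal sketch of what the man page excerpt above asks for (an
illustration only, not code from this thread; the helper name
write_file_checked is hypothetical): write the data, fsync() on the
same fd that did the writes, and check the return value of close()
as well, so an error cannot be dropped between a close and a later
re-open.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes from buf to path, checking every
 * step. Returns 0 on success or a negative errno value on failure.
 */
static int write_file_checked(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd, err;

    assert(path != NULL && buf != NULL);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing: retry */
            err = errno;
            close(fd);
            return -err;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Flush on the fd that issued the writes, before it is closed. */
    if (fsync(fd) < 0) {
        err = errno;
        close(fd);
        return -err;
    }

    /* close() may still report a deferred write error (e.g. NFS, quota). */
    if (close(fd) < 0)
        return -errno;

    return 0;
}
```

A caller would treat any non-zero return as "the data may not be on
stable storage" and rewrite from its own copy of the data rather than
re-opening the file and hoping a later fsync() reports the old error.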

----

To pound the broken record: there are many good reasons why Linux
filesystem developers have said "you should use direct IO" to the PG
devs each time we have this "the kernel doesn't do <complex things
PG needs>" discussion.

In this case, robust IO error reporting is easy with DIO. It's one
of the reasons most of the high performance database engines are
either using or moving to non-blocking AIO+DIO (RWF_NOWAIT) and using
O_DSYNC/RWF_DSYNC for integrity-critical IO dispatch. This is also
being driven by the availability of high performance, high IOPS
solid state storage where buffering in RAM to optimise IO patterns
and throughput provides no real performance benefit.

Using the AIO+DIO infrastructure ensures errors are reported for the
specific write that fails at failure time (i.e. in the aio
completion event for the specific IO), yet high IO throughput can be
maintained without the application needing its own threading
infrastructure to prevent blocking.

This means the application doesn't have to guess where the write
error occurred to retry/recover, have to handle async write errors
on close(), have to use fsync() to gather write IO errors and then
infer where the IO failure was, or require kernels on every
supported platform to jump through hoops to try to do exactly the
right thing in error conditions for everyone in all circumstances at
all times....
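
The per-IO error reporting described above can be sketched with the
raw Linux native AIO syscalls. This is an illustration under stated
assumptions, not anyone's production code: the helper name
aio_write_one is hypothetical, it submits a single write and waits
for it immediately (a real engine would keep many IOs in flight), and
the caller is assumed to have opened fd with O_DIRECT | O_DSYNC and
to pass a suitably aligned buffer.

```c
#include <assert.h>
#include <errno.h>
#include <linux/aio_abi.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: submit one write via Linux native AIO and return
 * the result carried by its completion event: the number of bytes
 * written, or a negative errno value for this specific IO.
 */
static long long aio_write_one(int fd, const void *buf, size_t len,
                               long long offset)
{
    aio_context_t ctx = 0;
    struct iocb cb;
    struct iocb *list[1] = { &cb };
    struct io_event ev;
    long long res;

    assert(buf != NULL && len > 0);

    if (syscall(SYS_io_setup, 8, &ctx) < 0)
        return -errno;

    memset(&cb, 0, sizeof(cb));
    cb.aio_data = (uint64_t)(uintptr_t)buf;   /* cookie echoed in ev.data */
    cb.aio_lio_opcode = IOCB_CMD_PWRITE;
    cb.aio_fildes = (uint32_t)fd;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = offset;

    if (syscall(SYS_io_submit, ctx, 1, list) != 1) {
        res = -errno;                 /* rejected at submission time */
        goto out;
    }
    if (syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) != 1) {
        res = -errno;
        goto out;
    }
    res = ev.res;                     /* per-IO result: bytes or -errno */
out:
    syscall(SYS_io_destroy, ctx);
    return res;
}
```

The point of the sketch is the ev.res field: the success or failure
arrives attached to the one IO that was submitted, identified by the
cookie in ev.data, so nothing has to be inferred later from fsync()
or close().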

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx