Re: fsync() errors is unsafe and risks data loss

Andres Freund <andres@xxxxxxxxxxx> · Wed, 11 Apr 2018 19:32:21 -0700

Hi,

On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:
> To pound the broken record: there are many good reasons why Linux
> filesystem developers have said "you should use direct IO" to the PG
> devs each time we have this "the kernel doesn't do <complex things
> PG needs>" discussion.

I personally am on board with doing that. But you also gotta recognize
that efficient DIO usage is a metric ton of work, and you need a
large amount of differing logic for different platforms. It's just not
realistic to do so for every platform.  Postgres is developed by a small
number of people, isn't VC backed etc. The amount of resources we can
throw at something is fairly limited.  I'm hoping to work on adding
Linux DIO support to pg, but I'm sure as hell not going to be able to
do the same on Windows (Solaris, HP-UX, AIX, ...).

And there are cases where DIO just doesn't help at all. Being able to
untar a database from backup / archive / timetravel / whatnot, and then
fsyncing the directory tree to make sure it's actually safe, is really
not an insane idea.  Or even just cp -r ing it, and then starting up a
copy of the database.  What you're saying is that none of that is doable
in a safe way, unless you use special-case, DIO-using tooling for the
whole operation (or at least tools that fsync carefully without ever
closing an fd, which certainly isn't the case for cp et al).
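
To be concrete about the pattern I mean, here's a minimal sketch of
"restore with ordinary tools, then fsync the directory tree". The helper
names are made up for illustration, and this is not what any PG tool
does verbatim. Note that it reopens every file after the copy has
already closed it, so it is only safe if the kernel reports an earlier
writeback failure to a freshly opened fd, which is exactly what's in
question in this thread.

```python
import os


def fsync_path(path, flags):
    """Open path, fsync it, close it. Raises OSError if the kernel reports a failure."""
    fd = os.open(path, flags)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def fsync_tree(root):
    """fsync every regular file and directory under root, children before parents.

    Returns the number of objects synced. Symlinks and special files are skipped.
    """
    count = 0
    # topdown=False yields each directory after its subdirectories,
    # so root itself is synced last.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            fsync_path(full, os.O_RDONLY)
            count += 1
        # Sync the directory itself so the entries created in it are durable.
        fsync_path(dirpath, os.O_RDONLY | os.O_DIRECTORY)
        count += 1
    return count
```

Something like fsync_tree("/path/to/restored/datadir") would be run
after the untar or cp -r and before starting the database on it.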

> In this case, robust IO error reporting is easy with DIO. It's one
> of the reasons most of the high performance database engines are
> either using or moving to non-blocking AIO+DIO (RWF_NOWAIT) and use
> O_DSYNC/RWF_DSYNC for integrity-critical IO dispatch. This is also
> being driven by the availability of high performance, high IOPS
> solid state storage where buffering in RAM to optimise IO patterns
> and throughput provides no real performance benefit.
> 
> Using the AIO+DIO infrastructure ensures errors are reported for the
> specific write that fails at failure time (i.e. in the aio
> completion event for the specific IO), yet high IO throughput can be
> maintained without the application needing its own threading
> infrastructure to prevent blocking.
> 
> This means the application doesn't have to guess where the write
> error occurred to retry/recover, have to handle async write errors
> on close(), have to use fsync() to gather write IO errors and then
> infer where the IO failure was, or require kernels on every
> supported platform to jump through hoops to try to do exactly the
> right thing in error conditions for everyone in all circumstances at
> all times....

Most of that sounds like a good thing to do, but you've got to recognize
that that's a lot of Linux-specific code.
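
For comparison, the portable piece of that is small. A minimal sketch,
with a made-up helper name: open with O_DSYNC so each write only returns
once the data has reached stable storage, which means an I/O error
should be raised by the write call for that specific write instead of
showing up at some later fsync(). It deliberately leaves out O_DIRECT
(with its buffer alignment requirements) and asynchronous submission,
and that's where the per-platform work is.

```python
import os


def durable_write(path, data):
    """Write data to path with O_DSYNC and return the number of bytes written.

    An I/O failure is raised as OSError by the os.write() call that hit it.
    The directory entry for a newly created file is NOT made durable here;
    that still needs an fsync of the containing directory.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    try:
        view = memoryview(data)
        written = 0
        # os.write may write fewer bytes than requested, so loop until done.
        while written < len(view):
            written += os.write(fd, view[written:])
    finally:
        os.close(fd)
    return written
```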

Greetings,

Andres Freund