Re: fsync() errors is unsafe and risks data loss

Andreas Dilger <adilger@xxxxxxxxx> · Tue, 10 Apr 2018 13:44:48 -0600

On Apr 10, 2018, at 10:50 AM, Joshua D. Drake <jd@xxxxxxxxxxxxxxxxx> wrote:
> 
> -ext4,
> 
> If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:
> 
> https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@xxxxxxxxxxx
> 
> The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.

Yes, this is a very long thread.  The summary is that Postgres is unhappy
that fsync() on Linux (and also on other OSes) returns an error only once
if there was a prior write() failure, instead of keeping the dirty pages
in memory forever and trying to rewrite them.

This behaviour has existed on Linux forever, and (for better or worse) is
the only reasonable behaviour that the kernel can take.  I've argued for
the opposite behaviour at times, and some subsystems already do limited
retries before finally giving up on a failed write, though there are also
times when retrying at lower levels is pointless if a higher level of
code can handle the failure (e.g. mirrored block devices, filesystem data
mirroring, userspace data mirroring, or cross-node replication).

The confusion is whether fsync() is a "level" state (return error forever
if there were pages that could not be written), or an "edge" state (return
error only for any write failures since the previous fsync() call).
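
To make the distinction concrete, here is a toy model (not kernel code) of
the two interpretations.  ToyFile, writeback_fails() and report() are names
invented for this illustration; the only thing taken from the discussion is
that the "edge" variant reports a failure once and then clears it.

```python
class ToyFile:
    """Toy model of a file whose background writeback can fail."""

    def __init__(self, semantics):
        if semantics not in ("edge", "level"):
            raise ValueError("semantics must be 'edge' or 'level'")
        self.semantics = semantics
        self.error_pending = False  # a failure not yet reported to the caller
        self.ever_failed = False    # some page could not be written, ever

    def writeback_fails(self):
        """Simulate the kernel failing to write a dirty page."""
        self.error_pending = True
        self.ever_failed = True

    def fsync(self):
        """Return True on success, False on error."""
        if self.semantics == "level":
            # "level": keep reporting the error for as long as data was lost
            return not self.ever_failed
        # "edge": report failures since the previous fsync(), then clear them
        failed = self.error_pending
        self.error_pending = False
        return not failed


def report(semantics, calls=3):
    """Fail one writeback, then call fsync() several times in a row."""
    f = ToyFile(semantics)
    f.writeback_fails()
    return [f.fsync() for _ in range(calls)]
```

With one failed writeback, report("edge") gives one error followed by
successes, while report("level") keeps returning an error.  An application
that retries fsync() and treats the second result as "data is safe" is only
correct under the "level" interpretation.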

I think Anthony Iliopoulos was pretty clear in his multiple descriptions
in that thread of why the current behaviour is needed (OOM of the whole
system if dirty pages are kept around forever), but many others were stuck
on "I can't believe this is happening??? This is totally unacceptable and
every kernel needs to change to match my expectations!!!" without looking
at the larger picture of what is practical to change and where the issue
should best be fixed.

Regardless of why this is the case, the net result is that PG needs to
deal with all of the systems that currently exist that have this behaviour,
even if it may change some day in the future (though that is unlikely).  It seems
ironic that "keep dirty pages in userspace until fsync() returns success"
is totally unacceptable, but "keep dirty pages in the kernel" is fine.
My (limited) understanding of databases was that they preferred to cache
everything in userspace and use O_DIRECT to write to disk (which returns
an error immediately if the write fails and does not double buffer data).
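
As a rough sketch of what "keep dirty pages in userspace until fsync()
returns success" could look like, assuming plain buffered I/O rather than
O_DIRECT (which needs aligned buffers and is left out here): PendingWrites,
pwrite_all() and demo() are hypothetical names for this illustration, not
anything taken from PostgreSQL.

```python
import os
import tempfile


def pwrite_all(fd, data, offset):
    """Call os.pwrite() until every byte of data is handed to the kernel."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written


class PendingWrites:
    """Hypothetical helper: userspace copies of writes not yet known durable."""

    def __init__(self):
        self.pending = []  # list of (offset, bytes)

    def write(self, fd, offset, data):
        self.pending.append((offset, bytes(data)))  # keep our own copy first
        pwrite_all(fd, data, offset)

    def flush(self, fd):
        os.fsync(fd)          # raises OSError if the kernel reports a failure
        self.pending.clear()  # reached only after fsync() reported success

    def replay(self, fd):
        """Rewrite every retained buffer, e.g. to a file on another device."""
        for offset, data in self.pending:
            pwrite_all(fd, data, offset)


def demo():
    """Write two records to a temporary file and flush them."""
    writes = PendingWrites()
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        writes.write(fd, 0, b"hello ")
        writes.write(fd, 6, b"world")
        held_before_flush = len(writes.pending)
        writes.flush(fd)
        return held_before_flush, len(writes.pending), os.pread(fd, 11, 0)
```

If flush() raises, the buffers are still in writes.pending, so the
application can replay() them instead of relying on the kernel to still
hold the dirty pages.  Simply calling fsync() again would not be enough,
since the error is only reported once.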

Cheers, Andreas
