Re: fsync() errors is unsafe and risks data loss

Andres Freund <andres@xxxxxxxxxxx> · Thu, 12 Apr 2018 15:03:59 -0700

Hi,

On 2018-04-12 17:52:52 -0400, Theodore Y. Ts'o wrote:
> We did something *really* simple/stupid.  We just sent essentially an
> ascii test string out the netlink socket.  That's because what we were
> doing before was essentially scraping the output of dmesg
> (e.g. /dev/kmsg).
> 
> That's actually probably the simplest thing to do, and it has the
> advantage that it will work even on ancient enterprise kernels that PG
> users are likely to want to use.  So you will need to implement the
> dmesg text scraper anyway, and that's probably good enough for most
> use cases.

The worst part of that is, as you mention below, needing to handle a lot
of different error message formats. I guess that's reasonable enough if
you control your hardware, but we have no such luck.

Aren't there quite realistic scenarios where one could miss kmsg style
messages due to it being a ringbuffer?
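
To make the concern concrete: as far as I understand /dev/kmsg, a reader
that falls behind gets EPIPE from the next read() once records have been
overwritten, so the loss is at least detectable, though the messages
themselves are gone. A minimal sketch of how a scraper might classify
each read() result; the enum and function names are made up for
illustration:

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical classification of one read() call on /dev/kmsg. */
enum kmsg_read_result
{
    KMSG_RECORD,    /* got one log record */
    KMSG_OVERRUN,   /* records were overwritten before we read them */
    KMSG_NO_DATA,   /* nothing to read right now, try again later */
    KMSG_ERROR      /* anything else */
};

/*
 * nread is the return value of read(), saved_errno is errno captured
 * immediately after the call.
 */
static enum kmsg_read_result
classify_kmsg_read(ssize_t nread, int saved_errno)
{
    if (nread > 0)
        return KMSG_RECORD;
    if (nread < 0 && saved_errno == EPIPE)
        return KMSG_OVERRUN;
    if (nread < 0 && (saved_errno == EAGAIN || saved_errno == EINTR))
        return KMSG_NO_DATA;
    return KMSG_ERROR;
}
```

On KMSG_OVERRUN the only safe assumption is that an I/O error message
might have been among the lost records.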

> Right, it's a little challenging because the actual regexp's you would
> need to use do vary from device driver to device driver.  Fortunately
> nearly everything is a SCSI/SATA device these days, so there isn't
> _that_ much variability.

There's also SAN / NAS type stuff - not all of that presents as a
SCSI/SATA device, right?
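
For illustration, the scraping side would presumably look something
like the sketch below, using POSIX regexes. The pattern is only an
assumption covering a couple of common block-layer wordings; a real
implementation would need a per-driver list, which is exactly the
problem:

```c
#include <regex.h>
#include <stddef.h>

/* Illustrative pattern only; real drivers word their errors differently. */
static const char *const io_error_pattern = "(I/O error|medium error)";

/*
 * Returns 1 if the kernel log line looks like an I/O error, 0 if it
 * does not, and -1 if the pattern fails to compile.
 */
static int
line_looks_like_io_error(const char *line)
{
    regex_t re;
    int     rc;

    if (regcomp(&re, io_error_pattern,
                REG_EXTENDED | REG_NOSUB | REG_ICASE) != 0)
        return -1;

    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);

    return rc == 0 ? 1 : 0;
}
```

Anything that doesn't report through the block layer in one of the
expected formats would simply not match.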

> > Yea, agreed on all that. I don't think anybody actually involved in
> > postgres wants to do anything like that. Seems far outside of postgres'
> > remit.
> 
> Some people on the pg-hackers list were talking about wanting to retry
> the fsync() and hoping that would cause the write to somehow succeed.
> It's *possible* that might help, but it's not likely to be helpful in
> my experience.

Depends on the type of error and storage. ENOSPC, especially over NFS,
has some reasonable chances of being cleared up. And for networked block
storage it's also not impossible to think of scenarios where that'd
work for EIO.

But I think that, besides the hope of the error clearing up by itself,
retrying has the advantage that it can trivially give *some* feedback to
the user. The user will get back strerror(ENOSPC) with some decent SQL
error code, which will hopefully cause them to investigate (well, once
monitoring detects high error rates). It's much nicer for the user to
type COMMIT; and get an appropriate error back, than if the database
just commits suicide.
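
Roughly what I have in mind, as a standalone sketch rather than actual
postgres code: the helper names and the message format are made up, and
the SQLSTATE values are the existing disk_full (53100), io_error (58030)
and internal_error (XX000) codes.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: pick a SQLSTATE for a failed fsync(). */
static const char *
sqlstate_for_fsync_errno(int err)
{
    switch (err)
    {
        case ENOSPC:
            return "53100";     /* disk_full */
        case EIO:
            return "58030";     /* io_error */
        default:
            return "XX000";     /* internal_error */
    }
}

/*
 * Hypothetical helper: build the message a client would see.
 * Returns the snprintf() result, i.e. the length the full message needs.
 */
static int
format_fsync_failure(char *buf, size_t buflen, const char *path, int err)
{
    return snprintf(buf, buflen,
                    "ERROR: %s: could not fsync file \"%s\": %s",
                    sqlstate_for_fsync_errno(err), path, strerror(err));
}
```

The point is just that the errno survives all the way to the client,
instead of only showing up in the server log after a crash.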

Greetings,

Andres Freund