Re: fsync() errors is unsafe and risks data loss

"Theodore Y. Ts'o" <tytso@xxxxxxx> · Thu, 12 Apr 2018 17:52:52 -0400

On Thu, Apr 12, 2018 at 12:55:36PM -0700, Andres Freund wrote:
> 
> Any pointers to the underlying netlink mechanism? If we can force
> postgres to kill itself when such an error is detected (via a dedicated
> monitoring process), I'd personally be happy enough.  It'd be nicer if
> we could associate that knowledge with particular filesystems etc
> (which'd possibly be hard through dm etc?), but this'd be much better than
> nothing.

Yeah, sorry, it never got upstreamed.  It's not really all that
complicated, it was just that there were some other folks who wanted
to do something similar, and there was a round of bike-shedding
several years ago, and nothing ever went upstream.  Part of the
problem was that our original scheme sent up information about file
system-level corruption reports --- e.g., those stemming from calls to
ext4_error() --- and lots of people had different ideas about how to
get all of the possible information up in some structured format.
(Think something like uerf from Digital's OSF/1.)

We did something *really* simple/stupid.  We just sent essentially an
ASCII text string out the netlink socket.  That's because what we were
doing before was essentially scraping the output of dmesg
(e.g. /dev/kmsg).

Scraping dmesg is actually probably the simplest thing to do, and it
has the advantage that it will work even on ancient enterprise kernels
that PG users are likely to want to use.  So you will need to implement
the dmesg text scraper anyway, and that's probably good enough for most
use cases.
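
For illustration only, here is a minimal sketch of the reading half of
such a scraper.  It assumes the documented /dev/kmsg record layout
("<prio>,<seq>,<timestamp_us>,<flags>;<message>", one record per
read()), and the names KmsgRecord, parse_kmsg_record and follow_kmsg
are made up for this example.  Very old kernels that lack /dev/kmsg
would need to parse dmesg or syslog output instead.

```python
import os
from typing import Iterator, NamedTuple, Optional


class KmsgRecord(NamedTuple):
    level: int         # syslog severity, 0 (emerg) .. 7 (debug)
    facility: int      # syslog facility, 0 means the kernel itself
    seq: int           # kernel sequence number of the record
    timestamp_us: int  # microseconds on the kernel's monotonic clock
    message: str       # human-readable text of the record


def parse_kmsg_record(raw: str) -> Optional[KmsgRecord]:
    """Parse one record as returned by a single read() on /dev/kmsg."""
    # Continuation lines (KEY=value pairs) follow the first line; skip them.
    first_line = raw.split("\n", 1)[0]
    header, sep, message = first_line.partition(";")
    if not sep:
        return None
    fields = header.split(",")
    if len(fields) < 3:
        return None
    try:
        prio, seq, ts = int(fields[0]), int(fields[1]), int(fields[2])
    except ValueError:
        return None
    # The low 3 bits of the prefix are the severity, the rest the facility.
    return KmsgRecord(prio & 7, prio >> 3, seq, ts, message)


def follow_kmsg(path: str = "/dev/kmsg") -> Iterator[KmsgRecord]:
    """Yield records as they are logged (blocks; needs read permission)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            try:
                raw = os.read(fd, 8192)  # one read() returns one record
            except BrokenPipeError:
                # EPIPE: records were overwritten before we read them.
                # A real monitor should treat this as "may have missed
                # an error" rather than silently carrying on.
                continue
            rec = parse_kmsg_record(raw.decode("utf-8", "replace"))
            if rec is not None:
                yield rec
    finally:
        os.close(fd)
```

A monitoring process could loop over follow_kmsg() and hand each
record's message to whatever matching logic it uses.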

> The problem really isn't about *recovering* from disk errors. *Knowing*
> about them is the crucial part. We do not want to give back clients the
> information that an operation succeeded, when it actually didn't. There
> could be improvements above that, but as long as it's guaranteed that
> "we" get the error (rather than just some kernel log we don't have
> access to, which looks different due to config etc), it's ok. We can
> throw our hands up in the air and give up.

Right, it's a little challenging because the actual regexps you would
need to use do vary from device driver to device driver.  Fortunately
nearly everything is a SCSI/SATA device these days, so there isn't
_that_ much variability.
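
As a rough sketch of what that matching might look like: the patterns
below are illustrative examples of block-layer and ext4 messages, not
an exhaustive or authoritative list.  The exact wording depends on the
kernel version and driver, so check them against the kernels you
actually run.  IO_ERROR_PATTERNS and match_io_error are hypothetical
names used only for this example.

```python
import re
from typing import Optional

# Illustrative patterns only; message wording differs between kernel
# versions and drivers, so treat this list as a starting point.
IO_ERROR_PATTERNS = [
    # e.g. "blk_update_request: I/O error, dev sda, sector 2048"
    #      "end_request: I/O error, dev sda, sector 2048"
    re.compile(r"I/O error, dev (?P<dev>[^,\s]+), sector \d+"),
    # e.g. "Buffer I/O error on dev dm-0, logical block 12, lost async page write"
    #      "Buffer I/O error on device sda1, logical block 1234"
    re.compile(r"Buffer I/O error on dev(?:ice)? (?P<dev>[^,\s]+), logical block \d+"),
    # e.g. "EXT4-fs error (device sdb1): ext4_find_entry:1455: ..."
    re.compile(r"EXT4-fs error \(device (?P<dev>[^)]+)\)"),
]


def match_io_error(message: str) -> Optional[str]:
    """Return the device named in a recognised error message, else None."""
    for pattern in IO_ERROR_PATTERNS:
        found = pattern.search(message)
        if found:
            return found.group("dev")
    return None
```

Returning the device name leaves it to the caller to decide whether
that device backs the database's filesystem, which, as noted in the
quoted text, may be hard to work out through dm.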

> Yea, agreed on all that. I don't think anybody actually involved in
> postgres wants to do anything like that. Seems far outside of postgres'
> remit.

Some people on the pg-hackers list were talking about wanting to retry
the fsync() and hoping that would cause the write to somehow succeed.
It's *possible* that might help, but it's not likely to be helpful in
my experience.

Cheers,

						- Ted