Re: fsync() errors is unsafe and risks data loss

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Fri, 13 Apr 2018 07:48:07 -0700

On Tue, Apr 10, 2018 at 03:07:26PM -0700, Andres Freund wrote:
> I don't think that's the full issue. We can deal with the fact that an
> fsync failure is edge-triggered if there's a guarantee that every
> process doing so would get it.  The fact that one needs to have an FD
> open from before any failing writes occurred to get a failure, *THAT'S*
> the big issue.
> 
> Beyond postgres, it's a pretty common approach to do work on a lot of
> files without fsyncing, then iterate over the directory fsync
> everything, and *then* assume you're safe. But unless I severaly
> misunderstand something that'd only be safe if you kept an FD for every
> file open, which isn't realistic for pretty obvious reasons.

While accepting that under memory pressure we can still evict the error
indicators, we can do a better job than we do today.  The current design
of error reporting says that all errors which occurred before you opened
the file descriptor are of no interest to you.  I don't think that's
necessarily true, and it's actually a change of behaviour from before
the errseq work.

Consider Stupid Task A which calls open(), write(), close(), and Smart
Task B which calls open(), write(), fsync(), close() operating on the
same file.  If A goes entirely before B and encounters an error, before
errseq_t, B would see the error from A's write.

If A and B overlap, even a little bit, then B still gets to see A's
error today.  But if writeback happens for A's write before B opens the
file then B will never see the error.

B doesn't want to see historical errors that a previous invocation of
B has already handled, but we know whether *anyone* has seen the error
or not.  So here's a patch which restores the historical behaviour of
seeing old unhandled errors on a fresh file descriptor:

Signed-off-by: Matthew Wilcox <mawilcox@xxxxxxxxxxxxx>

diff --git a/lib/errseq.c b/lib/errseq.c
index df782418b333..093f1fba4ee0 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set);
 errseq_t errseq_sample(errseq_t *eseq)
 {
 	errseq_t old = READ_ONCE(*eseq);
-	errseq_t new = old;
 
-	/*
-	 * For the common case of no errors ever having been set, we can skip
-	 * marking the SEEN bit. Once an error has been set, the value will
-	 * never go back to zero.
-	 */
-	if (old != 0) {
-		new |= ERRSEQ_SEEN;
-		if (old != new)
-			cmpxchg(eseq, old, new);
-	}
-	return new;
+	/* If nobody has seen this error yet, then we can be the first. */
+	if (!(old & ERRSEQ_SEEN))
+		old = 0;
+	return old;
 }
 EXPORT_SYMBOL(errseq_sample);