On Tue, Apr 10, 2018 at 03:07:26PM -0700, Andres Freund wrote: > I don't think that's the full issue. We can deal with the fact that an > fsync failure is edge-triggered if there's a guarantee that every > process doing so would get it. The fact that one needs to have an FD > open from before any failing writes occurred to get a failure, *THAT'S* > the big issue. > > Beyond postgres, it's a pretty common approach to do work on a lot of > files without fsyncing, then iterate over the directory fsync > everything, and *then* assume you're safe. But unless I severaly > misunderstand something that'd only be safe if you kept an FD for every > file open, which isn't realistic for pretty obvious reasons. While accepting that under memory pressure we can still evict the error indicators, we can do a better job than we do today. The current design of error reporting says that all errors which occurred before you opened the file descriptor are of no interest to you. I don't think that's necessarily true, and it's actually a change of behaviour from before the errseq work. Consider Stupid Task A which calls open(), write(), close(), and Smart Task B which calls open(), write(), fsync(), close() operating on the same file. If A goes entirely before B and encounters an error, before errseq_t, B would see the error from A's write. If A and B overlap, even a little bit, then B still gets to see A's error today. But if writeback happens for A's write before B opens the file then B will never see the error. B doesn't want to see historical errors that a previous invocation of B has already handled, but we know whether *anyone* has seen the error or not. So here's a patch which restores the historical behaviour of seeing old unhandled errors on a fresh file descriptor: Signed-off-by: Matthew Wilcox <mawilcox@xxxxxxxxxxxxx> diff --git a/lib/errseq.c b/lib/errseq.c index df782418b333..093f1fba4ee0 100644 --- a/lib/errseq.c +++ b/lib/errseq.c @@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set); errseq_t errseq_sample(errseq_t *eseq) { errseq_t old = READ_ONCE(*eseq); - errseq_t new = old; - /* - * For the common case of no errors ever having been set, we can skip - * marking the SEEN bit. Once an error has been set, the value will - * never go back to zero. - */ - if (old != 0) { - new |= ERRSEQ_SEEN; - if (old != new) - cmpxchg(eseq, old, new); - } - return new; + /* If nobody has seen this error yet, then we can be the first. */ + if (!(old & ERRSEQ_SEEN)) + old = 0; + return old; } EXPORT_SYMBOL(errseq_sample);