On Fri 13-04-18 07:48:07, Matthew Wilcox wrote:
> On Tue, Apr 10, 2018 at 03:07:26PM -0700, Andres Freund wrote:
> > I don't think that's the full issue. We can deal with the fact that an
> > fsync failure is edge-triggered if there's a guarantee that every
> > process doing so would get it. The fact that one needs to have an FD
> > open from before any failing writes occurred to get a failure, *THAT'S*
> > the big issue.
> >
> > Beyond postgres, it's a pretty common approach to do work on a lot of
> > files without fsyncing, then iterate over the directory, fsync
> > everything, and *then* assume you're safe. But unless I severely
> > misunderstand something, that'd only be safe if you kept an FD for
> > every file open, which isn't realistic for pretty obvious reasons.
>
> While accepting that under memory pressure we can still evict the error
> indicators, we can do a better job than we do today. The current design
> of error reporting says that all errors which occurred before you opened
> the file descriptor are of no interest to you. I don't think that's
> necessarily true, and it's actually a change of behaviour from before
> the errseq work.
>
> Consider Stupid Task A which calls open(), write(), close(), and Smart
> Task B which calls open(), write(), fsync(), close(), operating on the
> same file. If A goes entirely before B and encounters an error, then
> before errseq_t, B would see the error from A's write.
>
> If A and B overlap, even a little bit, then B still gets to see A's
> error today. But if writeback happens for A's write before B opens the
> file, then B will never see the error.
>
> B doesn't want to see historical errors that a previous invocation of
> B has already handled, but we know whether *anyone* has seen the error
> or not. So here's a patch which restores the historical behaviour of
> seeing old unhandled errors on a fresh file descriptor:
>
> Signed-off-by: Matthew Wilcox <mawilcox@xxxxxxxxxxxxx>

So I agree with going back to the old semantics of reporting errors that
happened before a file was opened at least once to someone. As the PG case
shows, apps are indeed relying on the old behavior. As unreliable as it is,
it ends up doing the right thing for these apps in 99% of cases, and we
shouldn't break them (BTW, IMO the changelog should contain a note that
this fixes a regression for PostgreSQL, a reference to this thread, and a
CC to stable). Anyway, feel free to add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

Oh, and to make myself clear: I do think we need to find a better way of
reporting IO errors. I consider this just an immediate band-aid to avoid
userspace regressions.

								Honza

> diff --git a/lib/errseq.c b/lib/errseq.c
> index df782418b333..093f1fba4ee0 100644
> --- a/lib/errseq.c
> +++ b/lib/errseq.c
> @@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set);
>  errseq_t errseq_sample(errseq_t *eseq)
>  {
>  	errseq_t old = READ_ONCE(*eseq);
> -	errseq_t new = old;
>
> -	/*
> -	 * For the common case of no errors ever having been set, we can skip
> -	 * marking the SEEN bit. Once an error has been set, the value will
> -	 * never go back to zero.
> -	 */
> -	if (old != 0) {
> -		new |= ERRSEQ_SEEN;
> -		if (old != new)
> -			cmpxchg(eseq, old, new);
> -	}
> -	return new;
> +	/* If nobody has seen this error yet, then we can be the first. */
> +	if (!(old & ERRSEQ_SEEN))
> +		old = 0;
> +	return old;

--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
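
For illustration only, a minimal userspace sketch of the Task A / Task B
pattern discussed above (the file name "somefile", the data written and the
error handling are made up, not taken from the patch or from PostgreSQL):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* "Task A": dirty the page cache and close without fsync. */
	int fd = open("somefile", O_WRONLY | O_CREAT, 0644);
	if (fd < 0)
		return 1;
	if (write(fd, "data", 4) < 0)
		perror("write");
	close(fd);

	/*
	 * Writeback may run here and fail (say, the device returns EIO);
	 * with no descriptor open, nobody is around to collect the error.
	 */

	/* "Task B": open a fresh descriptor and fsync. */
	fd = open("somefile", O_WRONLY);
	if (fd < 0)
		return 1;
	if (fsync(fd) < 0)
		fprintf(stderr, "fsync: %s\n", strerror(errno));
	/*
	 * Before errseq_t, B saw A's error at this fsync().  With the patch
	 * above it sees it again as long as no earlier errseq_sample() has
	 * marked the error SEEN; without the patch the error is silently
	 * lost once writeback completes before B's open().
	 */
	close(fd);
	return 0;
}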