On Wed, 13 Feb 2019, Dan van der Ster wrote:
> On Wed, Feb 13, 2019 at 5:32 PM Brett Niver <bniver@xxxxxxxxxx> wrote:
> >
> > point of my question - similar to many other async error reporting
> > systems - you need to do another fsync -after- you've written
> > bufferedIO and done the initial fsync that you hope worked, to find
> > out if it actually worked.
> > right?

You only need one fsync(2) after you've done your buffered writes to
ensure the writes are stable and discover any errors.  A second fsync(2)
call will be a no-op unless more data has been written in the meantime.

> One of the main issues they found is that retrying fsync has
> undefined/misunderstood behaviour.
> In particular:
>
> Linux < 4.13: fsync() errors can be lost in various ways; also buffers
> are marked clean after errors, so retrying fsync() can falsely report
> success and the modified buffer can be thrown away at any time due to
> memory pressure

Right.  IIUC the moral of the story is you need to check the return
value on *every* call to fsync(2).

sage

>
> -- Dan
>
> >
> >
> > On Tue, Feb 12, 2019 at 6:04 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > >
> > > I'm not sure I understand your question.
> > >
> > > If you're doing buffered I/O (i.e., you didn't open with O_DIRECT) then
> > > any writes done into the kernel may be cached indefinitely, until a
> > > subsequent, successful fsync() is done. At that point they have been
> > > committed to stable storage (the definition of which varies depending on
> > > the backend, of course).
> > >
> > > Most of the confusion around this is centered around what happens when
> > > fsync _fails_. That implies that you wrote data into the pagecache and
> > > it failed writeback.
> > >
> > > POSIX is notoriously vague about what happens to the data you wrote at
> > > that point, and a lot of applications were written with incorrect
> > > assumptions about this situation.
> > >
> > > We haven't really changed fsync semantics with any of this. The main
> > > change is that we've tightened up error reporting by the kernel in this
> > > situation, and attempted to clarify what applications can expect in the
> > > face of a failed fsync.
> > >
> > > -- Jeff
> > >
> > > On Tue, 2019-02-12 at 17:51 -0500, Brett Niver wrote:
> > > > does that imply you have to do another fsync after any fsync after any
> > > > write in any cached mode?
> > > >
> > > > On Tue, Feb 12, 2019 at 4:36 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > On Mon, 2019-02-11 at 11:09 -0800, Gregory Farnum wrote:
> > > > > > On Mon, Feb 11, 2019 at 6:35 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > > > > > > On Mon, 11 Feb 2019, Jeff Layton wrote:
> > > > > > > > On Mon, 2019-02-11 at 09:22 +0100, Dan van der Ster wrote:
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > Does anyone know if ceph and level/rocksdb are already immune to these
> > > > > > > > > fsync issues discovered by the postgresql devs?
> > > > > > > > >
> > > > > > > > > https://fosdem.org/2019/schedule/event/postgresql_fsync/
> > > > > > > > > https://wiki.postgresql.org/wiki/Fsync_Errors
> > > > > > > > > https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
> > > > > > > > >
> > > > > > > > > Cheers, Dan
> > > > > > > >
> > > > > > > > Great question. I took a brief look at the rocksdb code but wasn't able
> > > > > > > > to draw a meaningful conclusion there.
> > > > > > > >
> > > > > > > > I do see that you can set it up to use O_DIRECT, but it's not clear to
> > > > > > > > me that propagates fsync errors in a meaningful way if you don't. I'm
> > > > > > > > also not sure how ceph configures rocksdb to operate here either.
> > > > > > > >
> > > > > > > > I think it'd be good to reach out to the rocksdb developers and see
> > > > > > > > whether they've considered its behavior in the face of a writeback
> > > > > > > > failure. I'm happy to discuss with them if they have questions about the
> > > > > > > > kernel's behavior.
> > > > > > >
> > > > > > > Looking at the filestore code, I see that WBThrottle isn't checking the
> > > > > > > fsync(2) return value!  That's an easy fix (we should assert/panic).
> > > > > > > Opened
> > > > > >
> > > > > > It's not just WBThrottle; the main FileStore code also doesn't check
> > > > > > when doing init() (not so important) or the replay guards (VERY BAD).
> > > > > >
> > > > > > I remember MANY conversations about not only the oddities of
> > > > > > fdatasync-on-xfs-with-weird-options but also fsync in particular (and
> > > > > > I was most surprised by the improvements the kernel developers have
> > > > > > been working on in what I skimmed), so I'm surprised we seem to have a
> > > > > > vulnerability there... Maybe some of the "improvements" made it worse
> > > > > > on us since we wrote those? :/
> > > > >
> > > > > I don't think it has made anything worse.
> > > > >
> > > > > What has changed recently is just the realization that these sorts of
> > > > > errors do occur and that the old Linux kernel code that tracked these
> > > > > errors was horribly unreliable. Writeback could suffer a transient
> > > > > failure and you might just never know it.
> > > > >
> > > > > What we have now in the kernel is much more robust error reporting such
> > > > > that when pagecache writeback does fail, you reliably get an error
> > > > > on a subsequent fsync. Out of this, we've all come to the realization
> > > > > that a lot of userland programs have sloppy or misunderstood handling of
> > > > > errors from fsync.
> > > > >
> > > > > This problem hasn't really gotten worse or anything. The underlying
> > > > > storage is just as (un)reliable as ever. We're just giving more scrutiny
> > > > > to userland applications and how they handle these sorts of errors.
> > > > >
> > > > > > On Mon, Feb 11, 2019 at 7:16 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > > > I think it's possible to keep the rocksdb on a normal filesystem though,
> > > > > > > and it's not clear to me what happens with rocksdb in that case. If
> > > > > > > writing to the WAL fails, then it looks like the caller will get an
> > > > > > > error back at that point (e.g. on rocksdb::Put or the like).
> > > > > > >
> > > > > > > What I'm currently unclear on is what happens when the background flush
> > > > > > > runs to sync the WAL out to the datafiles. If you hit an error at that
> > > > > > > point, it looks like it'll just log the error, sleep for a bit and
> > > > > > > then...I can't quite tell if it'll retry to write out the WALs that
> > > > > > > failed, or whether it just sort of moves on at that point.
> > > > > > >
> > > > > > > I would just cc the rocksdb devs here, but they use facebook for this,
> > > > > > > so I'm not sure how best to loop them into the discussion.
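
[A concrete illustration of the error paths Jeff describes above: a minimal
sketch of a stand-alone RocksDB instance on a regular filesystem in which
every returned rocksdb::Status is checked.  This is not how BlueStore embeds
rocksdb, and the database path is a hypothetical example; the point is only
where the status checks go.  Setting WriteOptions::sync makes RocksDB sync
the WAL as part of the write, so an I/O error surfaces to the caller on
Put() rather than only in a background flush, and an explicit Flush() gives
the application a status to check for the memtable-to-SST step whose
background failure handling is the open question above.]

// Sketch: stand-alone RocksDB usage that never ignores a rocksdb::Status.
#include <iostream>
#include <rocksdb/db.h>

int main()
{
    rocksdb::DB *db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;

    // Hypothetical path for a test database on a regular filesystem.
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb-fsync-demo", &db);
    if (!s.ok()) {
        std::cerr << "Open failed: " << s.ToString() << std::endl;
        return 1;
    }

    rocksdb::WriteOptions wopts;
    wopts.sync = true;   // sync the WAL as part of this write
    s = db->Put(wopts, "key1", "value1");
    if (!s.ok()) {
        // A WAL write/sync failure is reported to the caller right here.
        std::cerr << "Put failed: " << s.ToString() << std::endl;
        delete db;
        return 1;
    }

    // Force the memtable flush to SST files and check the result, rather
    // than relying solely on the background flush.
    s = db->Flush(rocksdb::FlushOptions());
    if (!s.ok()) {
        std::cerr << "Flush failed: " << s.ToString() << std::endl;
        delete db;
        return 1;
    }

    delete db;
    return 0;
}

[Whether Ceph configures its embedded rocksdb this way is exactly what the
thread leaves open; the sketch only shows the points at which an application
can observe these errors.]
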
> > > > > > While you can of course use rocksdb on a normal FS in a generic
> > > > > > situation, I don't think any of our deployment tools for BlueStore
> > > > > > support it and you can't do so with FileStore or you have other
> > > > > > problems, so that's happily not a worry for us.
> > > > > > -Greg
> > > > > >
> > > > > --
> > > > > Jeff Layton <jlayton@xxxxxxxxxx>
> > >
> > > --
> > > Jeff Layton <jlayton@xxxxxxxxxx>
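
[Sage's advice at the top of the thread, namely to check the return value of
every fsync(2) call and to treat a reported failure as lost data rather than
something a retry can repair, looks roughly like the following in application
code.  This is a minimal sketch using plain POSIX calls, with a hypothetical
file path; the assert/panic fix Sage mentions for WBThrottle applies the same
treat-it-as-fatal idea inside FileStore.]

// Sketch: buffered write followed by one fsync(2), with every return value
// checked and an fsync failure treated as data loss (no blind retry).
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

int main()
{
    const char *path = "/tmp/fsync-demo.dat";   // hypothetical test file
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    const char buf[] = "some buffered data\n";
    if (write(fd, buf, sizeof(buf) - 1) != (ssize_t)(sizeof(buf) - 1)) {
        perror("write");        // a short or failed write is handled up front
        close(fd);
        return 1;
    }

    // One fsync after the buffered writes makes them stable and surfaces any
    // writeback error; a second fsync with no new writes adds nothing.  What
    // matters is that the return value is never ignored.
    if (fsync(fd) < 0) {
        fprintf(stderr, "fsync failed: %s, assuming the data is lost\n",
                strerror(errno));
        // Do not retry fsync() and trust a later success; either abort or
        // re-write the data from an application-level copy.
        close(fd);
        return 1;
    }

    if (close(fd) < 0) {        // close(2) can also report deferred errors
        perror("close");
        return 1;
    }
    return 0;
}

[As Dan quoted, on kernels before 4.13 a retried fsync() could even report
success because the dirty buffers had already been marked clean, which is why
the error is handled here by giving up on the cached copy instead of looping
on fsync().]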