Re: fsyncgate and ceph

I'm not sure I understand your question.

If you're doing buffered I/O (i.e., you didn't open with O_DIRECT) then
any writes done into the kernel may be cached indefinitely, until a
subsequent, successful fsync() is done. At that point they have been
committed to stable storage (the definition of which varies depending on
the backend, of course).
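
To make that concrete, here's a minimal sketch of the buffered-write-then-
fsync pattern with the return values actually checked (illustrative only;
write_durably is just a made-up name, not code from ceph or the kernel):

    #include <fcntl.h>
    #include <unistd.h>
    #include <cerrno>
    #include <cstdio>

    // Buffered write: write(2) only dirties the pagecache.  The data is
    // only known to be on stable storage once a later fsync(2) returns 0.
    int write_durably(const char *path, const char *buf, size_t len)
    {
            int fd = open(path, O_WRONLY | O_CREAT, 0644);
            if (fd < 0) {
                    perror("open");
                    return -1;
            }
            if (write(fd, buf, len) != (ssize_t)len) {
                    perror("write");        // (a real caller would also
                    close(fd);              //  handle short writes)
                    return -1;
            }
            if (fsync(fd) < 0) {
                    perror("fsync");        // writeback errors show up here
                    close(fd);
                    return -1;
            }
            return close(fd);
    }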

Most of the confusion here centers on what happens when fsync _fails_,
i.e., you wrote data into the pagecache and writeback of that data
failed.

POSIX is notoriously vague about what happens to the data you wrote at
that point, and a lot of applications were written with incorrect
assumptions about this situation.

We haven't really changed fsync semantics with any of this. The main
change is that we've tightened up error reporting by the kernel in this
situation, and attempted to clarify what applications can expect in the
face of a failed fsync.
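
As a rough illustration of the sort of handling we're trying to steer
applications toward (one conservative pattern, similar to what postgres
ended up adopting; not something the kernel mandates):

    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    // If fsync fails, do NOT just call fsync again and hope.  The kernel
    // may have already marked the dirty pages clean (or dropped them), so
    // a later fsync can return 0 without the data ever having reached
    // stable storage.  Crash and recover from your own journal/WAL instead.
    void fsync_or_die(int fd, const char *what)
    {
            if (fsync(fd) < 0) {
                    perror(what);
                    abort();        // let crash recovery replay the data
            }
    }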

-- Jeff

On Tue, 2019-02-12 at 17:51 -0500, Brett Niver wrote:
> does that imply you have to do another fsync, after any fsync after any
> write, in any cached mode?
> 
> On Tue, Feb 12, 2019 at 4:36 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > On Mon, 2019-02-11 at 11:09 -0800, Gregory Farnum wrote:
> > > On Mon, Feb 11, 2019 at 6:35 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > > > On Mon, 11 Feb 2019, Jeff Layton wrote:
> > > > > On Mon, 2019-02-11 at 09:22 +0100, Dan van der Ster wrote:
> > > > > > Hi all,
> > > > > > 
> > > > > > Does anyone know if ceph and level/rocksdb are already immune to these
> > > > > > fsync issues discovered by the postgresql devs?
> > > > > > 
> > > > > >     https://fosdem.org/2019/schedule/event/postgresql_fsync/
> > > > > >     https://wiki.postgresql.org/wiki/Fsync_Errors
> > > > > >     https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
> > > > > > 
> > > > > > Cheers, Dan
> > > > > 
> > > > > Great question. I took a brief look at the rocksdb code but wasn't able
> > > > > to draw a meaningful conclusion there.
> > > > > 
> > > > > I do see that you can set it up to use O_DIRECT, but it's not clear to
> > > > > me that it propagates fsync errors in a meaningful way if you don't. I'm
> > > > > also not sure how ceph configures rocksdb to operate here.
> > > > > 
> > > > > I think it'd be good to reach out to the rocksdb developers and see
> > > > > whether they've considered its behavior in the face of a writeback
> > > > > failure. I'm happy to discuss with them if they have questions about the
> > > > > kernel's behavior.
> > > > 
> > > > Looking at the filestore code, I see that WBThrottle isn't checking the
> > > > fsync(2) return value!  That's an easy fix (we should assert/panic).
> > > > Opened
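> > > >
> > > > Something like this, roughly (illustrative sketch only -- not the
> > > > actual WBThrottle code, and the function name is made up):
> > > >
> > > >     #include <unistd.h>
> > > >     #include <cerrno>
> > > >     #include <cstring>
> > > >     #include <cstdio>
> > > >     #include <cstdlib>
> > > >
> > > >     // Don't drop the fsync(2) return value on the floor: if
> > > >     // writeback failed we can't trust what's on disk, so panic
> > > >     // instead of acking the op.
> > > >     static void wbthrottle_fsync(int fd)
> > > >     {
> > > >             if (::fsync(fd) < 0) {
> > > >                     fprintf(stderr, "fsync: %s\n", strerror(errno));
> > > >                     abort();
> > > >             }
> > > >     }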
> > > 
> > > It's not just WBThrottle; the main FileStore code also doesn't check the
> > > fsync return when doing init() (not so important) or the replay guards
> > > (VERY BAD).
> > > 
> > > I remember MANY conversations about not only the oddities of
> > > fdatasync-on-xfs-with-weird-options but also fsync in particular (and
> > > I was most surprised by the improvements the kernel developers have
> > > been working on, from what I skimmed), so I'm surprised we seem to have
> > > a vulnerability there... Maybe some of the "improvements" made it worse
> > > for us since we wrote that code? :/
> > > 
> > 
> > I don't think it has made anything worse.
> > 
> > What has changed recently is just the realization that these sorts of
> > errors do occur and that the old Linux kernel code that tracked these
> > errors was horribly unreliable. Writeback could suffer a transient
> > failure and you might just never know it.
> > 
> > What we have now in the kernel is much more robust error reporting, such
> > that when pagecache writeback does fail, you reliably get an error on a
> > subsequent fsync. Out of this, we've all come to the realization that a
> > lot of userland programs have sloppy or misunderstood handling of errors
> > from fsync.
> > 
> > This problem hasn't really gotten worse or anything. The underlying
> > storage is just as (un)reliable as ever. We're just giving more scrutiny
> > to userland applications and how they handle these sorts of errors.
> > 
> > > On Mon, Feb 11, 2019 at 7:16 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > I think it's possible to keep rocksdb on a normal filesystem though,
> > > > and it's not clear to me what happens with rocksdb in that case. If
> > > > writing to the WAL fails, then it looks like the caller will get an
> > > > error back at that point (e.g. on rocksdb::Put or the like).
> > > > 
> > > > What I'm currently unclear on is what happens when the background flush
> > > > runs to sync the WAL out to the datafiles. If you hit an error at that
> > > > point, it looks like it'll just log the error, sleep for a bit and
> > > > then... I can't quite tell if it'll retry writing out the WALs that
> > > > failed, or whether it just sort of moves on at that point.
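> > > >
> > > > For reference, this is roughly the error checking I'd expect on the
> > > > caller side, at least for the WAL path -- a hand-wavy sketch against
> > > > the public rocksdb API, not how ceph actually wires it up, and
> > > > put_and_flush is just a made-up name:
> > > >
> > > >     #include <rocksdb/db.h>
> > > >     #include <iostream>
> > > >     #include <string>
> > > >
> > > >     // With sync=true the Put includes the WAL fsync, so a WAL
> > > >     // writeback failure should surface in this Status.  Whether a
> > > >     // failure in the *background* flush ever gets re-reported to
> > > >     // the caller is the part I can't tell.
> > > >     bool put_and_flush(rocksdb::DB *db, const std::string &k,
> > > >                        const std::string &v)
> > > >     {
> > > >             rocksdb::WriteOptions wo;
> > > >             wo.sync = true;
> > > >             rocksdb::Status s = db->Put(wo, k, v);
> > > >             if (!s.ok()) {
> > > >                     std::cerr << "Put: " << s.ToString() << "\n";
> > > >                     return false;
> > > >             }
> > > >             // Forcing a foreground flush at least gives us a Status
> > > >             // to check instead of relying on the background path.
> > > >             s = db->Flush(rocksdb::FlushOptions());
> > > >             if (!s.ok()) {
> > > >                     std::cerr << "Flush: " << s.ToString() << "\n";
> > > >                     return false;
> > > >             }
> > > >             return true;
> > > >     }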
> > > > 
> > > > I would just cc the rocksdb devs here, but they use facebook for this,
> > > > so I'm not sure how best to loop them into the discussion.
> > > 
> > > While you can of course use rocksdb on a normal FS in a generic
> > > situation, I don't think any of our deployment tools for BlueStore
> > > support it, and you can't do so with FileStore without hitting other
> > > problems, so that's happily not a worry for us.
> > > -Greg
> > 
> > --
> > Jeff Layton <jlayton@xxxxxxxxxx>
> > 

-- 
Jeff Layton <jlayton@xxxxxxxxxx>



