Re: fsyncgate and ceph

On Wed, 2019-02-13 at 16:39 +0000, Sage Weil wrote:
> On Wed, 13 Feb 2019, Dan van der Ster wrote:
> > On Wed, Feb 13, 2019 at 5:32 PM Brett Niver <bniver@xxxxxxxxxx> wrote:
> > >
> > > point of my question - similar to many other async error reporting
> > > systems - you need to do another fsync -after- you've written
> > > buffered I/O and done the initial fsync that you hope worked, to find
> > > out if it actually worked.
> > > right?
> 
> You only need one fsync(2) after you've done your buffered writes to 
> ensure the writes are stable and discover any errors.  A second fsync(2) 
> call will be a no-op unless more new data was written.
>
> > One of the main issues they found is that retrying fsync has
> > undefined/misunderstood behaviour.
> > In particular:
> > 
> > Linux < 4.13: fsync() errors can be lost in various ways; also buffers
> > are marked clean after errors, so retrying fsync() can falsely report
> > success and the modified buffer can be thrown away at any time due to
> > memory pressure
> 
> Right.  IIUC the moral of the story is you need to check the return value 
> on *every* call to fsync(2).
> 
> sage
> 

...and make sure that you're using a kernel that tracks the errors
properly (if at all possible).
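
To make the "check every call" rule concrete, here is a minimal sketch in
Python (chosen only for brevity; nothing in this thread uses it).
write_durably is a hypothetical helper name, not something from ceph or
rocksdb. Python's os.fsync() raises OSError where the C call would return
-1, so letting the exception propagate is the equivalent of checking the
return value. The sketch deliberately does not retry a failed fsync,
since on older kernels a retry can report success after the dirty pages
were already marked clean.

```python
import os


def write_durably(path: str, data: bytes) -> None:
    """Write data to path and fsync once; any fsync error propagates.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # os.write may write fewer bytes than requested, so loop.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # One fsync after the buffered writes. On failure this raises
        # OSError; we do NOT call fsync again hoping for success.
        os.fsync(fd)
    finally:
        os.close(fd)
```

A caller that catches the OSError should assume the data is not on stable
storage and either rewrite it from its own copy or abort, rather than
trusting whatever is left in the pagecache.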

FWIW, we were not able to backport that code to RHEL7 (because it would
have horribly broken kabi). It doesn't have the more robust error
tracking that newer kernels have. All the more reason to look toward
revving the underlying infrastructure on which ceph is commonly
deployed. ;)
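
As a rough illustration of the kernel point, the sketch below compares
the running kernel's release string against the 4.13 cutoff mentioned
earlier in this thread. The function names are hypothetical, and the
check is only a heuristic under the assumption that the version number
reflects upstream: a distro kernel can carry backports or lack them
regardless of its number, so this is no substitute for knowing what your
vendor actually ships.

```python
import os
import re
from typing import Tuple


def kernel_version(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '4.13.0-1'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def tracks_writeback_errors(release: str) -> bool:
    """Heuristic: True if the version is at or above the 4.13 cutoff."""
    return kernel_version(release) >= (4, 13)


if __name__ == "__main__":
    running = os.uname().release
    print(running, tracks_writeback_errors(running))
```

For example, a RHEL7-style release string like "3.10.0-957.el7.x86_64"
falls below the cutoff, which matches the situation described above.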

> 
> > 
> > -- Dan
> > 
> > 
> > 
> > >
> > >
> > > On Tue, Feb 12, 2019 at 6:04 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > >
> > > > I'm not sure I understand your question.
> > > >
> > > > If you're doing buffered I/O (i.e., you didn't open with O_DIRECT) then
> > > > any writes done into the kernel may be cached indefinitely, until a
> > > > subsequent, successful fsync() is done. At that point they have been
> > > > committed to stable storage (the definition of which varies depending on
> > > > the backend, of course).
> > > >
> > > > Most of the confusion around this is centered around what happens when
> > > > fsync _fails_. That implies that you wrote data into the pagecache and
> > > > it failed writeback.
> > > >
> > > > POSIX is notoriously vague about what happens to the data you wrote at
> > > > that point, and a lot of applications were written with incorrect
> > > > assumptions about this situation.
> > > >
> > > > We haven't really changed fsync semantics with any of this. The main
> > > > change is that we've tightened up error reporting by the kernel in this
> > > > situation, and attempted to clarify what applications can expect in the
> > > > face of a failed fsync.
> > > >
> > > > -- Jeff
> > > >
> > > > On Tue, 2019-02-12 at 17:51 -0500, Brett Niver wrote:
> > > > > does that imply you have to do another fsync after any fsync after any
> > > > > write in any cached mode?
> > > > >
> > > > > On Tue, Feb 12, 2019 at 4:36 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > > On Mon, 2019-02-11 at 11:09 -0800, Gregory Farnum wrote:
> > > > > > > On Mon, Feb 11, 2019 at 6:35 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > > > > > > > On Mon, 11 Feb 2019, Jeff Layton wrote:
> > > > > > > > > On Mon, 2019-02-11 at 09:22 +0100, Dan van der Ster wrote:
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Does anyone know if ceph and level/rocksdb are already immune to these
> > > > > > > > > > fsync issues discovered by the postgresql devs?
> > > > > > > > > >
> > > > > > > > > >     https://fosdem.org/2019/schedule/event/postgresql_fsync/
> > > > > > > > > >     https://wiki.postgresql.org/wiki/Fsync_Errors
> > > > > > > > > >     https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
> > > > > > > > > >
> > > > > > > > > > Cheers, Dan
> > > > > > > > >
> > > > > > > > > Great question. I took a brief look at the rocksdb code but wasn't able
> > > > > > > > > to draw a meaningful conclusion there.
> > > > > > > > >
> > > > > > > > > I do see that you can set it up to use O_DIRECT, but it's not clear to
> > > > > > > > > me that propagates fsync errors in a meaningful way if you don't. I'm
> > > > > > > > > also not sure how ceph configures rocksdb to operate here either.
> > > > > > > > >
> > > > > > > > > I think it'd be good to reach out to the rocksdb developers and see
> > > > > > > > > whether they've considered its behavior in the face of a writeback
> > > > > > > > > failure. I'm happy to discuss with them if they have questions about the
> > > > > > > > > kernel's behavior.
> > > > > > > >
> > > > > > > > Looking at the filestore code, I see that WBThrottle isn't checking the
> > > > > > > > fsync(2) return value!  That's an easy fix (we should assert/panic).
> > > > > > > > Opened
> > > > > > >
> > > > > > > It's not just WBThrottle; the main FileStore code also doesn't check
> > > > > > > when doing init() (not so important) or the replay guards (VERY BAD).
> > > > > > >
> > > > > > > I remember MANY conversations about not only the oddities of
> > > > > > > fdatasync-on-xfs-with-weird-options but also fsync in particular (and
> > > > > > > I was most surprised by the improvements the kernel developers have
> > > > > > > been working on in what I skimmed), so I'm surprised we seem to have a
> > > > > > > vulnerability there... Maybe some of the "improvements" made it worse
> > > > > > > on us since we wrote those? :/
> > > > > > >
> > > > > >
> > > > > > I don't think it has made anything worse.
> > > > > >
> > > > > > What has changed recently is just the realization that these sorts of
> > > > > > errors do occur and that the old Linux kernel code that tracked these
> > > > > > errors was horribly unreliable. Writeback could suffer a transient
> > > > > > failure and you might just never know it.
> > > > > >
> > > > > > What we have now in the kernel is much more robust error reporting, such
> > > > > > that when pagecache writeback does fail, you reliably get an error
> > > > > > on a subsequent fsync. Out of this, we've all come to the realization
> > > > > > that a lot of userland programs have sloppy or misunderstood handling of
> > > > > > errors from fsync.
> > > > > >
> > > > > > This problem hasn't really gotten worse or anything. The underlying
> > > > > > storage is just as (un)reliable as ever. We're just giving more scrutiny
> > > > > > to userland applications and how they handle these sorts of errors.
> > > > > >
> > > > > > > On Mon, Feb 11, 2019 at 7:16 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > > > > I think it's possible to keep the rocksdb on a normal filesystem though,
> > > > > > > > and it's not clear to me what happens with rocksdb in that case. If
> > > > > > > > writing to the WAL fails, then it looks like the caller will get an
> > > > > > > > error back at that point (e.g. on rocksdb::Put or the like).
> > > > > > > >
> > > > > > > > What I'm currently unclear on is what happens when the background flush
> > > > > > > > runs to sync the WAL out to the datafiles. If you hit an error at that
> > > > > > > > point, it looks like it'll just log the error, sleep for a bit and
> > > > > > > > then...I can't quite tell if it'll retry to write out the WALs that
> > > > > > > > failed, or whether it just sort of moves on at that point.
> > > > > > > >
> > > > > > > > I would just cc the rocksdb devs here, but they use facebook for this,
> > > > > > > > so I'm not sure how best to loop them into the discussion.
> > > > > > >
> > > > > > > While you can of course use rocksdb on a normal FS in a generic
> > > > > > > situation, I don't think any of our deployment tools for BlueStore
> > > > > > > support it and you can't do so with FileStore or you have other
> > > > > > > problems, so that's happily not a worry for us.
> > > > > > > -Greg
> > > > > >
> > > > > > --
> > > > > > Jeff Layton <jlayton@xxxxxxxxxx>
> > > > > >
> > > >
> > > > --
> > > > Jeff Layton <jlayton@xxxxxxxxxx>
> > > >
> > 
> > 

-- 
Jeff Layton <jlayton@xxxxxxxxxx>



