Does that imply you have to do another fsync after any fsync after any write in any cached mode?

On Tue, Feb 12, 2019 at 4:36 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Mon, 2019-02-11 at 11:09 -0800, Gregory Farnum wrote:
> > On Mon, Feb 11, 2019 at 6:35 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > > On Mon, 11 Feb 2019, Jeff Layton wrote:
> > > > On Mon, 2019-02-11 at 09:22 +0100, Dan van der Ster wrote:
> > > > > Hi all,
> > > > >
> > > > > Does anyone know if ceph and level/rocksdb are already immune to these
> > > > > fsync issues discovered by the postgresql devs?
> > > > >
> > > > > https://fosdem.org/2019/schedule/event/postgresql_fsync/
> > > > > https://wiki.postgresql.org/wiki/Fsync_Errors
> > > > > https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
> > > > >
> > > > > Cheers, Dan
> > > >
> > > > Great question. I took a brief look at the rocksdb code but wasn't able
> > > > to draw a meaningful conclusion there.
> > > >
> > > > I do see that you can set it up to use O_DIRECT, but it's not clear to
> > > > me that propagates fsync errors in a meaningful way if you don't. I'm
> > > > also not sure how ceph configures rocksdb to operate here either.
> > > >
> > > > I think it'd be good to reach out to the rocksdb developers and see
> > > > whether they've considered its behavior in the face of a writeback
> > > > failure. I'm happy to discuss with them if they have questions about the
> > > > kernel's behavior.
> > >
> > > Looking at the filestore code, I see that WBThrottle isn't checking the
> > > fsync(2) return value! That's an easy fix (we should assert/panic).
> > > Opened
> >
> > It's not just WBThrottle; the main FileStore code also doesn't check
> > when doing init() (not so important) or the replay guards (VERY BAD).
> >
> > I remember MANY conversations about not only the oddities of
> > fdatasync-on-xfs-with-weird-options but also fsync in particular (and
> > I was most surprised by the improvements the kernel developers have
> > been working on in what I skimmed), so I'm surprised we seem to have a
> > vulnerability there... Maybe some of the "improvements" made it worse
> > on us since we wrote those? :/
> >
>
> I don't think it has made anything worse.
>
> What has changed recently is just the realization that these sorts of
> errors do occur and that the old Linux kernel code that tracked these
> errors was horribly unreliable. Writeback could suffer a transient
> failure and you might just never know it.
>
> What we have now in the kernel is much more robust error reporting such
> that when pagecache writeback does fail, that you reliably get an error
> on a subsequent fsync. Out of this, we've all come to the realization
> that a lot of userland programs have sloppy or misunderstood handling of
> errors from fsync.
>
> This problem hasn't really gotten worse or anything. The underlying
> storage is just as (un)reliable as ever. We're just giving more scrutiny
> to userland applications and how they handle these sorts of errors.
>
> > On Mon, Feb 11, 2019 at 7:16 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > I think it's possible to keep the rocksdb on a normal filesystem though,
> > > and it's not clear to me what happens with rocksdb in that case. If
> > > writing to the WAL fails, then it looks like the caller will get an
> > > error back at that point (e.g. on rocksdb::Put or the like).
> > >
> > > What I'm currently unclear on is what happens when the background flush
> > > runs to sync the WAL out to the datafiles. If you hit an error at that
> > > point, it looks like it'll just log the error, sleep for a bit and
> > > then...I can't quite tell if it'll retry to write out the WALs that
> > > failed, or whether it just sort of moves on at that point.
> > >
> > > I would just cc the rocksdb devs here, but they use facebook for this,
> > > so I'm not sure how best to loop them into the discussion.
> >
> > While you can of course use rocksdb on a normal FS in a generic
> > situation, I don't think any of our deployment tools for BlueStore
> > support it and you can't do so with FileStore or you have other
> > problems, so that's happily not a worry for us.
> > -Greg
>
> --
> Jeff Layton <jlayton@xxxxxxxxxx>
>
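
For reference, the "check the fsync(2) return value and assert/panic" fix suggested above could look roughly like the sketch below. It is written in Python for brevity and is not the FileStore or WBThrottle code, which is C++. The names checked_fsync, write_durably and DurabilityError are hypothetical, made up for this illustration. The one assumption carried over from the thread and the PostgreSQL write-up is that a failed fsync should be treated as fatal rather than retried, since a later fsync on the same file may report success even though the earlier data never reached stable storage.

```python
import os
import tempfile


class DurabilityError(RuntimeError):
    """Hypothetical error type: fsync reported that writeback failed."""


def checked_fsync(fd: int) -> None:
    """Call fsync on fd and raise instead of silently ignoring a failure."""
    try:
        os.fsync(fd)
    except OSError as exc:
        # Deliberately no retry loop here: once fsync has reported an
        # error, a second call may succeed without the lost data ever
        # having been written, so the caller must treat this as fatal.
        raise DurabilityError(f"fsync(fd={fd}) failed: {exc}") from exc


def write_durably(path: str, data: bytes) -> None:
    """Write data to path and only return once fsync has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        checked_fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        target = os.path.join(tmpdir, "journal.bin")
        write_durably(target, b"replay guard")
        print(os.path.getsize(target), "bytes written and synced")
```

A caller that cannot recover would let DurabilityError propagate and terminate the process (the equivalent of the assert/panic mentioned above) and then rebuild state from a known-good source on restart, rather than calling fsync again and trusting a later success.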