On Mon, 2019-02-11 at 11:09 -0800, Gregory Farnum wrote:
> On Mon, Feb 11, 2019 at 6:35 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > On Mon, 11 Feb 2019, Jeff Layton wrote:
> > > On Mon, 2019-02-11 at 09:22 +0100, Dan van der Ster wrote:
> > > > Hi all,
> > > >
> > > > Does anyone know if ceph and level/rocksdb are already immune to these
> > > > fsync issues discovered by the postgresql devs?
> > > >
> > > > https://fosdem.org/2019/schedule/event/postgresql_fsync/
> > > > https://wiki.postgresql.org/wiki/Fsync_Errors
> > > > https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
> > > >
> > > > Cheers, Dan
> > >
> > > Great question. I took a brief look at the rocksdb code but wasn't able
> > > to draw a meaningful conclusion there.
> > >
> > > I do see that you can set it up to use O_DIRECT, but it's not clear to
> > > me that propagates fsync errors in a meaningful way if you don't. I'm
> > > also not sure how ceph configures rocksdb to operate here either.
> > >
> > > I think it'd be good to reach out to the rocksdb developers and see
> > > whether they've considered its behavior in the face of a writeback
> > > failure. I'm happy to discuss with them if they have questions about the
> > > kernel's behavior.
> >
> > Looking at the filestore code, I see that WBThrottle isn't checking the
> > fsync(2) return value! That's an easy fix (we should assert/panic).
> > Opened
>
> It's not just WBThrottle; the main FileStore code also doesn't check
> when doing init() (not so important) or the replay guards (VERY BAD).
>
> I remember MANY conversations about not only the oddities of
> fdatasync-on-xfs-with-weird-options but also fsync in particular (and
> I was most surprised by the improvements the kernel developers have
> been working on in what I skimmed), so I'm surprised we seem to have a
> vulnerability there... Maybe some of the "improvements" made it worse
> on us since we wrote those? :/
>

I don't think it has made anything worse. What has changed recently is
just the realization that these sorts of errors do occur and that the
old Linux kernel code that tracked these errors was horribly unreliable.
Writeback could suffer a transient failure and you might just never know
it.

What we have now in the kernel is much more robust error reporting, such
that when pagecache writeback does fail, you reliably get an error on a
subsequent fsync.

Out of this, we've all come to the realization that a lot of userland
programs have sloppy or misunderstood handling of errors from fsync.

This problem hasn't really gotten worse or anything. The underlying
storage is just as (un)reliable as ever. We're just giving more scrutiny
to userland applications and how they handle these sorts of errors.

> On Mon, Feb 11, 2019 at 7:16 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > I think it's possible to keep the rocksdb on a normal filesystem though,
> > and it's not clear to me what happens with rocksdb in that case. If
> > writing to the WAL fails, then it looks like the caller will get an
> > error back at that point (e.g. on rocksdb::Put or the like).
> >
> > What I'm currently unclear on is what happens when the background flush
> > runs to sync the WAL out to the datafiles. If you hit an error at that
> > point, it looks like it'll just log the error, sleep for a bit and
> > then...I can't quite tell if it'll retry to write out the WALs that
> > failed, or whether it just sort of moves on at that point.
> >
> > I would just cc the rocksdb devs here, but they use facebook for this,
> > so I'm not sure how best to loop them into the discussion.
>
> While you can of course use rocksdb on a normal FS in a generic
> situation, I don't think any of our deployment tools for BlueStore
> support it and you can't do so with FileStore or you have other
> problems, so that's happily not a worry for us.
> -Greg

-- 
Jeff Layton <jlayton@xxxxxxxxxx>
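
For illustration only, here is a minimal sketch of the pattern suggested in
the quoted discussion: check the result of fsync and treat a failure as
fatal rather than carrying on. This is not Ceph or rocksdb code. The names
fsync_or_fail, write_durably and DurabilityError are hypothetical, and
Python is used only for brevity (os.fsync reports failure by raising
OSError rather than by returning -1 as fsync(2) does in C).

```python
import os
import tempfile


class DurabilityError(RuntimeError):
    """Hypothetical error type: the kernel reported an fsync failure."""


def fsync_or_fail(fd: int) -> None:
    """Flush fd to stable storage and refuse to ignore an fsync error.

    Assumption for this sketch: the caller treats DurabilityError as
    fatal (the "assert/panic" approach) instead of retrying the fsync.
    """
    try:
        os.fsync(fd)
    except OSError as exc:
        raise DurabilityError(f"fsync failed: {exc}") from exc


def write_durably(path: str, data: bytes) -> None:
    """Write data to path and only return once fsync has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        fsync_or_fail(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        write_durably(os.path.join(tmpdir, "example.bin"), b"payload")
        print("write and fsync succeeded")
```

In a storage daemon the caller would presumably let DurabilityError
propagate and terminate the process, so that recovery happens from a known
state instead of continuing on data that may never have reached disk.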