Re: fsyncgate and ceph

On Mon, Feb 11, 2019 at 6:35 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Mon, 11 Feb 2019, Jeff Layton wrote:
> > On Mon, 2019-02-11 at 09:22 +0100, Dan van der Ster wrote:
> > > Hi all,
> > >
> > > Does anyone know if ceph and level/rocksdb are already immune to these
> > > fsync issues discovered by the postgresql devs?
> > >
> > >     https://fosdem.org/2019/schedule/event/postgresql_fsync/
> > >     https://wiki.postgresql.org/wiki/Fsync_Errors
> > >     https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
> > >
> > > Cheers, Dan
> >
> > Great question. I took a brief look at the rocksdb code but wasn't able
> > to draw a meaningful conclusion there.
> >
> > I do see that you can set it up to use O_DIRECT, but it's not clear to
> > me that it propagates fsync errors in a meaningful way if you don't. I'm
> > also not sure how ceph configures rocksdb to operate here either.
> >
> > I think it'd be good to reach out to the rocksdb developers and see
> > whether they've considered its behavior in the face of a writeback
> > failure. I'm happy to discuss with them if they have questions about the
> > kernel's behavior.
>
> Looking at the filestore code, I see that WBThrottle isn't checking the
> fsync(2) return value!  That's an easy fix (we should assert/panic).
> Opened
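
(Re the O_DIRECT bit in Jeff's mail above: in stock rocksdb I believe
that's just a couple of option flags -- a minimal, untested sketch, not
how we actually wire it up:

    #include "rocksdb/options.h"

    // Hypothetical helper: build Options with direct I/O enabled.
    rocksdb::Options make_direct_io_options() {
      rocksdb::Options options;
      options.create_if_missing = true;
      // These cover reads and flush/compaction writes to the data
      // files; as far as I can tell the WAL path is separate.
      options.use_direct_reads = true;
      options.use_direct_io_for_flush_and_compaction = true;
      return options;
    }

so even with that set, whether errors propagate "meaningfully" still
comes down to how the Status returns from the write path get handled.)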

It's not just WBThrottle; the main FileStore code also doesn't check
the return value when doing init() (not so important) or when writing
the replay guards (VERY BAD).
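
For anyone following along, the fix Sage is describing is conceptually
just the sketch below (fsync_or_abort is a made-up helper, not code
from our tree); the point is that the fsync(2) return value gets
checked and a failure is treated as fatal rather than ignored:

    #include <cerrno>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <unistd.h>

    // Hypothetical helper: once fsync(2) reports an error the kernel
    // may already have dropped the dirty pages, so a retry can
    // "succeed" without the data ever reaching disk -- better to die.
    static void fsync_or_abort(int fd, const char* what) {
      if (::fsync(fd) < 0) {
        int err = errno;
        fprintf(stderr, "%s: fsync failed: %s\n", what, strerror(err));
        abort();  // in ceph, an assert/ceph_abort-style failure
      }
    }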

I remember MANY conversations about not only the oddities of
fdatasync-on-xfs-with-weird-options but also fsync in particular (in
what I skimmed, I was mostly surprised by the improvements the kernel
developers have been working on), so I'm surprised we seem to have a
vulnerability there... Maybe some of those "improvements" made things
worse for us since we wrote that code? :/

On Mon, Feb 11, 2019 at 7:16 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> I think it's possible to keep the rocksdb on a normal filesystem though,
> and it's not clear to me what happens with rocksdb in that case. If
> writing to the WAL fails, then it looks like the caller will get an
> error back at that point (e.g. on rocksdb::Put or the like).
>
> What I'm currently unclear on is what happens when the background flush
> runs to sync the WAL out to the datafiles. If you hit an error at that
> point, it looks like it'll just log the error, sleep for a bit and
> then...I can't quite tell if it'll retry writing out the WALs that
> failed, or whether it just sort of moves on at that point.
>
> I would just cc the rocksdb devs here, but they use facebook for this,
> so I'm not sure how best to loop them into the discussion.
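
For reference, my understanding (from a quick read, so treat this as a
sketch rather than gospel) is that a synchronous write surfaces a WAL
fsync failure directly as a non-ok Status on the call:

    #include <iostream>
    #include "rocksdb/db.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
      if (!s.ok()) {
        std::cerr << "open failed: " << s.ToString() << std::endl;
        return 1;
      }

      // With sync=true the WAL is fsync'd before Put() returns, so an
      // fsync failure should come back here as a non-ok Status rather
      // than being deferred to the background flush.
      rocksdb::WriteOptions wopts;
      wopts.sync = true;
      s = db->Put(wopts, "key", "value");
      if (!s.ok())
        std::cerr << "put failed: " << s.ToString() << std::endl;

      delete db;
      return s.ok() ? 0 : 1;
    }

The open question is exactly the one you raise: the unsynced case,
where the error only shows up during the background flush/WAL sync and
the caller never gets a Status back for it.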

While you can of course run rocksdb on a normal FS in a generic
setting, I don't think any of our deployment tools for BlueStore
support that, and you can't do it with FileStore without running into
other problems, so happily that's not a worry for us.
-Greg


