Re: fsyncgate and ceph

On Mon, 2019-02-11 at 14:33 +0000, Sage Weil wrote:
> On Mon, 11 Feb 2019, Jeff Layton wrote:
> > On Mon, 2019-02-11 at 09:22 +0100, Dan van der Ster wrote:
> > > Hi all,
> > > 
> > > Does anyone know if ceph and level/rocksdb are already immune to these
> > > fsync issues discovered by the postgresql devs?
> > > 
> > >     https://fosdem.org/2019/schedule/event/postgresql_fsync/
> > >     https://wiki.postgresql.org/wiki/Fsync_Errors
> > >     https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
> > > 
> > > Cheers, Dan
> > 
> > Great question. I took a brief look at the rocksdb code but wasn't able
> > to draw a meaningful conclusion there.
> > 
> > I do see that you can set it up to use O_DIRECT, but it's not clear to
> > me that it propagates fsync errors in a meaningful way if you don't. I'm
> > also not sure how ceph configures rocksdb to operate here.
> > 
> > I think it'd be good to reach out to the rocksdb developers and see
> > whether they've considered its behavior in the face of a writeback
> > failure. I'm happy to discuss with them if they have questions about the
> > kernel's behavior.
> 
> Looking at the filestore code, I see that WBThrottle isn't checking the 
> fsync(2) return value!  That's an easy fix (we should assert/panic).  
> Opened 
> 
> The bluestore code (os/bluestore/KernelDevice) looks fine (there is a 
> single call to fdatasync(2) and we abort on any error).
> 

Yes, from what I can tell, KernelDevice holds an open fd that exists for
the life of the bluestore "mount", and fdatasync is issued against that
fd periodically. Assuming a modern kernel, that should be sufficient to
detect writeback errors -- even ones that occur in the context of
rocksdb on the device.
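
Roughly, the pattern I mean is this check-and-abort on the long-lived fd
(purely a sketch -- the device path and error handling are made up, not
ceph's actual KernelDevice code):

#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main()
{
	// hypothetical device path; the point is that the fd stays open
	// for the life of the "mount", so it sees all writeback errors
	int fd = open("/dev/sdX", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* ... writes happen here ... */

	// never ignore the fdatasync return value -- once it fails, give
	// up rather than assume the data made it to stable storage
	if (fdatasync(fd) < 0) {
		fprintf(stderr, "fdatasync failed: %s\n", strerror(errno));
		abort();
	}

	close(fd);
	return 0;
}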

I think it's possible to keep the rocksdb data on a normal filesystem
though, and it's not clear to me what happens in that case. If writing
to the WAL fails, then it looks like the caller will get an error back
at that point (e.g. on rocksdb::Put or the like).
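
To illustrate what I mean by the caller seeing the error -- a rough
sketch against the public rocksdb API (the db path and keys are made
up, and this isn't how bluestore wires rocksdb up):

#include <cassert>
#include <iostream>
#include "rocksdb/db.h"

int main()
{
	rocksdb::DB* db = nullptr;
	rocksdb::Options options;
	options.create_if_missing = true;

	rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
	assert(s.ok());

	// With sync=true the WAL is synced before Put returns, so a WAL
	// write/sync failure should surface here as a non-ok Status.
	rocksdb::WriteOptions wopts;
	wopts.sync = true;
	s = db->Put(wopts, "key", "value");
	if (!s.ok()) {
		std::cerr << "Put failed: " << s.ToString() << std::endl;
		// the caller gets to decide what to do -- but at least it knows
	}

	delete db;
	return 0;
}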

What I'm currently unclear on is what happens when the background flush
runs to sync the WAL out to the data files. If it hits an error at that
point, it looks like it'll just log the error, sleep for a bit, and
then... I can't quite tell whether it retries writing out the WALs that
failed, or just sort of moves on at that point.
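
If I'm reading rocksdb/listener.h right, an EventListener can at least
observe those background failures via OnBackgroundError. Something like
this (again just a sketch to show the hook, not something ceph sets up
today, and it doesn't answer the retry question):

#include <iostream>
#include <memory>
#include "rocksdb/db.h"
#include "rocksdb/listener.h"

// Listener that logs any error rocksdb hits in a background flush or
// compaction.
class LogBgErrors : public rocksdb::EventListener {
 public:
	void OnBackgroundError(rocksdb::BackgroundErrorReason /*reason*/,
	                       rocksdb::Status* bg_error) override {
		std::cerr << "rocksdb background error: "
		          << bg_error->ToString() << std::endl;
	}
};

int main()
{
	rocksdb::Options options;
	options.create_if_missing = true;
	options.listeners.emplace_back(std::make_shared<LogBgErrors>());

	rocksdb::DB* db = nullptr;
	rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
	if (!s.ok()) {
		std::cerr << "open failed: " << s.ToString() << std::endl;
		return 1;
	}
	// ... from here on, flush/compaction failures hit the listener ...
	delete db;
	return 0;
}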

I would just cc the rocksdb devs here, but they use Facebook for this,
so I'm not sure how best to loop them into the discussion.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>