On Thu, 10 Oct 2019, Xuehan Xu wrote:
> > > My recollection is that rocksdb is always flushing, correct.
> > > There are conveniently only a handful of writers in rocksdb, the
> > > main ones being log files and sst files.
> > >
> > > We could probably put an assertion in fsync() to ensure that the
> > > FileWriter buffer is empty and flushed...?
> >
> > Thanks for your reply, sage:-) I will do that:-)
> >
> > By the way, I've got another question here:
> > It seems that BlueStore tries to provide some kind of atomic I/O
> > mechanism in which data and metadata are either both modified or
> > both untouched. To accomplish this, for modifications whose size is
> > larger than prefer_deferred_size, BlueStore will allocate new space
> > for the modifications and release the old storage space. I think,
> > in the long run, an initially contiguously stored file in bluestore
> > could become scattered if there have been many random modifications
> > to that file. Actually, this is what we are experiencing in our
> > test clusters. The consequence is that after some period of random
> > modification, the sequential read performance of that file is
> > significantly degraded. Should we make this atomic I/O mechanism
> > optional? It seems that most hard disks only make sure that a
> > sector is never half-modified, for which, I think, the deferred I/O
> > is enough. Am I right? Thanks:-)
>
> I mean, in the scenario of RBD, since most real hard disks only
> guarantee that a sector is never half-modified, providing the atomic
> I/O guarantee only for modifications whose size is less than or
> equal to that of a disk sector, which deferred I/O already gives us,
> should be enough. So, maybe, this atomic I/O guarantee for large
> modifications should be made configurable.

The OSD needs to record both the data update *and* the metadata
associated with it (the pg log entry) atomically, so atomic sector
updates aren't sufficient.

You might try increasing bluestore_prefer_deferred_size, which will
make more writes take the deferred IO path. This gets increasingly
inefficient the larger the value is, though!

If we really find that fragmentation is a problem over the long term,
we should make the deep scrub process rewrite the data it has read
if/when it is too fragmented.

sage
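
For reference, here is the rough shape of the fsync() assertion being
discussed. This is only a toy sketch, not the actual BlueFS code; the
FileWriter struct and buffer member here are simplified stand-ins for
whatever the real writer keeps its unflushed bytes in:

    // Toy illustration of the idea: fsync() can assert that the
    // writer's in-memory buffer is already empty, i.e. that rocksdb
    // flushed before syncing.  Not the real BlueFS implementation.
    #include <cassert>
    #include <string>

    struct FileWriter {
      std::string buffer;   // bytes appended but not yet flushed
    };

    int do_fsync(FileWriter *h) {
      // rocksdb is expected to flush() before fsync(), so anything
      // still sitting in the buffer indicates a missed flush.
      assert(h->buffer.empty());
      // ... issue the actual flush/fsync to the device here ...
      return 0;
    }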
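
And, for anyone following along, a very rough sketch of the size-based
decision that causes the fragmentation described above. This is not
BlueStore's actual write path; the threshold value and names are only
illustrative (32 KiB is the usual HDD default for
bluestore_prefer_deferred_size, but check your build):

    // Simplified illustration of the trade-off under discussion.
    #include <cstdint>
    #include <cstdio>

    constexpr uint64_t prefer_deferred_size = 32768;  // e.g. 32 KiB

    enum class WritePath { Deferred, NewAllocation };

    // Small overwrites are journaled (deferred) and later applied in
    // place, so the on-disk layout is preserved.  Larger overwrites
    // get fresh space so data and metadata commit atomically, which
    // is what can scatter an initially contiguous object over time.
    WritePath choose_write_path(uint64_t length) {
      return length <= prefer_deferred_size ? WritePath::Deferred
                                            : WritePath::NewAllocation;
    }

    int main() {
      std::printf("4 KiB overwrite -> %s\n",
                  choose_write_path(4096) == WritePath::Deferred
                      ? "deferred (in place)" : "new allocation");
      std::printf("1 MiB overwrite -> %s\n",
                  choose_write_path(1048576) == WritePath::Deferred
                      ? "deferred (in place)" : "new allocation");
      return 0;
    }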