Re: Why BlueRocksDirectory::Fsync only sync metadata?

Xuehan Xu <xxhdx1985126@xxxxxxxxx> · Thu, 10 Oct 2019 15:42:26 +0800

> > My recollection is that rocksdb is always flushing, correct.  There are
> > conveniently only a handful of writers in rocksdb, the main ones being log
> > files and sst files.
> >
> > We could probably put an assertion in fsync() to ensure that the
> > FileWriter buffer is empty and flushed...?
>
> Thanks for your reply, sage:-) I will do that:-)
>
> By the way, I've got another question here:
>        It seems that BlueStore tries to provide some kind of atomic
> I/O mechanism in which data and metadata are either both modified or
> both untouched. To accomplish this, for modifications whose size is
> larger than bluestore_prefer_deferred_size, BlueStore allocates new
> space for the modified data and releases the old space. I think that,
> in the long run, an initially contiguous file in BlueStore can become
> scattered if there have been many random modifications to that file.
> Actually, this is what we are experiencing in our test clusters. The
> consequence is that after some period of random modification, the
> sequential read performance of that file is significantly degraded.
> Should we make this atomic I/O mechanism optional? It seems that most
> hard disks only guarantee that a sector is never half-modified, for
> which, I think, deferred I/O is enough. Am I right? Thanks:-)

I mean, in the RBD scenario, since most real hard disks only
guarantee that a sector is never half-modified, providing an atomic
I/O guarantee only for modifications whose size is less than or equal
to that of a disk sector, which deferred I/O already provides, should
be enough. So maybe the atomic I/O guarantee for larger modifications
should be made configurable.
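
To make the fragmentation effect concrete, here is a toy model of the
behaviour described above. It is only a sketch under my own assumptions,
not BlueStore's actual allocator: the function names, the bump allocator,
and the block-granular threshold prefer_deferred_blocks are all
hypothetical. Writes at or below the threshold are treated as deferred
and overwrite in place; larger writes get freshly allocated blocks.

```python
import random


def count_extents(mapping):
    """Count physically contiguous runs when reading in logical order."""
    return 1 + sum(1 for a, b in zip(mapping, mapping[1:]) if b != a + 1)


def simulate(num_blocks, writes, write_blocks, prefer_deferred_blocks, seed=0):
    """Toy model: return the extent count of one file after random writes.

    Assumption: the file starts out contiguous, and new space always comes
    from a simple bump allocator past the end of the original file.
    """
    rng = random.Random(seed)
    mapping = list(range(num_blocks))  # logical block i -> physical block i
    next_free = num_blocks             # hypothetical bump allocator cursor
    for _ in range(writes):
        start = rng.randrange(num_blocks - write_blocks + 1)
        if write_blocks <= prefer_deferred_blocks:
            # Deferred write: data is overwritten in place, so the
            # logical-to-physical mapping does not change.
            continue
        # Large write: allocate new blocks and remap; the old blocks
        # would be released back to the allocator.
        for i in range(start, start + write_blocks):
            mapping[i] = next_free
            next_free += 1
    return count_extents(mapping)
```

For example, simulate(1024, 100, 1, 1) models only deferred-size writes,
while simulate(1024, 100, 4, 1) models writes above the threshold; the
second leaves the file split across multiple extents, which is what
hurts sequential reads on a hard disk.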