> > My recollection is that rocksdb is always flushing, correct. There are
> > conveniently only a handful of writers in rocksdb, the main ones being log
> > files and sst files.
> >
> > We could probably put an assertion in fsync() to ensure that the
> > FileWriter buffer is empty and flushed...?
>
> Thanks for your reply, Sage:-) I will do that:-)
>
> By the way, I've got another question here:
> It seems that BlueStore tries to provide some kind of atomic
> I/O mechanism in which data and metadata are either both modified or
> both untouched. To accomplish this, for modifications whose size is
> larger than prefer_defer_size, BlueStore allocates new space for
> the modifications and releases the old storage space. I think, in the
> long run, an initially contiguous file in BlueStore could become
> scattered if there have been many random modifications to that file.
> Actually, this is what we are experiencing in our test clusters. The
> consequence is that after some period of random modification, the
> sequential read performance of that file is significantly degraded.
> Should we make this atomic I/O mechanism optional? It seems that most
> hard disks only make sure that a sector is never half-modified, for
> which, I think, the deferred I/O is enough. Am I right? Thanks:-)

I mean, in the RBD scenario, since most real hard disks only guarantee
that a sector is never half-modified, providing an atomic I/O guarantee
only for modifications whose size is less than or equal to that of a
disk sector, which is already guaranteed by deferred I/O, should be
enough. So maybe this atomic I/O guarantee for large modifications
should be made configurable.
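
To make the fragmentation concern concrete, here is a minimal toy sketch in Python. It is not BlueStore code. It assumes a file stored as fixed-size blocks, a hypothetical bump allocator that always hands out fresh space at the end of the device (the real allocator reuses freed space), and a hypothetical `defer_threshold_blocks` parameter standing in for prefer_defer_size. Writes at or below the threshold are treated as in-place overwrites; larger writes are redirected to newly allocated blocks. All names here are made up for illustration.

```python
import random


def count_extents(block_map):
    """Count physically contiguous runs in a logical-to-physical block map."""
    if not block_map:
        return 0
    extents = 1
    for prev, cur in zip(block_map, block_map[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


def simulate(num_blocks, num_writes, write_blocks, defer_threshold_blocks, seed=0):
    """Apply random overwrites and return the resulting number of extents.

    Toy model only: writes larger than the threshold always get new space
    from a bump allocator; smaller writes leave the layout unchanged.
    """
    rng = random.Random(seed)
    block_map = list(range(num_blocks))  # file starts out fully contiguous
    next_free = num_blocks               # hypothetical allocator cursor
    for _ in range(num_writes):
        start = rng.randrange(0, num_blocks - write_blocks + 1)
        if write_blocks <= defer_threshold_blocks:
            # Small write: overwritten in place, so the mapping is untouched.
            continue
        for i in range(write_blocks):
            # Large write: each block is remapped to freshly allocated space.
            block_map[start + i] = next_free
            next_free += 1
    return count_extents(block_map)


if __name__ == "__main__":
    in_place = simulate(1024, 200, 4, defer_threshold_blocks=8)
    redirected = simulate(1024, 200, 4, defer_threshold_blocks=1)
    print("extents with in-place overwrites:", in_place)
    print("extents with redirected writes:  ", redirected)
```

In this model the in-place case always stays at a single extent, while the redirected case splits the file into many extents, which is the pattern that would hurt sequential reads on spinning disks. How closely this matches a real cluster depends on the actual allocator and workload.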