> My recollection is that rocksdb is always flushing, correct. There are
> conveniently only a handful of writers in rocksdb, the main ones being log
> files and sst files.
>
> We could probably put an assertion in fsync() to ensure that the
> FileWriter buffer is empty and flushed...?

Thanks for your reply, Sage :-) I will do that.

By the way, I have another question:

It seems that BlueStore tries to provide a kind of atomic I/O mechanism in which data and metadata are either both modified or both left untouched. To accomplish this, for modifications larger than prefer_defer_size, BlueStore allocates new space for the modified data and releases the old space. In the long run, I think a file that was initially stored contiguously in BlueStore can become scattered once it has seen many random modifications. This is actually what we are experiencing in our test clusters: after a period of random modifications, the sequential read performance of the file is significantly degraded.

Should we make this atomic I/O mechanism optional? It seems that most hard disks only guarantee that a sector is never half-modified, and for that, I think, deferred I/O is enough. Am I right?

Thanks :-)
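
P.S. To make the fragmentation concern concrete, here is a toy model of the behaviour I am describing. It is only a sketch and not BlueStore code: the bump allocator that never reuses freed blocks, the block counts, and the names `simulate`, `count_extents` and `defer_threshold` are all assumptions made up for illustration. Writes at or above the threshold are redirected to newly allocated blocks; smaller writes are treated as in-place overwrites.

```python
import random


def count_extents(mapping):
    """Count physically contiguous runs when the file is read in logical order."""
    return 1 + sum(1 for a, b in zip(mapping, mapping[1:]) if b != a + 1)


def simulate(file_blocks=1024, writes=200, write_blocks=8,
             defer_threshold=4, seed=0):
    """Toy model: large overwrites get new space, small ones stay in place."""
    rng = random.Random(seed)
    # Logical block i initially lives at physical block i (fully contiguous).
    mapping = list(range(file_blocks))
    # Hypothetical bump allocator: new space always comes from the end of
    # the device and freed blocks are never reused.
    next_free = file_blocks
    for _ in range(writes):
        start = rng.randrange(file_blocks - write_blocks + 1)
        if write_blocks < defer_threshold:
            # Small write: modelled as an in-place overwrite, layout unchanged.
            continue
        # Large write: point the logical range at freshly allocated blocks.
        for i in range(write_blocks):
            mapping[start + i] = next_free + i
        next_free += write_blocks
    return count_extents(mapping)


if __name__ == "__main__":
    print("extents without writes:      ", simulate(writes=0))
    print("extents after small writes:  ", simulate(write_blocks=2))
    print("extents after large writes:  ", simulate())
```

In this model the extent count stays at 1 for in-place writes and grows with the number of redirected random writes, which is the pattern that would hurt sequential reads on a hard disk. A real allocator may of course behave quite differently.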