On Thu, 10 Oct 2019, Xuehan Xu wrote:
> > > My recollection is that rocksdb is always flushing, correct.
> > > There are conveniently only a handful of writers in rocksdb, the
> > > main ones being log files and sst files.
> > >
> > > We could probably put an assertion in fsync() to ensure that the
> > > FileWriter buffer is empty and flushed...?
> >
> > Thanks for your reply, sage:-) I will do that:-)
> >
> > By the way, I've got another question here:
> > It seems that BlueStore tries to provide some kind of atomic I/O
> > mechanism in which data and metadata are either both modified or
> > both untouched. To accomplish this, for modifications whose size is
> > larger than prefer_deferred_size, BlueStore will allocate new space
> > for the modifications and release the old storage space. I think,
> > in the long run, an initially contiguously stored file in bluestore
> > could become scattered if there have been many random modifications
> > to that file. Actually, this is what we are experiencing in our
> > test clusters. The consequence is that after some period of random
> > modification, the sequential read performance of that file is
> > significantly degraded. Should we make this atomic I/O mechanism
> > optional? It seems that most hard disks only make sure that a
> > sector is never half-modified, for which, I think, the deferred I/O
> > is enough. Am I right? Thanks:-)
>
> I mean, in the scenario of RBD, since most real hard disks only
> guarantee that a sector is never half-modified, providing the atomic
> I/O guarantee only for modifications whose size is less than or
> equal to that of a disk sector, which deferred I/O already gives us,
> should be enough. So, maybe, this atomic I/O guarantee for large
> modifications should be made configurable.

The OSD needs to record both the data update *and* the metadata
associated with it (the pg log entry) atomically, so atomic sector
updates aren't sufficient.

You might try increasing bluestore_prefer_deferred_size, which will
make more writes take the deferred IO path. This gets increasingly
inefficient the larger the value is, though!

If we really find that fragmentation is a problem over the long term,
we should make the deep scrub process rewrite the data it has read
if/when it is too fragmented.

sage
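
For reference, here is the rough shape of the fsync() assertion being
discussed. This is only a toy sketch, not the actual BlueFS code; the
FileWriter struct and buffer member here are simplified stand-ins for
whatever the real writer keeps its unflushed bytes in:

    // Toy illustration of the idea: fsync() can assert that the
    // writer's in-memory buffer is already empty, i.e. that rocksdb
    // flushed before syncing.  Not the real BlueFS implementation.
    #include <cassert>
    #include <string>

    struct FileWriter {
      std::string buffer;   // bytes appended but not yet flushed
    };

    int do_fsync(FileWriter *h) {
      // rocksdb is expected to flush() before fsync(), so anything
      // still sitting in the buffer indicates a missed flush.
      assert(h->buffer.empty());
      // ... issue the actual flush/fsync to the device here ...
      return 0;
    }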
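
And, for anyone following along, a very rough sketch of the size-based
decision that causes the fragmentation described above. This is not
BlueStore's actual write path; the threshold value and names are only
illustrative (32 KiB is the usual HDD default for
bluestore_prefer_deferred_size, but check your build):

    // Simplified illustration of the trade-off under discussion.
    #include <cstdint>
    #include <cstdio>

    constexpr uint64_t prefer_deferred_size = 32768;  // e.g. 32 KiB

    enum class WritePath { Deferred, NewAllocation };

    // Small overwrites are journaled (deferred) and later applied in
    // place, so the on-disk layout is preserved.  Larger overwrites
    // get fresh space so data and metadata commit atomically, which
    // is what can scatter an initially contiguous object over time.
    WritePath choose_write_path(uint64_t length) {
      return length <= prefer_deferred_size ? WritePath::Deferred
                                            : WritePath::NewAllocation;
    }

    int main() {
      std::printf("4 KiB overwrite -> %s\n",
                  choose_write_path(4096) == WritePath::Deferred
                      ? "deferred (in place)" : "new allocation");
      std::printf("1 MiB overwrite -> %s\n",
                  choose_write_path(1048576) == WritePath::Deferred
                      ? "deferred (in place)" : "new allocation");
      return 0;
    }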