Hi Sage,

I have tried bluestore_min_alloc_size = 4096 so that overwrites always
reallocate into new extents. This avoids the double write in theory, but
on a high-speed device like an NVMe it still has a performance problem
with metadata updates, and the bottleneck is apparently in RocksDB. I
think the compaction and data organization in RocksDB may matter a lot.
There may be a lot of work to do in RocksDB and BlueFS, such as trying
different compaction strategies and using fewer levels in RocksDB. Are
there any guides on those, and what are the future directions for the
current BlueStore performance issues?

Regards
Ning Yao

2016-06-27 20:31 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
> On Mon, 27 Jun 2016, myoungwon oh wrote:
>> Hi, I have questions about BlueStore (the 4K random write case).
>>
>> So far, we have used NVRAM (PCIe) as the journal and SATA SSDs as the
>> data disks (FileStore), so we got a performance gain from the NVRAM
>> journal. However, in the current BlueStore design, 4K-aligned data
>> seems to be written to the data disk first, and then the metadata is
>> written to the RocksDB WAL. This design removes the "double write" in
>> the object store, but in our case the NVRAM cannot be fully utilized.
>>
>> So, my questions are:
>>
>> 1. Can BlueStore write to the WAL first, as FileStore does?
>
> You can do it indirectly with bluestore_min_alloc_size=65536, which will
> send anything smaller than this value through the wal path. Please let
> us know what effect this has on your latency/performance!
>
>> 2. If not, is using bcache or flashcache for NVRAM on top of the SSDs
>> the right answer?
>
> This is also possible, but I expect we'd like to make this work out of
> the box if we can!
>
> sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
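
[Editor's note: the two knobs discussed in this thread can be set in
ceph.conf roughly as follows. The bluestore_min_alloc_size values are the
ones mentioned above; the bluestore_rocksdb_options string is only an
illustrative sketch of the kind of tuning Ning Yao suggests (fewer levels,
a different compaction style) -- the specific option values are
assumptions, not recommendations from this thread.]

```ini
[osd]
# 65536 sends writes smaller than 64K through the WAL (deferred) path,
# as Sage suggests; 4096 instead forces small overwrites to reallocate
# into new extents, as Ning Yao tried.
bluestore_min_alloc_size = 65536

# Option string passed through to RocksDB. The values below are
# illustrative assumptions for experimenting with fewer levels and
# universal compaction, not tested recommendations.
bluestore_rocksdb_options = compression=kNoCompression,num_levels=4,compaction_style=kCompactionStyleUniversal,max_write_buffer_number=4,write_buffer_size=268435456
```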