On 04/13/2015 10:27 AM, Sage Weil wrote:
> [adding ceph-devel]
>
> On Mon, 13 Apr 2015, Chen, Xiaoxi wrote:
>> Hi,
>>
>> Actually I have done a tuning survey on RocksDB when I was updating
>> RocksDB to a newer version and exposed the tunables in ceph.conf.
>>
>> What we need to ensure is that the WAL never hits the disk.
>
> We'll always have to pay that 1x write to the log; we just want to make
> sure it doesn't turn into 2x.  I take it you're assuming the log is on
> an SSD (not disk)?
>
>> The RocksDB write-ahead log already introduces a 1x write; if the data
>> is flushed to an SST at level 0, that becomes 2x, not to mention any
>> further compaction.
>>
>> The tunables that make the difference are:
>>    write_buffer_size
>>    max_write_buffer_number
>>    min_write_buffer_number_to_merge
>>
>> Say we have write_buffer_size = 512M, max_write_buffer_number = 6,
>> min_write_buffer_number_to_merge = 2
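For reference, here is a minimal C++ sketch (not from the thread) of how those three knobs map onto the RocksDB Options struct; the 512M/6/2 values are simply Xiaoxi's example above, and the /tmp path is made up:

  #include <rocksdb/db.h>
  #include <rocksdb/options.h>

  int main() {
    rocksdb::Options opts;
    opts.create_if_missing = true;

    // A single memtable grows to 512MB before it is marked immutable.
    opts.write_buffer_size = 512ULL * 1024 * 1024;

    // Keep up to 6 memtables (1 active + 5 immutable) in memory before
    // incoming writes stall.
    opts.max_write_buffer_number = 6;

    // Merge at least 2 immutable memtables before flushing to an L0 SST,
    // so overwrites of the same keys are absorbed in memory and less data
    // is written to level 0 on top of the 1x WAL write.
    opts.min_write_buffer_number_to_merge = 2;

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/rocksdb-tuning-test", &db);
    if (!s.ok())
      return 1;
    delete db;
    return 0;
  }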
Attached are test results for a single PCIe SSD comparing filestore, newstore + fsync + default tunables, newstore + fsync + Xiaoxi's tunables, and newstore + fdatasync + Xiaoxi's tunables.

Basically Xiaoxi's tunables help, and fdatasync helps a little more (mostly at small IO sizes), but still not enough to beat filestore, though newstore *does* now do consistently better than filestore with 4MB writes.
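The fdatasync gain is presumably because fdatasync only waits for the file data (and any size change) to reach stable storage, while fsync also forces out inode metadata such as mtime; a rough illustration of the two calls on a log fd:

  #include <unistd.h>

  // Illustrative only: syncing a write-ahead log file descriptor.
  void sync_log(int fd, bool data_only) {
    if (data_only)
      fdatasync(fd);   // flush data (and size) only
    else
      fsync(fd);       // flush data plus inode metadata (mtime, etc.)
  }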
Mark
Attachment: newstore_xiaoxi_fdatasync.pdf