On 04/13/2015 10:27 AM, Sage Weil wrote:
> [adding ceph-devel]
>
> On Mon, 13 Apr 2015, Chen, Xiaoxi wrote:
>> Hi,
>>
>> Actually I have done a tuning survey on RocksDB when I was updating
>> RocksDB to a newer version and exposed the tunables in ceph.conf.
>>
>> What we need to ensure is that the WAL never hits the disk.
>
> We'll always have to pay that 1x write to the log; we just want to make
> sure it doesn't turn into 2x.  I take it you're assuming the log is on
> an SSD (not disk)?
>
>> The RocksDB write-ahead log already introduces a 1x write; if the data
>> is flushed to an SST at level 0, that becomes 2x, not to mention any
>> further compaction.
>>
>> The tunables that make the difference are:
>>    write_buffer_size
>>    max_write_buffer_number
>>    min_write_buffer_number_to_merge
>>
>> Say we have write_buffer_size = 512M, max_write_buffer_number = 6,
>> min_write_buffer_number_to_merge = 2
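For reference, here is a minimal C++ sketch (not from the thread) of how those three knobs map onto the RocksDB Options struct; the 512M/6/2 values are simply Xiaoxi's example above, and the /tmp path is made up:

  #include <rocksdb/db.h>
  #include <rocksdb/options.h>

  int main() {
    rocksdb::Options opts;
    opts.create_if_missing = true;

    // A single memtable grows to 512MB before it is marked immutable.
    opts.write_buffer_size = 512ULL * 1024 * 1024;

    // Keep up to 6 memtables (1 active + 5 immutable) in memory before
    // incoming writes stall.
    opts.max_write_buffer_number = 6;

    // Merge at least 2 immutable memtables before flushing to an L0 SST,
    // so overwrites of the same keys are absorbed in memory and less data
    // is written to level 0 on top of the 1x WAL write.
    opts.min_write_buffer_number_to_merge = 2;

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/rocksdb-tuning-test", &db);
    if (!s.ok())
      return 1;
    delete db;
    return 0;
  }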
Attached are test results for a single PCIe SSD comparing filestore, newstore + fsync + default tunables, newstore + fsync + Xiaoxi's tunables, and newstore + fdatasync + Xiaoxi's tunables.

Basically Xiaoxi's tunables help, and fdatasync helps a little more (mostly at small IO sizes), but still not enough to beat filestore, though newstore *does* now do consistently better than filestore with 4MB writes.
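The fdatasync gain is presumably because fdatasync only waits for the file data (and any size change) to reach stable storage, while fsync also forces out inode metadata such as mtime; a rough illustration of the two calls on a log fd:

  #include <unistd.h>

  // Illustrative only: syncing a write-ahead log file descriptor.
  void sync_log(int fd, bool data_only) {
    if (data_only)
      fdatasync(fd);   // flush data (and size) only
    else
      fsync(fd);       // flush data plus inode metadata (mtime, etc.)
  }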
Mark
Attachment: newstore_xiaoxi_fdatasync.pdf