Re: RocksDB tuning

On 06/14/2016 06:53 AM, Mark Nelson wrote:


On 06/14/2016 06:17 AM, Sage Weil wrote:
On Tue, 14 Jun 2016, Igor Fedotov wrote:
These results are for compression = none and a write block size limited
to 4K.

I've been thinking more about this and I'm wondering if we should revisit
the choice to use a min_alloc_size of 4K on flash.  If it's 4K, then a 4K
write means

 - 4K write (to newly allocated block)
 - bdev flush
 - kv commit (4k-ish?)
 - bdev flush

AFAIK these flushes should happen async under the hood (ie almost free)
on devices with proper power loss protection.


which puts a two-write lower bound on latency.  If we have a
min_alloc_size of 8K or 16K, then a 4K write is

 - kv commit (4K + 4k-ish)
 - bdev flush
 - [async] 4k write
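
To make the two sequences above easier to compare, here is a rough
Python sketch that just counts the steps.  It is not BlueStore code:
the Step type, the function names, and the 4096-byte figure for the
"4k-ish" kv commit are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    bytes_written: int
    is_flush: bool = False
    blocking: bool = True  # False for work done off the latency path


# Assumption: treat the "4k-ish" kv commit as exactly 4096 bytes.
KV_OVERHEAD = 4096


def write_path(write_size, min_alloc_size):
    """Return the steps listed in the thread for one small write."""
    if write_size >= min_alloc_size:
        # Write fills a whole allocation unit: data first, then metadata.
        return [
            Step("data write to newly allocated block", write_size),
            Step("bdev flush", 0, is_flush=True),
            Step("kv commit", KV_OVERHEAD),
            Step("bdev flush", 0, is_flush=True),
        ]
    # Write is smaller than the allocation unit: data rides in the kv
    # commit and is written to its final location asynchronously.
    return [
        Step("kv commit (data + metadata)", write_size + KV_OVERHEAD),
        Step("bdev flush", 0, is_flush=True),
        Step("async data write", write_size, blocking=False),
    ]


def summarize(steps):
    """Count blocking writes, blocking flushes, and total device bytes."""
    blocking = [s for s in steps if s.blocking]
    return {
        "blocking_writes": sum(1 for s in blocking if not s.is_flush),
        "blocking_flushes": sum(1 for s in blocking if s.is_flush),
        "total_bytes": sum(s.bytes_written for s in steps),
    }


for alloc in (4096, 16384):
    print(alloc, summarize(write_path(4096, alloc)))
```

Under these assumed sizes, a 4K write costs 2 blocking writes and 2
flushes (8K total) with min_alloc_size = 4K, versus 1 blocking write
and 1 flush (12K total) with min_alloc_size = 16K.  The model ignores
rocksdb's own behaviour entirely, so it says nothing about real
latency.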

Given what I've seen about how rocksdb behaves (even on ramdisk), I
think this is actually going to be worse than above in a lot of cases. I
could be wrong though.  For SSDs that don't have PLP this might be
significantly faster.

Sage pointed out that a smaller min_alloc_size will increase the size
of the onode. More than anything, this would probably be the reason,
IMHO, to increase min_alloc_size (so long as we can keep the extra data
from moving out of the WAL).



Fewer bdev flushes, and only marginally more writes to the device.  I
guess the question is whether write-amp is really that important for a
4k workload?

The upside of a larger min_alloc_size is that the worst-case metadata
(onode) size is 1/2 or 1/4 of what it is at 4K.  The sequential read
cost of a previously random-written object will also be better (fewer
IOs).
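
As a quick check of the 1/2 and 1/4 figures, the worst case can be
sketched as one extent per allocation unit.  The 4 MiB object size and
the function name below are assumptions for illustration only; the
real onode encoding is not modelled.

```python
def worst_case_extents(object_size, min_alloc_size):
    """Extents needed if every allocation unit lands somewhere different."""
    return -(-object_size // min_alloc_size)  # ceiling division


# Assumption: a 4 MiB object that has been fully overwritten at random.
OBJECT_SIZE = 4 * 1024 * 1024

baseline = worst_case_extents(OBJECT_SIZE, 4096)
for alloc in (4096, 8192, 16384):
    extents = worst_case_extents(OBJECT_SIZE, alloc)
    # The extent count also bounds the IOs for a sequential read.
    print(f"min_alloc_size={alloc}: {extents} extents, "
          f"{extents / baseline:.2f}x of the 4K case")
```

The same count is an upper bound on the number of IOs needed to read
the object back sequentially, which is the second benefit mentioned
above.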

There is probably a case where 4k min_alloc_size is the right choice but
it feels like we're optimizing for write-amp to the detriment of other
more important things.  For example, even after we improve the onode
encoding, it may be that the larger metadata results in more write-amp
than the WAL for the 4k writes does.

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



