Re: RocksDB tuning

Sage Weil <sweil@xxxxxxxxxx> · Tue, 14 Jun 2016 07:17:58 -0400 (EDT)

On Tue, 14 Jun 2016, Igor Fedotov wrote:
> These results are for compression = none and write block size limited to 
> 4K.

I've been thinking more about this and I'm wondering if we should revisit 
the choice to use a min_alloc_size of 4K on flash.  If it's 4K, then a 4K 
write means

 - 4K write (to newly allocated block)
 - bdev flush
 - kv commit (4k-ish?)
 - bdev flush

which puts a two-write lower bound on latency (the data write and the kv 
commit, each followed by a flush).  If we have a min_alloc_size of 
8K or 16K, then a 4K write is

 - kv commit (4K + 4k-ish)
 - bdev flush
 - [async] 4k write

Fewer bdev flushes (one instead of two), and only marginally more data 
written to the device.  I guess the question is whether write-amp is 
really that important for a 4k workload?
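
To make the comparison concrete, here is a toy model of the two paths.  The 
step lists are taken from the lists above; everything else is an assumption 
for illustration: the function names are hypothetical, "4k-ish" is taken as 
exactly 4096 bytes, and the rule "writes smaller than min_alloc_size go 
through the kv commit" is a simplification, not the actual BlueStore logic.

```python
# Toy model of the two write paths described above.
# Assumptions (not from the original message): the helper names, and
# treating the "4k-ish" kv commit as exactly 4096 bytes.

KV_COMMIT_BYTES = 4096  # assumed size of the "4k-ish" kv commit


def write_path(write_size, min_alloc_size):
    """Return a list of (step, bytes, synchronous) tuples for one write."""
    if write_size >= min_alloc_size:
        # Data goes to a newly allocated block, then metadata is committed.
        return [
            ("data write", write_size, True),
            ("bdev flush", 0, True),
            ("kv commit", KV_COMMIT_BYTES, True),
            ("bdev flush", 0, True),
        ]
    # Data rides along in the kv commit; the in-place write happens later.
    return [
        ("kv commit", write_size + KV_COMMIT_BYTES, True),
        ("bdev flush", 0, True),
        ("data write", write_size, False),  # [async]
    ]


def summarize(steps):
    """Return (synchronous flush count, total bytes written to the device)."""
    flushes = sum(1 for name, _, sync in steps if sync and name == "bdev flush")
    total_bytes = sum(nbytes for _, nbytes, _ in steps)
    return flushes, total_bytes


for min_alloc in (4096, 8192, 16384):
    flushes, total = summarize(write_path(4096, min_alloc))
    print(f"min_alloc_size={min_alloc}: {flushes} sync flushes, {total} bytes")
```

Under these assumptions the 4K case costs two synchronous flushes and 8K 
written, while the 8K and 16K cases cost one flush and 12K written.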

The upside of a larger min_alloc_size is that the worst-case metadata 
(onode) size drops to 1/2 or 1/4.  The sequential read cost of a previously 
random-written object will also be better (fewer IOs).
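
A quick back-of-the-envelope for the 1/2 and 1/4 figures, assuming a fully 
fragmented object where every allocation unit ends up as its own extent.  
The 4 MiB object size and the helper name are assumptions for illustration, 
and the extent count is only a proxy for onode size, not the real encoding.

```python
# Worst-case extent count for a fully fragmented object.
# Assumptions (not from the original message): a 4 MiB object, and one
# extent per allocation unit in the worst case.

OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size in bytes


def worst_case_extents(object_size, min_alloc_size):
    """Number of extents if every allocation unit is a separate extent."""
    return -(-object_size // min_alloc_size)  # ceiling division


baseline = worst_case_extents(OBJECT_SIZE, 4096)
for min_alloc in (4096, 8192, 16384):
    extents = worst_case_extents(OBJECT_SIZE, min_alloc)
    print(f"min_alloc_size={min_alloc}: {extents} extents "
          f"({extents / baseline:.2f}x of the 4K case)")
```

The same count is also an upper bound on the number of separate IOs needed 
to read such an object back sequentially.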

There is probably a case where 4k min_alloc_size is the right choice but 
it feels like we're optimizing for write-amp to the detriment of other 
more important things.  For example, even after we improve the onode 
encoding, it may be that the larger metadata results in more write-amp 
than the WAL for the 4k writes does.

sage
