RE: RocksDB tuning

Sage Weil <sweil@xxxxxxxxxx> · Tue, 14 Jun 2016 17:08:44 -0400 (EDT)

On Tue, 14 Jun 2016, Allen Samuels wrote:
> For flash what we want to do is leave min_alloc_size at 4K and figure 
> out how to shrink the oNode so that the KV commit fits into a minimal 
> number of writes.
> 
> There are two obvious things to do w.r.t. shrinking the oNode size:
> 
> (1) sophisticated encode/decode function. I've talked about this before, 
> hopefully I'll have more time to dig into this shortly.
>
> (2) Reducing the stripe size. A larger stripe size tends to improve 
> sequential read/write speeds when the application is doing large I/O 
> operations (less I/O fracturing). It will also reduce metadata size by 
> amortizing the fixed size of an oNode (i.e., the stuff in an oNode that 
> doesn't scale with the object size) across fewer oNodes. Both of these 
> phenomena provide decreasing benefits as the stripe size increases. 
> However, larger oNodes cost more to read/write for random I/O 
> operations. I believe that for flash, the current default stripe size of 
> 4MB is too large in that the gains for sequential operations are minimal 
> and the penalty on random operations is too large... This belief should 
> be subjected to experimental verification AFTER we've shrunk the oNode 
> using (1). It's also possible that the optimal stripe size (for flash) 
> is HW dependent -- since the variance in performance characteristics 
> between different flash devices can be rather large.

Agreed on both of these.
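
To put rough numbers on the stripe-size tradeoff in (2), here is a toy model. 
It is only a sketch: FIXED_BYTES and PER_EXTENT_BYTES are made-up 
placeholders, not measured onode sizes, and it assumes the worst case of one 
extent entry per min_alloc_size unit (a fully fragmented object).

```python
# Toy model only: FIXED_BYTES and PER_EXTENT_BYTES are hypothetical
# illustrative values, not measured BlueStore onode sizes.
FIXED_BYTES = 200        # assumed onode overhead that does not scale with object size
PER_EXTENT_BYTES = 30    # assumed encoded cost of one extent/blob entry
MIN_ALLOC = 4 * 1024     # min_alloc_size left at 4K, as proposed for flash


def onode_bytes(stripe_size, min_alloc=MIN_ALLOC):
    """Worst-case onode size: fixed part plus one entry per allocation unit."""
    return FIXED_BYTES + PER_EXTENT_BYTES * (stripe_size // min_alloc)


def fixed_overhead_per_mb(stripe_size):
    """Fixed onode bytes amortized over each MB of object data."""
    return FIXED_BYTES / (stripe_size / (1024 * 1024))


for size_kb in (64, 256, 1024, 4096):
    size = size_kb * 1024
    print(f"{size_kb:>5}K stripe: onode ~{onode_bytes(size):>6} B, "
          f"fixed overhead {fixed_overhead_per_mb(size):8.1f} B/MB")
```

Under these assumptions the amortized fixed overhead falls off quickly as the 
stripe grows, while the onode that a random write has to read and rewrite 
keeps growing linearly. Real numbers depend on the encoding in (1).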

Not mutually exclusive with (3), though: increase blob size via larger 
min_alloc_size.  4K random write benchmark write-amp aside, I still think 
we may end up with an onode size where the lower write latency and half- to 
quarter-size lextent/blob map reduce metadata compaction overhead enough 
to offset the larger initial txn sizes.  We'll see when we benchmark.
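
For a sense of where "half to quarter-size" comes from, a minimal sketch of 
the arithmetic. It assumes the worst case of one lextent/blob entry per 
min_alloc_size unit; extent_map_entries is a hypothetical helper, not a 
BlueStore function.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # 4MB object, matching the current default stripe size


def extent_map_entries(object_size, min_alloc_size):
    """Assumed worst case: one lextent/blob entry per allocation unit."""
    return -(-object_size // min_alloc_size)  # ceiling division


baseline = extent_map_entries(OBJECT_SIZE, 4 * 1024)
for min_alloc_kb in (4, 8, 16):
    entries = extent_map_entries(OBJECT_SIZE, min_alloc_kb * 1024)
    print(f"min_alloc_size={min_alloc_kb:>2}K: {entries:>4} entries "
          f"({entries / baseline:.2f}x of the 4K map)")
```

This only counts map entries; it says nothing about the larger initial txn 
sizes, which is the part that needs benchmarking.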

sage