> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Tuesday, June 14, 2016 2:09 PM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> Cc: Mark Nelson <mnelson@xxxxxxxxxx>; Igor Fedotov <ifedotov@xxxxxxxxxxxx>;
> Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; Manavalan Krishnan
> <Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development
> <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: RE: RocksDB tuning
>
> On Tue, 14 Jun 2016, Allen Samuels wrote:
> > For flash what we want to do is leave min_alloc_size at 4K and figure
> > out how to shrink the oNode so that the KV commit fits into a minimal
> > number of writes.
> >
> > There are two obvious things to do w.r.t. shrinking the oNode size:
> >
> > (1) A sophisticated encode/decode function. I've talked about this
> > before; hopefully I'll have more time to dig into it shortly.
> >
> > (2) Reducing the stripe size. A larger stripe size tends to improve
> > sequential read/write speeds when the application is doing large I/O
> > operations (less I/O fracturing). It also reduces metadata size by
> > amortizing the fixed cost of an oNode (i.e., the parts of an oNode
> > that don't scale with the object size) across fewer oNodes. Both of
> > these phenomena provide diminishing benefits as the stripe size
> > increases. However, larger oNodes cost more to read and write for
> > random I/O operations. I believe that for flash, the current default
> > stripe size of 4MB is too large: the gains for sequential operations
> > are minimal and the penalty on random operations is too large. This
> > belief should be subjected to experimental verification AFTER we've
> > shrunk the oNode using (1). It's also possible that the optimal stripe
> > size (for flash) is HW dependent -- since the variance in performance
> > characteristics between different flash devices can be rather large.
>
> Agreed on both of these.
>
> Not mutually exclusive with (3), though: increase blob size via larger
> min_alloc_size.
> 4K random write benchmark write-amp aside, I still think we may end up
> with an onode size where the lower write latency and half- to
> quarter-size lextent/blob map reduces metadata compaction overhead
> enough to offset the larger initial txn sizes. We'll see when we
> benchmark.

Yes, we will see. I believe that on flash you'll still lose because
you're writing the data twice. But let the benchmarking proceed! (I'm
assuming an oNode diet has already happened.)

> sage
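The stripe-size trade-off discussed above can be sketched numerically. This is a back-of-envelope model only: the fixed per-oNode cost and per-extent map-entry cost below are assumed round numbers for illustration, not measured BlueStore values. It shows why amortizing the fixed oNode cost gives diminishing returns as the stripe grows, while the oNode itself keeps getting bigger (and thus costlier to read/write on random I/O).

```python
# Illustrative model of the oNode amortization argument in the thread.
# FIXED_ONODE_BYTES and PER_EXTENT_BYTES are assumptions, not measured values.

FIXED_ONODE_BYTES = 300      # per-oNode overhead that doesn't scale with object size (assumed)
PER_EXTENT_BYTES = 40        # cost of one lextent/blob map entry (assumed)
MIN_ALLOC_SIZE = 4 * 1024    # 4K allocation unit, as discussed for flash

def onode_bytes(stripe_size):
    """Approximate encoded oNode size for one stripe-sized object."""
    extents = stripe_size // MIN_ALLOC_SIZE
    return FIXED_ONODE_BYTES + PER_EXTENT_BYTES * extents

def metadata_per_mb(stripe_size):
    """Approximate metadata bytes per MB of object data at a given stripe size."""
    return onode_bytes(stripe_size) * (1024 * 1024 / stripe_size)

for stripe in (64 * 1024, 256 * 1024, 1024 * 1024, 4 * 1024 * 1024):
    print(f"stripe {stripe // 1024:5d} KB: "
          f"oNode ~{onode_bytes(stripe):6d} B, "
          f"~{metadata_per_mb(stripe):8.0f} metadata B per MB of data")
```

Under these assumed constants, total metadata per MB shrinks as the stripe grows (the fixed cost is amortized), but most of the saving is captured well below 4MB, while the per-oNode size grows linearly with the stripe, which is the random-I/O penalty the thread is weighing.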