RE: RocksDB tuning

> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> Sent: Tuesday, June 14, 2016 6:01 AM
> To: Sage Weil <sweil@xxxxxxxxxx>; Igor Fedotov <ifedotov@xxxxxxxxxxxx>
> Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Somnath Roy
> <Somnath.Roy@xxxxxxxxxxx>; Manavalan Krishnan
> <Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development <ceph-
> devel@xxxxxxxxxxxxxxx>
> Subject: Re: RocksDB tuning
> 
> 
> 
> On 06/14/2016 06:53 AM, Mark Nelson wrote:
> >
> >
> > On 06/14/2016 06:17 AM, Sage Weil wrote:
> >> On Tue, 14 Jun 2016, Igor Fedotov wrote:
> >>> These results are for compression = none and write block size
> >>> limited to 4K.
> >>
> >> I've been thinking more about this and I'm wondering if we should
> >> revisit the choice to use a min_alloc_size of 4K on flash.  If it's
> >> 4K, then a 4K write means
> >>
> >>  - 4K write (to newly allocated block)
> >>  - bdev flush
> >>  - kv commit (4k-ish?)
> >>  - bdev flush
> >
> > AFAIK these flushes should happen async under the hood (ie almost
> > free) on devices with proper power loss protection.
> >
> >>
> >> which puts a two-write lower bound on latency.  If we have a
> >> min_alloc_size of 8K or 16K, then a 4K write is
> >>
> >>  - kv commit (4K + 4k-ish)
> >>  - bdev flush
> >>  - [async] 4k write
> >
> > Given what I've seen about how rocksdb behaves (even on ramdisk), I
> > think this is actually going to be worse than above in a lot of cases.
> > I could be wrong though.  For SSDs that don't have PLP this might be
> > significantly faster.
> 
> Sage pointed out that the smaller min_alloc_size will increase the size of the
> onode.  More than anything, this would probably be the reason, imho, to
> increase the min_alloc_size (so long as we can keep the extra data from
> moving out of the WAL).

For flash, what we want to do is leave min_alloc_size at 4K and figure out how to shrink the oNode so that the KV commit fits into a minimal number of writes.
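
To make "minimal number of writes" concrete, here's a back-of-the-envelope sketch. The byte counts are illustrative and the log framing overhead is an assumption, not RocksDB's actual format:

#include <cstddef>
#include <cstdio>

// Minimum device write unit, assumed to be a 4K block.
constexpr std::size_t kDeviceBlock = 4096;

// Number of 4K device writes needed to persist one kv commit. With
// min_alloc_size at 4K the data goes to a freshly allocated block, so
// the commit itself carries metadata (the oNode) plus log framing only.
std::size_t commit_writes(std::size_t encoded_onode_bytes,
                          std::size_t framing_bytes) {
  return (encoded_onode_bytes + framing_bytes + kDeviceBlock - 1)
         / kDeviceBlock;
}

int main() {
  std::printf("%zu\n", commit_writes(5000, 100));  // 2 log writes
  std::printf("%zu\n", commit_writes(3000, 100));  // 1 log write
}

Once the encoded oNode drops below roughly one block (minus framing), the metadata side of a 4K write collapses to a single log write.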

There are two obvious things to do w.r.t. shrinking the oNode size:

(1) A sophisticated encode/decode function. I've talked about this before; hopefully I'll have more time to dig into this shortly. (See the sketch after this list for the flavor of encoding I mean.)
(2) Reducing the stripe size. A larger stripe size tends to improve sequential read/write speeds when the application is doing large I/O operations (less I/O fracturing). It also reduces total metadata size by amortizing the fixed portion of an oNode (i.e., the stuff in an oNode that doesn't scale with the object size) across fewer oNodes; for example, halving the stripe size doubles the oNode count and hence doubles that fixed overhead. Both of these benefits diminish as the stripe size increases, while the larger oNodes cost more to read and write on every random I/O operation. I believe that for flash, the current default stripe size of 4MB is too large, in that the gains for sequential operations are minimal and the penalty on random operations is too large... This belief should be subjected to experimental verification AFTER we've shrunk the oNode using (1). It's also possible that the optimal stripe size (for flash) is HW dependent, since the variance in performance characteristics between different flash devices can be rather large.
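
To give (1) some shape, here is a minimal strawman of the kind of delta/varint encoding that could shrink an extent-heavy oNode. The Extent record is hypothetical and this is a sketch, not the actual oNode layout:

#include <cstdint>
#include <vector>

// Hypothetical extent record; the real oNode carries more than this.
struct Extent {
  uint64_t logical_off;  // offset within the object
  uint64_t blob_off;     // offset on the device
  uint32_t length;
};

// LEB128-style varint: small values cost 1 byte instead of a fixed 8.
static void put_varint(std::vector<uint8_t>& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back(uint8_t(v) | 0x80);
    v >>= 7;
  }
  out.push_back(uint8_t(v));
}

// Zigzag-map signed deltas so small negative jumps also stay small.
static void put_svarint(std::vector<uint8_t>& out, int64_t v) {
  put_varint(out, (uint64_t(v) << 1) ^ uint64_t(v >> 63));
}

// Delta-encode a logically sorted extent map: each field is stored as
// the difference from the previous extent, so a run of adjacent 4K
// extents costs a few bytes each instead of 20 fixed bytes.
std::vector<uint8_t> encode_extents(const std::vector<Extent>& ex) {
  std::vector<uint8_t> out;
  uint64_t prev_log = 0;
  int64_t prev_blob = 0;
  put_varint(out, ex.size());
  for (const Extent& e : ex) {
    put_varint(out, e.logical_off - prev_log);          // sorted, so >= 0
    put_svarint(out, int64_t(e.blob_off) - prev_blob);  // may be negative
    put_varint(out, e.length);
    prev_log = e.logical_off;
    prev_blob = int64_t(e.blob_off);
  }
  return out;
}

Decode is the mirror image. The win is biggest for random-write-heavy objects, where the extent map dominates the oNode.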

> 
> >
> >>
> >> Fewer bdev flushes, and only marginally more writes to the device.  I
> >> guess the question is whether write-amp is really that important
> >> for a 4k workload?
> >>
> >> The upside of a larger min_alloc_size is that the worst-case metadata
> >> (onode) size is 1/2 or 1/4.  The sequential read cost of a previously
> >> random-written object will also be better (fewer IOs).
> >>
> >> There is probably a case where 4k min_alloc_size is the right choice
> >> but it feels like we're optimizing for write-amp to the detriment of
> >> other more important things.  For example, even after we improve the
> >> onode encoding, it may be that the larger metadata results in more
> >> write-amp than the WAL for the 4k writes does.
> >>
> >> sage
> >>


