RE: RocksDB tuning


> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> Sent: Tuesday, June 14, 2016 4:54 AM
> To: Sage Weil <sweil@xxxxxxxxxx>; Igor Fedotov <ifedotov@xxxxxxxxxxxx>
> Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Somnath Roy
> <Somnath.Roy@xxxxxxxxxxx>; Manavalan Krishnan
> <Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development <ceph-
> devel@xxxxxxxxxxxxxxx>
> Subject: Re: RocksDB tuning
> 
> 
> 
> On 06/14/2016 06:17 AM, Sage Weil wrote:
> > On Tue, 14 Jun 2016, Igor Fedotov wrote:
> >> These results are for compression = none and write block size limited
> >> to 4K.
> >
> > I've been thinking more about this and I'm wondering if we should
> > revisit the choice to use a min_alloc_size of 4K on flash.  If it's
> > 4K, then a 4K write means
> >
> >  - 4K write (to newly allocated block)
> >  - bdev flush
> >  - kv commit (4k-ish?)
> >  - bdev flush
> 
> AFAIK these flushes should happen async under the hood (i.e., almost free)
> on devices with proper power loss protection.

Correct, from the device perspective. However, you're still burning CPU time on the host, which is often the bottleneck for flash performance.

It'll pay to have a toggle to disable the bdev flushes when you're known to be running on enterprise-grade devices (i.e., ones with proper power loss protection).
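Something like this is what I have in mind; a minimal sketch only, with the
option name and wiring invented for illustration (this is not existing
BlueStore code):

    #include <cassert>
    #include <unistd.h>

    // Hypothetical sketch: gate the bdev flush behind a config toggle.
    // "flush_enabled" is an invented knob, not a real Ceph option.
    struct BlockDeviceSketch {
      int fd = -1;
      bool flush_enabled = true;   // operator sets this false on PLP hardware

      void flush() {
        if (!flush_enabled)
          return;                  // skip the syscall and the host CPU cost
        int r = ::fdatasync(fd);   // otherwise persist the device write cache
        assert(r == 0);
      }
    };

On PLP devices the flush is already a near no-op at the device, but skipping
the syscall entirely also saves the host-side round trip.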

> 
> >
> > which puts a two-write lower bound on latency.  If we have
> > min_alloc_size of 8K or 16K, then a 4K write is
> >
> >  - kv commit (4K + 4k-ish)
> >  - bdev flush
> >  - [async] 4k write
> 
> Given what I've seen about how rocksdb behaves (even on ramdisk), I think
> this is actually going to be worse than above in a lot of cases.
> I could be wrong though.  For SSDs that don't have PLP this might be
> significantly faster.
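For what it's worth, here are the two sequences side by side as I read them;
illustrative stubs only, not the real BlueStore interfaces:

    #include <cstddef>

    // Illustrative stand-ins for the block device and kv store.
    struct Device {
      void write(const void* data, size_t len)       { /* submit data write */ }
      void flush()                                   { /* fdatasync-equivalent */ }
      void queue_write(const void* data, size_t len) { /* deferred apply */ }
    };
    struct KvStore {
      void commit() { /* rocksdb write batch */ }
    };

    // Path A: min_alloc_size = 4K; a 4K write lands in a fresh allocation.
    void path_4k(Device& bdev, KvStore& kv, const void* data) {
      bdev.write(data, 4096);  // data write to the newly allocated block
      bdev.flush();            // flush #1, on the latency path
      kv.commit();             // ~4K-ish of metadata commit
      bdev.flush();            // flush #2, also on the latency path
    }

    // Path B: min_alloc_size = 8K/16K; the same 4K write rides the WAL.
    void path_16k(Device& bdev, KvStore& kv, const void* data) {
      kv.commit();                   // 4K data + ~4K-ish metadata, one commit
      bdev.flush();                  // the only synchronous flush
      bdev.queue_write(data, 4096);  // apply happens async, off the latency path
    }

So path B trades one extra 4K device write (the async apply) for one fewer
synchronous flush, which is exactly where the PLP question bites.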
> 
> >
> > Fewer bdev flushes, and only marginally more writes to the device.  I
> > guess the question is whether write-amp is really that important for a
> > 4k workload?
> >
> > The upside of a larger min_alloc_size is that the worst-case metadata
> > (onode) size is 1/2 or 1/4.  The sequential read cost of a previously
> > random-written object will also be better (fewer IOs).
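To put rough numbers on the onode point (illustrative only): a 4 MB object
random-written at 4K granularity can carry up to 4 MB / 4 KB = 1024 extents
in its extent map, while at 16K min_alloc_size the cap is 256, so the
worst-case onode is a quarter the size and a sequential read of the object
needs at most 256 IOs instead of 1024.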
> >
> > There is probably a case where 4k min_alloc_size is the right choice, but
> > it feels like we're optimizing for write-amp to the detriment of other
> > more important things.  For example, even after we improve the onode
> > encoding, it may be that the larger metadata results in more write-amp
> > than the WAL for the 4k writes does.
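Agreed that this cuts both ways. With made-up but plausible numbers: if a
hot onode encodes to ~8K and gets rewritten on every 4K client write, that
is already 2x of metadata write-amp, more than the extra 1x the WAL adds by
writing the 4K of data twice.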
> >
> > sage
> >