RE: RocksDB tuning

> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Tuesday, June 14, 2016 4:18 AM
> To: Igor Fedotov <ifedotov@xxxxxxxxxxxx>
> Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Somnath Roy
> <Somnath.Roy@xxxxxxxxxxx>; Mark Nelson <mnelson@xxxxxxxxxx>;
> Manavalan Krishnan <Manavalan.Krishnan@xxxxxxxxxxx>; Ceph
> Development <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: Re: RocksDB tuning
> 
> On Tue, 14 Jun 2016, Igor Fedotov wrote:
> > These results are for compression = none and write block size limited to
> > 4K.
> 
> I've been thinking more about this and I'm wondering if we should revisit the
> choice to use a min_alloc_size of 4K on flash.  If it's 4K, then a 4K write means
> 
>  - 4K write (to newly allocated block)
>  - bdev flush
>  - kv commit (4k-ish?)
>  - bdev flush
> 
> which puts a 2 write lower bound on latency.  If we have min_alloc_size of 8K
> or 16K, then a 4K write is
> 
>  - kv commit (4K + 4k-ish)
>  - bdev flush
>  - [async] 4k write
> 
> Fewer bdev flushes, and only marginally more writes to the device.  I guess
> the question is whether write-amp is really that important for a 4k
> workload?

I don't think most people would agree that going from 2 writes to 3 counts as "marginally more" ;-)
Sadly, the 4K write workload is a critical benchmark for people (regardless of whether it actually represents any real workload), and going with the KV commit path will dramatically lower the measured performance.

The true difference in performance associated with this choice will only become apparent after we've put the oNode on a diet. Right now, the "4k-ish" commit of the oNode is much, much larger than 4K, and that hides the performance difference between the two paths.
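
To make the arithmetic concrete, here is a rough Python sketch of the two paths as Sage listed them. It is a toy model, not BlueStore code: the function names are made up for illustration, it counts only bytes sent to the device and flushes on the latency path, and the 64K commit size is an assumed stand-in for "much larger than 4K", not a measured number.

```python
# Toy model of one 4K client write, counted in bytes sent to the device.
# Hypothetical helper names; this is not BlueStore code.
UNIT = 4096  # one 4K block


def direct_path(kv_commit_bytes=UNIT):
    """min_alloc_size = 4K: data write, flush, kv commit, flush."""
    return {"bytes": UNIT + kv_commit_bytes, "sync_flushes": 2}


def kv_path(kv_commit_bytes=UNIT):
    """min_alloc_size > 4K: kv commit carrying the data, flush, async 4K write."""
    return {"bytes": (UNIT + kv_commit_bytes) + UNIT, "sync_flushes": 1}


def write_ratio(kv_commit_bytes):
    """Device bytes on the kv path relative to the direct path."""
    return kv_path(kv_commit_bytes)["bytes"] / direct_path(kv_commit_bytes)["bytes"]


if __name__ == "__main__":
    # The 64K commit is an assumed size, used only to show the trend.
    for label, commit in (("4K commit", UNIT), ("64K commit (assumed)", 16 * UNIT)):
        print(label, direct_path(commit), kv_path(commit), round(write_ratio(commit), 2))
```

With a true 4K commit the ratio is 1.5, i.e. the 2 -> 3 above. With the assumed 64K commit it drops to about 1.06, which is the sense in which today's large oNode hides the difference.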

If you're running on a hybrid system, i.e., metadata on flash and HDD for raw data, then the second path is the right choice -- so we'll end up needing to support both code paths :)
 

> 
> The upside of a larger min_alloc_size is the worst case metadata (onode) size
> is 1/2 or 1/4.  The sequential read cost of a previously random-written object
> will also be better (fewer IOs).
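
The 1/2 and 1/4 figures follow from counting allocation units. A quick sketch, assuming a fully random-written object where every allocation unit ends up as its own extent, and assuming a 4 MiB object size purely for illustration (the function name is hypothetical):

```python
# Worst case: every min_alloc_size unit of the object is a separate extent,
# so the extent count (and roughly the extent map in the onode) scales with
# object_size / min_alloc_size.
def worst_case_extents(object_size, min_alloc_size):
    # Ceiling division, so a partial trailing unit still counts as one extent.
    return -(-object_size // min_alloc_size)


OBJECT_SIZE = 4 * 1024 * 1024  # assumed 4 MiB object, for illustration only

if __name__ == "__main__":
    for min_alloc_size in (4096, 8192, 16384):
        print(min_alloc_size, worst_case_extents(OBJECT_SIZE, min_alloc_size))
```

Under these assumptions the counts are 1024, 512 and 256 extents for 4K, 8K and 16K respectively. The same counts bound the number of IOs needed to read the object back sequentially.
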
> 
> There is probably a case where 4k min_alloc_size is the right choice but it
> feels like we're optimizing for write-amp to the detriment of other more
> important things.  For example, even after we improve the onode encoding,
> it may be that the larger metadata results in more write-amp than the WAL
> for the 4k writes does.
> 
> sage
