RE: RocksDB tuning

> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Friday, June 10, 2016 7:55 AM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> Cc: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; Mark Nelson
> <mnelson@xxxxxxxxxx>; Manavalan Krishnan
> <Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development <ceph-
> devel@xxxxxxxxxxxxxxx>
> Subject: RE: RocksDB tuning
> 
> On Fri, 10 Jun 2016, Allen Samuels wrote:
> > Checksums are definitely a part of the problem, but I suspect the
> > smaller part of the problem. This particular use-case (random 4K
> > overwrites without the WAL stuff) is the worst-case from an encoding
> > perspective and highlights the inefficiency in the current code.
> >
> > As has been discussed earlier, a specialized encode/decode
> > implementation for these data structures is clearly called for.
> >
> > IMO, you'll be able to cut the size of this by AT LEAST a factor of 3 or 4
> > without a lot of effort. The price will be a somewhat increased CPU cost
> > for the serialize/deserialize operation.
> >
> > If you think of this as an application-specific data compression
> > problem, here is a short list of potential compression opportunities.
> >
> > (1) Encoded sizes and offsets are 8-byte values; converting these to block
> > values will drop 9 or 12 bits from each value. Also, the range for these
> > values is usually only 2^22 -- often much less -- meaning that there's 3-5
> > bytes of zeros at the top of each word that can be dropped.
> > (2) Encoded device addresses are often less than 2^32, meaning there's 3-4
> > bytes of zeros at the top of each word that can be dropped.
> > (3) Encoded offsets and sizes are often exactly "1" block; clever choices
> > of formatting can eliminate these entirely.
> >
> > IMO, an optimized encoded form of the extent table will be around 1/4
> > of the current encoding (for this use-case) and will likely result in
> > an Onode that's only 1/3 of the size that Somnath is seeing.
> 
> That will be true for the lextent and blob extent maps.  I'm guessing this is a
> small part of the ~5K Somnath saw.  If his objects are 4MB then 4KB of it
> (80%) is the csum_data vector, which is a flat vector of
> u32 values that are presumably not very compressible.

I don't think that's what Somnath is seeing (obviously some data here will sharpen up our speculations). But in his use case, I believe he has a separate blob and pextent for each 4K write (since the object has been subjected to random 4K overwrites), which means the data structures hold at least one address and one length for each 4K block (and likely much more in the lextent and blob maps, as you alluded to above). The encoding of just this information alone is larger than the checksum data.
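
To put a rough number on that, here is a back-of-the-envelope sketch of what
points (1)-(3) above could buy for a random-4K-overwrite extent list. This is
only an illustration, not the actual BlueStore encoder; the 4K block size and
the helper names (put_varint, encode_extent) are assumptions made up for the
example.

// Illustrative sketch only -- not BlueStore's actual encoder.  It assumes a
// 4K block size, and the helper names below are made up for the example.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

static const unsigned BLOCK_SHIFT = 12;   // 4K blocks (assumption)

// LEB128-style varint: small values take 1-2 bytes instead of a full 8.
static void put_varint(std::string& out, uint64_t v) {
  do {
    uint8_t b = v & 0x7f;
    v >>= 7;
    if (v) b |= 0x80;
    out.push_back(static_cast<char>(b));
  } while (v);
}

struct Extent { uint64_t offset; uint64_t length; };   // byte granularity

// Point (1)/(2): shift out the block bits and varint-encode what remains.
// Point (3): a length of exactly one block becomes 0, i.e. a single byte.
static void encode_extent(std::string& out, const Extent& e) {
  put_varint(out, e.offset >> BLOCK_SHIFT);
  put_varint(out, (e.length >> BLOCK_SHIFT) - 1);
}

int main() {
  // Random-4K-overwrite pattern: one 4K extent per blob, addresses < 2^32.
  std::vector<Extent> extents;
  for (uint64_t i = 0; i < 1024; ++i)
    extents.push_back({(i * 37 + 5) << BLOCK_SHIFT, 4096});

  std::string compact;
  for (const auto& e : extents)
    encode_extent(compact, e);

  size_t raw = extents.size() * 2 * sizeof(uint64_t);   // 16 bytes per extent
  std::cout << "raw:     " << raw << " bytes\n"
            << "compact: " << compact.size() << " bytes\n";
  return 0;
}

With 1024 one-block extents that comes out to roughly 3-4 bytes per extent
instead of 16, which is where the factor of 3-4 estimate above comes from,
before even touching the checksum side.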

> 
> We could perhaps break these into a separate key or keyspace.. That'll give
> rocksdb a bit more computation work to do (for a custom merge operator,
> probably, to update just a piece of the value) but for a 4KB value I'm not sure
> it's big enough to really help.  Also we'd lose locality, would need a second
> get to load csum metadata on read, etc.  :/  I don't really have any good ideas
> here.
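
For what it's worth, the separate-keyspace idea with a merge operator might
look something like the sketch below. This is just an illustration, not actual
Ceph or RocksDBStore code: it assumes the per-object csum data lives under its
own key as a flat array of little-endian u32 values, and that each 4K
overwrite issues an 8-byte {block index, new csum} merge operand.

// Sketch of the "separate csum key + merge operator" idea above.  Not actual
// Ceph/RocksDBStore code; the value layout and operand format are assumptions.
#include <cstdint>
#include <cstring>
#include <string>
#include <rocksdb/merge_operator.h>

class CsumPatchOperator : public rocksdb::AssociativeMergeOperator {
 public:
  bool Merge(const rocksdb::Slice& /*key*/,
             const rocksdb::Slice* existing_value,
             const rocksdb::Slice& value,              // {index, csum} operand
             std::string* new_value,
             rocksdb::Logger* /*logger*/) const override {
    if (value.size() != 8)
      return false;                                    // malformed operand
    uint32_t index, csum;
    std::memcpy(&index, value.data(), 4);
    std::memcpy(&csum, value.data() + 4, 4);

    // Start from the existing csum array (empty if the key is new).
    if (existing_value)
      new_value->assign(existing_value->data(), existing_value->size());
    else
      new_value->clear();

    // Grow the array if this block index hasn't been written yet.
    size_t need = (static_cast<size_t>(index) + 1) * 4;
    if (new_value->size() < need)
      new_value->resize(need, '\0');

    // Patch just the one u32 slot instead of rewriting the whole vector.
    std::memcpy(&(*new_value)[index * 4], &csum, 4);
    return true;
  }

  const char* Name() const override { return "CsumPatchOperator"; }
};

Wiring it up would be roughly opt.merge_operator =
std::make_shared<CsumPatchOperator>(); the 4K write path would then
db->Merge() the operand onto the csum key instead of read-modify-writing the
whole onode value -- with exactly the costs you mention: a second get on the
read path and extra CPU pushed into compaction.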
> 
> sage
> 
> 
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, Milpitas, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416
> > allen.samuels@xxxxxxxxxxx
> >
> >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > > Sent: Friday, June 10, 2016 2:35 AM
> > > To: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
> > > Cc: Mark Nelson <mnelson@xxxxxxxxxx>; Allen Samuels
> > > <Allen.Samuels@xxxxxxxxxxx>; Manavalan Krishnan
> > > <Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development <ceph-
> > > devel@xxxxxxxxxxxxxxx>
> > > Subject: RE: RocksDB tuning
> > >
> > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > > Sage/Mark,
> > > > I debugged the code and it seems there is no WAL write going on and it
> > > > is working as expected. But, in the process, I found that the onode
> > > > size it is writing in my environment is ~7K!! See this debug print.
> > > >
> > > > 2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> > > >
> > > > This explains why so much data is going to rocksdb, I guess. Once
> > > > compaction kicks in, the iops I am getting are *30 times* slower.
> > > >
> > > > I have 15 OSDs on 8TB drives and I have created a 4TB rbd image
> > > > preconditioned with 1M. I was running a 4K RW test.
> > >
> > > The onode is big because of the csum metadata.  Try setting 'bluestore
> > > csum type = none' and see if that is the entire reason or if something
> > > else is going on.
> > >
> > > We may need to reconsider the way this is stored.
> > >
> > > s
> > >
> > >
> > >
> > >
> > > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> > > > Sent: Thursday, June 09, 2016 8:23 AM
> > > > To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
> > > > Subject: RE: RocksDB tuning
> > > >
> > > > Mark,
> > > > As we discussed, it seems there is ~5X write amp on the system with 4K
> > > > RW. Considering the amount of data going into rocksdb (and thus kicking
> > > > off compaction so fast and degrading performance drastically), it seems
> > > > it is still writing WAL (?). I used the following rocksdb options for
> > > > faster background compaction, hoping it can keep up with upcoming
> > > > writes and writes won't be stalling. But, eventually, after a min or
> > > > so, it is stalling io..
> > > >
> > > > bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
> > > >
> > > > I will try to debug what is going on there..
> > > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> > > > Sent: Thursday, June 09, 2016 6:46 AM
> > > > To: Allen Samuels; Manavalan Krishnan; Ceph Development
> > > > Subject: Re: RocksDB tuning
> > > >
> > > > On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > > > > Hi Allen,
> > > > >
> > > > > On a somewhat related note, I wanted to mention that I had forgotten
> > > > > that chhabaremesh's min_alloc_size commit for different media types
> > > > > was committed into master:
> > > > >
> > > > > https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3efd187
> > > > >
> > > > > IE those tests appear to already have been using a 4K min alloc size
> > > > > due to non-rotational NVMe media.  I went back and verified that
> > > > > explicitly changing the min_alloc size (in fact all of them to be
> > > > > sure) to 4k does not change the behavior from graphs I showed
> > > > > yesterday.  The rocksdb compaction stalls due to excessive reads
> > > > > appear (at least on the surface) to be due to metadata traffic during
> > > > > heavy small random writes.
> > > >
> > > > Sorry, this was worded poorly.  Traffic due to compaction of metadata
> > > > (ie not leaked WAL data) during small random writes.
> > > >
> > > > Mark
> > > >
> > > > >
> > > > > Mark
> > > > >
> > > > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> > > > >> Let's make a patch that creates actual Ceph parameters for these
> > > > >> things so that we don't have to edit the source code in the future.
> > > > >>
> > > > >>
> > > > >> Allen Samuels
> > > > >> SanDisk |a Western Digital brand
> > > > >> 2880 Junction Avenue, San Jose, CA 95134
> > > > >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx
> > > > >>
> > > > >>
> > > > >>> -----Original Message-----
> > > > >>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> > > > >>> owner@xxxxxxxxxxxxxxx] On Behalf Of Manavalan Krishnan
> > > > >>> Sent: Wednesday, June 08, 2016 3:10 PM
> > > > >>> To: Mark Nelson <mnelson@xxxxxxxxxx>; Ceph Development
> > > > >>> <ceph-devel@xxxxxxxxxxxxxxx>
> > > > >>> Subject: RocksDB tuning
> > > > >>>
> > > > >>> Hi Mark
> > > > >>>
> > > > >>> Here are the tunings that we used to avoid the IOPs choppiness
> > > > >>> caused by rocksdb compaction.
> > > > >>>
> > > > >>> We need to add the following options in src/kv/RocksDBStore.cc
> > > > >>> before rocksdb::DB::Open in RocksDBStore::do_open:
> > > > >>>   opt.IncreaseParallelism(16);
> > > > >>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> Thanks
> > > > >>> Mana