RE: RocksDB tuning

Sage Weil <sweil@xxxxxxxxxx> · Fri, 10 Jun 2016 10:54:51 -0400 (EDT)

On Fri, 10 Jun 2016, Allen Samuels wrote:
> Checksums are definitely a part of the problem, but I suspect the 
> smaller part of the problem. This particular use-case (random 4K 
> overwrites without the WAL stuff) is the worst-case from an encoding 
> perspective and highlights the inefficiency in the current code.
> 
> As has been discussed earlier, a specialized encode/decode 
> implementation for these data structures is clearly called for.
> 
> IMO, you'll be able to cut the size of this by AT LEAST a factor of 3 or 
> 4 without a lot of effort. The price will be somewhat increased CPU cost 
> for the serialize/deserialize operation.
> 
> If you think of this as an application-specific data compression 
> problem, here is a short list of potential compression opportunities.
> 
> (1) Encoded sizes and offsets are 8-byte byte values; converting these to
>     block values will drop 9 or 12 bits from each value. Also, the range
>     of these values is usually only 2^22, often much less, meaning that
>     there are 3-5 bytes of zeros at the top of each word that can be dropped.
> (2) Encoded device addresses are often less than 2^32, meaning there are
>     3-4 bytes of zeros at the top of each word that can be dropped.
> (3) Encoded offsets and sizes are often exactly "1" block; clever choices
>     of formatting can eliminate these entirely.
> 
> IMO, an optimized encoded form of the extent table will be around 1/4 of 
> the current encoding (for this use-case) and will likely result in an 
> Onode that's only 1/3 of the size that Somnath is seeing.

That will be true for the lextent and blob extent maps.  I'm guessing 
this is a small part of the ~5K Somnath saw.  If his objects are 4MB 
then 4KB of it (80%) is the csum_data vector, which is a flat vector of 
u32 values that are presumably not very compressible.
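
The arithmetic behind that estimate can be checked with a short Python
sketch. The function name is hypothetical, and the 4 KiB checksum block
size is an assumption inferred from the figures above (a 4MB object
yielding 4KB of u32 values implies one checksum per 4 KiB).

```python
def csum_vector_bytes(object_size: int, csum_block_size: int,
                      csum_value_size: int = 4) -> int:
    """Size in bytes of a flat per-block checksum vector for one object."""
    blocks = -(-object_size // csum_block_size)  # ceiling division
    return blocks * csum_value_size


object_size = 4 * 1024 * 1024      # 4MB object
csum_block_size = 4096             # assumed: one u32 checksum per 4 KiB
onode_size = 5000                  # the ~5K onode figure used above

csum_bytes = csum_vector_bytes(object_size, csum_block_size)
share = csum_bytes / onode_size

print(csum_bytes)                  # 4096
print(f"{share:.0%}")              # 82%
```

So 1024 blocks times 4 bytes gives 4096 bytes, roughly 80% of a ~5K
onode, and a compact extent encoding can only shrink the remainder.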

We could perhaps break these into a separate key or keyspace.  That'll 
give rocksdb a bit more computation work to do (for a custom merge 
operator, probably, to update just a piece of the value) but for a 4KB 
value I'm not sure it's big enough to really help.  Also we'd lose 
locality, would need a second get to load csum metadata on 
read, etc.  :/  I don't really have any good ideas here.
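
A toy Python model of that trade-off, under loud assumptions: ToyKV is
an in-memory stand-in and not RocksDB, merge_u32 only imitates what a
custom merge operator might do, and the ("onode", oid) / ("csum", oid)
key layout is hypothetical.

```python
import struct


class ToyKV:
    """In-memory stand-in for a key/value store that counts get() calls."""

    def __init__(self):
        self._data = {}
        self.gets = 0

    def put(self, key, value: bytes) -> None:
        self._data[key] = value

    def get(self, key) -> bytes:
        self.gets += 1
        return self._data[key]

    def merge_u32(self, key, index: int, value: int) -> None:
        """Overwrite one little-endian u32 inside an existing value.

        Imitates a merge operator: the caller sends only the 4-byte
        update, and the store does the read-modify-write internally.
        """
        buf = bytearray(self._data[key])
        struct.pack_into("<I", buf, 4 * index, value)
        self._data[key] = bytes(buf)


def read_object_metadata(kv: ToyKV, oid: str):
    """Reading now needs two lookups instead of one."""
    onode = kv.get(("onode", oid))
    csums = kv.get(("csum", oid))
    return onode, csums


kv = ToyKV()
kv.put(("onode", "obj1"), b"extent maps and other onode fields")
kv.put(("csum", "obj1"), bytes(4 * 1024))      # 1024 zeroed u32 checksums

# A 4K overwrite of block 7 touches only one checksum slot.
kv.merge_u32(("csum", "obj1"), 7, 0xDEADBEEF)

onode, csums = read_object_metadata(kv, "obj1")
print(kv.gets)                                  # 2
```

The write side sends 4 bytes instead of rewriting the whole onode, but
the store still has to rewrite the 4KB value eventually, and every read
pays for the second get.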

sage

> 
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, Milpitas, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@xxxxxxxxxxx
> 
> 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > Sent: Friday, June 10, 2016 2:35 AM
> > To: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
> > Cc: Mark Nelson <mnelson@xxxxxxxxxx>; Allen Samuels
> > <Allen.Samuels@xxxxxxxxxxx>; Manavalan Krishnan
> > <Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development <ceph-
> > devel@xxxxxxxxxxxxxxx>
> > Subject: RE: RocksDB tuning
> > 
> > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > Sage/Mark,
> > > I debugged the code and it seems there is no WAL write going on and
> > > it is working as expected. But, in the process, I found that the onode
> > > size it is writing in my environment is ~7K! See this debug print.
> > >
> > > 2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> > >
> > > This explains why so much data is going to rocksdb, I guess. Once
> > > compaction kicks in, the IOPS I am getting are *30 times* lower.
> > >
> > > I have 15 OSDs on 8TB drives and I have created a 4TB rbd image
> > > preconditioned with 1M. I was running a 4K RW test.
> > 
> > The onode is big because of the csum metadata.  Try setting 'bluestore csum
> > type = none' and see if that is the entire reason or if something else is going
> > on.
> > 
> > We may need to reconsider the way this is stored.
> > 
> > s
> > 
> > 
> > 
> > 
> > >
> > > Thanks & Regards
> > > Somnath
> > >
> > > -----Original Message-----
> > > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> > > Sent: Thursday, June 09, 2016 8:23 AM
> > > To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
> > > Subject: RE: RocksDB tuning
> > >
> > > Mark,
> > > As we discussed, it seems there is ~5X write amp on the system with 4K
> > > RW. Considering the amount of data going into rocksdb (and thus kicking
> > > off compaction so fast and degrading performance drastically), it seems
> > > it is still writing WAL (?). I also used the following rocksdb options for
> > > faster background compaction, hoping it can keep up with incoming writes
> > > so that writes won't stall. But eventually, after a minute or so, it
> > > stalls IO.
> > >
> > > bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
> > >
> > > I will try to debug what is going on there..
> > >
> > > Thanks & Regards
> > > Somnath
> > >
> > > -----Original Message-----
> > > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> > > Sent: Thursday, June 09, 2016 6:46 AM
> > > To: Allen Samuels; Manavalan Krishnan; Ceph Development
> > > Subject: Re: RocksDB tuning
> > >
> > > On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > > > Hi Allen,
> > > >
> > > > On a somewhat related note, I wanted to mention that I had forgotten
> > > > that chhabaremesh's min_alloc_size commit for different media types
> > > > was committed into master:
> > > >
> > > >
> > > > > https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3efd187
> > > >
> > > >
> > > > IE those tests appear to already have been using a 4K min alloc size
> > > > due to non-rotational NVMe media.  I went back and verified that
> > > > explicitly changing the min_alloc size (in fact all of them to be
> > > > sure) to 4k does not change the behavior from graphs I showed
> > > > > yesterday.  The rocksdb compaction stalls due to excessive reads
> > > > > appear (at least on the surface) to be due to metadata traffic
> > > > > during heavy small random writes.
> > >
> > > > Sorry, this was worded poorly.  Traffic due to compaction of metadata
> > > > (ie not leaked WAL data) during small random writes.
> > >
> > > Mark
> > >
> > > >
> > > > Mark
> > > >
> > > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> > > >> Let's make a patch that creates actual Ceph parameters for these
> > > >> things so that we don't have to edit the source code in the future.
> > > >>
> > > >>
> > > >> Allen Samuels
> > > >> SanDisk |a Western Digital brand
> > > >> 2880 Junction Avenue, San Jose, CA 95134
> > > >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx
> > > >>
> > > >>
> > > >>> -----Original Message-----
> > > >>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> > > >>> owner@xxxxxxxxxxxxxxx] On Behalf Of Manavalan Krishnan
> > > >>> Sent: Wednesday, June 08, 2016 3:10 PM
> > > >>> To: Mark Nelson <mnelson@xxxxxxxxxx>; Ceph Development <ceph-
> > > >>> devel@xxxxxxxxxxxxxxx>
> > > >>> Subject: RocksDB tuning
> > > >>>
> > > >>> Hi Mark
> > > >>>
> > > >>> Here are the tunings that we used to avoid the IOPs choppiness
> > > >>> caused by rocksdb compaction.
> > > >>>
> > > > >>> We need to add the following options in src/kv/RocksDBStore.cc
> > > > >>> before rocksdb::DB::Open in RocksDBStore::do_open:
> > > > >>>
> > > > >>>   opt.IncreaseParallelism(16);
> > > > >>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> > > >>>
> > > >>>
> > > >>>
> > > >>> Thanks
> > > >>> Mana
> > > >>>
> > > >>>
> > > >>>>
> > > >>>
> > > >>> PLEASE NOTE: The information contained in this electronic mail
> > > >>> message is intended only for the use of the designated
> > > >>> recipient(s) named above.
> > > >>> If the
> > > >>> reader of this message is not the intended recipient, you are
> > > >>> hereby notified that you have received this message in error and
> > > >>> that any review, dissemination, distribution, or copying of this
> > > >>> message is strictly prohibited. If you have received this
> > > >>> communication in error, please notify the sender by telephone or
> > > >>> e-mail (as shown
> > > >>> above) immediately and destroy any and all copies of this message
> > > >>> in your possession (whether hard copies or electronically stored
> > > >>> copies).
> > > >>> --
> > > > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > >>> in the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > >> --
> > > >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > >> in the body of a message to majordomo@xxxxxxxxxxxxxxx More
> > > >> majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > >>
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in the body of a message to majordomo@xxxxxxxxxxxxxxx More
> > majordomo
> > > > info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > in the body of a message to majordomo@xxxxxxxxxxxxxxx More
> > majordomo
> > > info at  http://vger.kernel.org/majordomo-info.html