I believe that the extent map in the old onode design is quite inefficient in its encoding when you're doing 4KB overwrites. I believe this can be significantly improved with a modest bit of low-risk work. I haven't pushed on it yet, because these data structures have been completely redone with the compression stuff.

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

> -----Original Message-----
> From: Somnath Roy
> Sent: Thursday, June 09, 2016 10:06 PM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> Cc: Mark Nelson <mnelson@xxxxxxxxxx>; Manavalan Krishnan <Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: RE: RocksDB tuning
>
> I don't think we saw this big an inode size with the old bluestore code during the ZS integration. Also, the client throughput I am getting now is different from the old code.
> Will try with a 2MB stripe size and update.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Allen Samuels
> Sent: Thursday, June 09, 2016 7:15 PM
> To: Somnath Roy
> Cc: Mark Nelson; Manavalan Krishnan; Ceph Development
> Subject: Re: RocksDB tuning
>
> Yes, we've seen this phenomenon with the ZetaScale work and it has been discussed before. Fundamentally, I believe that the legacy 4MB stripe size will need to be modified, along with some attention to efficient inode encoding.
>
> Can you retry with a 2MB stripe size? That should drop the inode size roughly in half.
>
> Sent from my iPhone. Please excuse all typos and autocorrects.
>
> > On Jun 9, 2016, at 7:11 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
> >
> > Yes Allen..
> >
> > -----Original Message-----
> > From: Allen Samuels
> > Sent: Thursday, June 09, 2016 7:09 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; Manavalan Krishnan; Ceph Development
> > Subject: Re: RocksDB tuning
> >
> > You are doing random 4K writes to an rbd device, right?
> >
> > Sent from my iPhone. Please excuse all typos and autocorrects.
> >
> >> On Jun 9, 2016, at 7:06 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
> >>
> >> Sage/Mark,
> >> I debugged the code and it seems there is no WAL write going on; that part is working as expected. But, in the process, I found that the onode size it is writing in my environment is ~7K!! See this debug print.
> >>
> >> 2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0) onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> >>
> >> This explains why so much data is going to rocksdb, I guess. Once compaction kicks in, the iops I am getting are *30 times* slower.
> >>
> >> I have 15 OSDs on 8TB drives and I have created a 4TB rbd image preconditioned with 1M. I was running a 4K RW test.
> >>
> >> Thanks & Regards
> >> Somnath
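
(A back-of-envelope sketch of the arithmetic behind the ~7.5KB onode reported above and Allen's "roughly in half" estimate. The per-extent encoding cost is an assumption inferred from the reported size, not a number taken from the thread.)

    // Standalone illustration (not Ceph code): worst-case extent count for an
    // RBD object fully fragmented by random 4KB overwrites, and the resulting
    // extent-map size at an assumed per-extent encoding cost.
    #include <cstdio>

    int main() {
      const unsigned object_size      = 4u << 20; // default 4MB RBD object ("stripe")
      const unsigned overwrite_size   = 4u << 10; // 4KB random overwrites
      const unsigned bytes_per_extent = 7;        // assumed average encoded size per extent
      unsigned extents = object_size / overwrite_size;  // 1024 extents in the worst case
      std::printf("extents=%u, extent map ~= %u bytes\n",
                  extents, extents * bytes_per_extent);  // ~7KB, consistent with the 7518-byte onode
      // Halving the object size to 2MB halves the worst-case extent count,
      // which is why a 2MB stripe should roughly halve the onode size.
      return 0;
    }
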
> >> -----Original Message-----
> >> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> >> Sent: Thursday, June 09, 2016 8:23 AM
> >> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
> >> Subject: RE: RocksDB tuning
> >>
> >> Mark,
> >> As we discussed, it seems there is ~5X write amp on the system with 4K RW. Considering the amount of data going into rocksdb (and thus kicking off compaction so quickly and degrading performance drastically), it seems it is still writing the WAL (?). I used the following rocksdb options for faster background compaction as well, hoping it could keep up with the incoming writes so that writes wouldn't stall. But eventually, after a minute or so, it is stalling io.
> >>
> >> bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
> >>
> >> I will try to debug what is going on there.
> >>
> >> Thanks & Regards
> >> Somnath
> >>
> >> -----Original Message-----
> >> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> >> Sent: Thursday, June 09, 2016 6:46 AM
> >> To: Allen Samuels; Manavalan Krishnan; Ceph Development
> >> Subject: Re: RocksDB tuning
> >>
> >>> On 06/09/2016 08:37 AM, Mark Nelson wrote:
> >>> Hi Allen,
> >>>
> >>> On a somewhat related note, I wanted to mention that I had forgotten that chhabaremesh's min_alloc_size commit for different media types was committed into master:
> >>>
> >>> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3efd187
> >>>
> >>> I.e., those tests appear to already have been using a 4K min alloc size due to the non-rotational NVMe media. I went back and verified that explicitly changing the min_alloc size (in fact all of them, to be sure) to 4K does not change the behavior from the graphs I showed yesterday. The rocksdb compaction stalls due to excessive reads appear (at least on the surface) to be due to metadata traffic during heavy small random writes.
> >>
> >> Sorry, this was worded poorly. Traffic due to compaction of metadata (i.e., not leaked WAL data) during small random writes.
> >>
> >> Mark
> >>
> >>>
> >>> Mark
> >>>
> >>>> On 06/08/2016 06:52 PM, Allen Samuels wrote:
> >>>> Let's make a patch that creates actual Ceph parameters for these things so that we don't have to edit the source code in the future.
> >>>>
> >>>> Allen Samuels
> >>>> SanDisk |a Western Digital brand
> >>>> 2880 Junction Avenue, San Jose, CA 95134
> >>>> T: +1 408 801 7030| M: +1 408 780 6416
> >>>> allen.samuels@xxxxxxxxxxx
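
(A minimal sketch of the kind of patch Allen is suggesting: carry the two RocksDB calls from Mana's message quoted below as configurable values instead of hard-coding them in RocksDBStore::do_open. The helper name and the option names in the comments are illustrative assumptions, not actual Ceph options or the eventual patch.)

    // Hypothetical sketch: apply configurable compaction tunings to the
    // rocksdb::Options object before it is passed to rocksdb::DB::Open().
    #include <rocksdb/options.h>
    #include <cstdint>

    void apply_compaction_tunables(rocksdb::Options& opt,
                                   int parallelism,           // e.g. a "rocksdb_parallelism" option, 16
                                   uint64_t memtable_budget)  // e.g. a "rocksdb_memtable_budget" option, 512MB
    {
      if (parallelism > 0)
        opt.IncreaseParallelism(parallelism);              // sizes the background flush/compaction thread pools
      if (memtable_budget > 0)
        opt.OptimizeLevelStyleCompaction(memtable_budget); // derives write buffer and level sizes from the budget
    }
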
> >>>>> -----Original Message-----
> >>>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Manavalan Krishnan
> >>>>> Sent: Wednesday, June 08, 2016 3:10 PM
> >>>>> To: Mark Nelson <mnelson@xxxxxxxxxx>; Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>
> >>>>> Subject: RocksDB tuning
> >>>>>
> >>>>> Hi Mark,
> >>>>>
> >>>>> Here are the tunings that we used to avoid the IOPS choppiness caused by rocksdb compaction.
> >>>>>
> >>>>> We need to add the following options in src/kv/RocksDBStore.cc, in RocksDBStore::do_open before the call to rocksdb::DB::Open:
> >>>>>
> >>>>>   opt.IncreaseParallelism(16);
> >>>>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> >>>>>
> >>>>> Thanks
> >>>>> Mana
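
(A self-contained sketch of the placement Mana describes: set the two tunings on the Options object before handing it to rocksdb::DB::Open(), as RocksDBStore::do_open does. This is a standalone illustration against the stock RocksDB API, not the Ceph code itself; the database path is made up.)

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>
    #include <cassert>

    int main() {
      rocksdb::Options opt;
      opt.create_if_missing = true;
      opt.IncreaseParallelism(16);                         // more background flush/compaction threads
      opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024); // ~512MB memtable budget for level-style compaction
      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(opt, "/tmp/rocksdb-tuning-test", &db); // illustrative path
      assert(s.ok());
      delete db;
      return 0;
    }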