Oh, and use 16-bit checksums :)

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Friday, June 10, 2016 7:55 AM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> Cc: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; Mark Nelson
> <mnelson@xxxxxxxxxx>; Manavalan Krishnan
> <Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development
> <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: RE: RocksDB tuning
>
> On Fri, 10 Jun 2016, Allen Samuels wrote:
> > Checksums are definitely a part of the problem, but I suspect the
> > smaller part of the problem. This particular use-case (random 4K
> > overwrites without the WAL stuff) is the worst-case from an encoding
> > perspective and highlights the inefficiency in the current code.
> >
> > As has been discussed earlier, a specialized encode/decode
> > implementation for these data structures is clearly called for.
> >
> > IMO, you'll be able to cut the size of this by AT LEAST a factor of 3
> > or 4 without a lot of effort. The price will be a somewhat increased
> > CPU cost for the serialize/deserialize operation.
> >
> > If you think of this as an application-specific data compression
> > problem, here is a short list of potential compression opportunities.
> >
> > (1) Encoded sizes and offsets are 8-byte values; converting these to
> > block values will drop 9 or 12 bits from each value. Also, the range of
> > these values is usually only 2^22 -- often much less. Meaning that
> > there's 3-5 bytes of zeros at the top of each word that can be dropped.
> > (2) Encoded device addresses are often less than 2^32, meaning there's
> > 3-4 bytes of zeros at the top of each word that can be dropped.
> > (3) Encoded offsets and sizes are often exactly "1" block; clever
> > choices of formatting can eliminate these entirely.
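To make points (1) and (3) above concrete, here is a minimal sketch of one way such an encoding could look. It is not BlueStore's actual encoder: the Extent struct, the function names and the fixed 4 KiB block size are all assumptions made for illustration. Values are converted to block units and written as LEB128-style varints so the zero bytes at the top of each word are never stored, and the length is stored minus one so the common one-block case costs a single zero byte (a flag bit could remove even that byte, which this sketch does not attempt).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical extent record; field names are illustrative, not BlueStore's.
struct Extent {
  uint64_t offset;  // byte offset, assumed block-aligned
  uint64_t length;  // byte length, assumed a non-zero multiple of the block size
};

constexpr unsigned kBlockShift = 12;  // assumption: 4 KiB blocks (2^12 bytes)

// Append v as a varint: 7 payload bits per byte, low bits first,
// high bit set on every byte except the last.
inline void put_varint(std::vector<uint8_t>& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<uint8_t>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<uint8_t>(v));
}

// Decode one varint starting at pos and advance pos past it.
// in.at() throws std::out_of_range if the input is truncated.
inline uint64_t get_varint(const std::vector<uint8_t>& in, size_t& pos) {
  uint64_t v = 0;
  unsigned shift = 0;
  while (true) {
    assert(shift < 64);  // more than 10 bytes would be malformed input
    uint8_t b = in.at(pos++);
    v |= static_cast<uint64_t>(b & 0x7f) << shift;
    if (!(b & 0x80)) break;
    shift += 7;
  }
  return v;
}

// Store the offset in blocks and (length in blocks - 1).
inline void encode_extent(std::vector<uint8_t>& out, const Extent& e) {
  assert(e.offset % (1ull << kBlockShift) == 0);
  assert(e.length != 0 && e.length % (1ull << kBlockShift) == 0);
  put_varint(out, e.offset >> kBlockShift);
  put_varint(out, (e.length >> kBlockShift) - 1);
}

// Inverse of encode_extent.
inline Extent decode_extent(const std::vector<uint8_t>& in, size_t& pos) {
  Extent e;
  e.offset = get_varint(in, pos) << kBlockShift;
  e.length = (get_varint(in, pos) + 1) << kBlockShift;
  return e;
}
```

With these assumptions, a one-block extent at byte offset 0x3ff000 takes 3 bytes, versus 16 bytes for two raw 8-byte fields. The cost is the extra shifting and looping on every serialize/deserialize, which is the CPU trade-off mentioned above.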
> >
> > IMO, an optimized encoded form of the extent table will be around 1/4
> > of the current encoding (for this use-case) and will likely result in
> > an Onode that's only 1/3 of the size that Somnath is seeing.
>
> That will be true for the lextent and blob extent maps. I'm guessing this
> is a small part of the ~5K somnath saw. If his objects are 4MB then 4KB of
> it (80%) is the csum_data vector, which is a flat vector of u32 values
> that are presumably not very compressible.
>
> We could perhaps break these into a separate key or keyspace.. That'll
> give rocksdb a bit more computation work to do (for a custom merge
> operator, probably, to update just a piece of the value) but for a 4KB
> value I'm not sure it's big enough to really help. Also we'd lose
> locality, would need a second get to load csum metadata on read, etc. :/
> I don't really have any good ideas here.
>
> sage
>
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, Milpitas, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416
> > allen.samuels@xxxxxxxxxxx
> >
> >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > > Sent: Friday, June 10, 2016 2:35 AM
> > > To: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
> > > Cc: Mark Nelson <mnelson@xxxxxxxxxx>; Allen Samuels
> > > <Allen.Samuels@xxxxxxxxxxx>; Manavalan Krishnan
> > > <Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development
> > > <ceph-devel@xxxxxxxxxxxxxxx>
> > > Subject: RE: RocksDB tuning
> > >
> > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > > Sage/Mark,
> > > > I debugged the code and it seems there is no WAL write going on and
> > > > it is working as expected. But, in the process, I found that the
> > > > onode size it is writing in my environment is ~7K !! See this debug
> > > > print.
> > > >
> > > > 2016-06-09 15:49:24.710149 7f7732fe3700 20
> > > > bluestore(/var/lib/ceph/osd/ceph-0) onode
> > > > #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> > > >
> > > > This explains why so much data is going to rocksdb, I guess. Once
> > > > compaction kicks in, the iops I am getting are *30 times* slower.
> > > >
> > > > I have 15 osds on 8TB drives and I have created a 4TB rbd image
> > > > preconditioned with 1M. I was running a 4K RW test.
> > >
> > > The onode is big because of the csum metadata. Try setting
> > > 'bluestore csum type = none' and see if that is the entire reason or
> > > if something else is going on.
> > >
> > > We may need to reconsider the way this is stored.
> > >
> > > s
> > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> > > > Sent: Thursday, June 09, 2016 8:23 AM
> > > > To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
> > > > Subject: RE: RocksDB tuning
> > > >
> > > > Mark,
> > > > As we discussed, it seems there is ~5X write amp on the system with
> > > > 4K RW. Considering the amount of data going into rocksdb (and thus
> > > > kicking off compaction so fast and degrading performance
> > > > drastically), it seems it is still writing WAL (?).. I used the
> > > > following rocksdb option for faster background compaction as well,
> > > > hoping it can keep up with incoming writes so that writes won't
> > > > stall. But, eventually, after a min or so, it is stalling io..
> > > >
> > > > bluestore_rocksdb_options =
> > > > "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
> > > >
> > > > I will try to debug what is going on there..
> > > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> > > > Sent: Thursday, June 09, 2016 6:46 AM
> > > > To: Allen Samuels; Manavalan Krishnan; Ceph Development
> > > > Subject: Re: RocksDB tuning
> > > >
> > > > On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > > > > Hi Allen,
> > > > >
> > > > > On a somewhat related note, I wanted to mention that I had forgotten
> > > > > that chhabaremesh's min_alloc_size commit for different media types
> > > > > was committed into master:
> > > > >
> > > > > https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3efd187
> > > > >
> > > > > IE those tests appear to already have been using a 4K min alloc size
> > > > > due to non-rotational NVMe media. I went back and verified that
> > > > > explicitly changing the min_alloc size (in fact all of them to be
> > > > > sure) to 4k does not change the behavior from graphs I showed
> > > > > yesterday. The rocksdb compaction stalls due to excessive reads
> > > > > appear (at least on the surface) to be due to metadata traffic
> > > > > during heavy small random writes.
> > > >
> > > > Sorry, this was worded poorly.
> > > > Traffic due to compaction of metadata (ie not leaked WAL data)
> > > > during small random writes.
> > > >
> > > > Mark
> > > >
> > > > >
> > > > > Mark
> > > > >
> > > > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> > > > >> Let's make a patch that creates actual Ceph parameters for these
> > > > >> things so that we don't have to edit the source code in the future.
> > > > >>
> > > > >>
> > > > >> Allen Samuels
> > > > >> SanDisk |a Western Digital brand
> > > > >> 2880 Junction Avenue, San Jose, CA 95134
> > > > >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx
> > > > >>
> > > > >>
> > > > >>> -----Original Message-----
> > > > >>> From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > > > >>> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Manavalan Krishnan
> > > > >>> Sent: Wednesday, June 08, 2016 3:10 PM
> > > > >>> To: Mark Nelson <mnelson@xxxxxxxxxx>; Ceph Development
> > > > >>> <ceph-devel@xxxxxxxxxxxxxxx>
> > > > >>> Subject: RocksDB tuning
> > > > >>>
> > > > >>> Hi Mark
> > > > >>>
> > > > >>> Here are the tunings that we used to avoid the IOPs choppiness
> > > > >>> caused by rocksdb compaction.
> > > > >>>
> > > > >>> We need to add the following options in src/kv/RocksDBStore.cc
> > > > >>> before rocksdb::DB::Open in RocksDBStore::do_open:
> > > > >>>
> > > > >>>   opt.IncreaseParallelism(16);
> > > > >>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> > > > >>>
> > > > >>> Thanks
> > > > >>> Mana
> > > > >>>
> > > > >>> PLEASE NOTE: The information contained in this electronic mail
> > > > >>> message is intended only for the use of the designated
> > > > >>> recipient(s) named above. If the reader of this message is not the
> > > > >>> intended recipient, you are hereby notified that you have received
> > > > >>> this message in error and that any review, dissemination,
> > > > >>> distribution, or copying of this message is strictly prohibited.
> > > > >>> If you have received this communication in error, please notify
> > > > >>> the sender by telephone or e-mail (as shown above) immediately and
> > > > >>> destroy any and all copies of this message in your possession
> > > > >>> (whether hard copies or electronically stored copies).
> > > > >>> --
> > > > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > >>> in the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html