RE: RocksDB tuning

For a flash-based system we want to shrink the stripe size until we get an onode that's sufficiently "small". With RocksDB as the metadata store, "small" means the size that minimizes the number of log writes (but not any smaller). It'll take a bit of energy to figure out what this number should be: we'll need to understand the sizes of the other KV pairs that are being committed at the same time and then take into account the packing algorithm in the level-0 log file. With ZetaScale it'll be different (and easier to compute): the oNode size shouldn't be larger than a ZS block (8K by default).
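To make the sizing question concrete, something like the sketch below is the computation we'd be doing; the block size, per-record overhead, and companion-KV sizes are illustrative assumptions, not measurements of the actual RocksDB or ZetaScale packing:

#include <cstdint>
#include <iostream>

// Rough sketch: how many log blocks does one commit batch touch, given the
// encoded onode size, the other KV bytes committed alongside it, and the
// store's block size?  All of the numbers fed in below are hypothetical.
uint64_t log_blocks_for_batch(uint64_t onode_bytes,
                              uint64_t other_kv_bytes,
                              uint64_t block_size,
                              uint64_t record_overhead = 64) {
  uint64_t batch_bytes = onode_bytes + other_kv_bytes + record_overhead;
  return (batch_bytes + block_size - 1) / block_size;  // round up
}

int main() {
  // With an 8K block (the ZS default mentioned above), a 7K onode plus ~2K
  // of companion keys spills into a second block; a 3K onode does not.
  std::cout << log_blocks_for_batch(7 * 1024, 2 * 1024, 8192) << "\n";  // 2
  std::cout << log_blocks_for_batch(3 * 1024, 2 * 1024, 8192) << "\n";  // 1
  return 0;
}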

I don't see any point in spending energy on quantifying this until we've finished a concerted effort to "shrink" the onode encoding overhead.

The shrink really has two parts. One part is relatively straightforward code to develop an efficient representation of the lextent/blob/pextent structures that's optimized for the expected use cases. The other part is behavioral changes in the write-path code (like preventing too many accumulated overwrites, as Sage discussed earlier).

The first part is a pretty straightforward problem once you identify the cases that you want to optimize for. The second part is likely to be more subtle and rely on machinery that hasn't been fully implemented yet.
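For the first (encoding) part, here's a minimal sketch of the kind of representation I have in mind: block-granular deltas plus a variable-length integer encoding, so the common cases (adjacent extents, one-block lengths) cost a byte or two instead of two fixed 8-byte fields. The Extent struct, block size, and varint format below are illustrative stand-ins, not the actual BlueStore types:

#include <cstdint>
#include <vector>

struct Extent { uint64_t offset; uint64_t length; };  // byte-granular input

// Little-endian base-128 varint: small values cost a single byte.
static void put_varint(std::vector<uint8_t>& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back(uint8_t(v) | 0x80);
    v >>= 7;
  }
  out.push_back(uint8_t(v));
}

// Encode a sorted, non-overlapping extent list as (gap, length) pairs
// expressed in blocks.  For random 4K overwrites the gap is usually 0 and
// the length is usually 1, so each extent costs ~2 bytes instead of 16.
std::vector<uint8_t> encode_extents(const std::vector<Extent>& extents,
                                    uint64_t block = 4096) {
  std::vector<uint8_t> out;
  uint64_t prev_end = 0;
  for (const auto& e : extents) {
    put_varint(out, (e.offset - prev_end) / block);  // gap from previous extent
    put_varint(out, e.length / block);               // length in blocks
    prev_end = e.offset + e.length;
  }
  return out;
}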

I would recommend that, short term, we focus on the first part and see how far that gets us. Also, I suspect there's more to be learned on the behavioral front before we can conclusively decide what the right action is there.
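For the behavioral side, the check Sage described would look roughly like the sketch below: flatten a logical range once it is covered by more than N distinct blobs. The structures and the threshold are hypothetical stand-ins for the real extent map, just to show the shape of the test:

#include <cstdint>
#include <map>
#include <set>

struct LExtent { uint64_t length; int blob_id; };
// logical offset -> extent; assumed sorted and non-overlapping
using ExtentMap = std::map<uint64_t, LExtent>;

// Return true if [off, off+len) is served by more than max_layers distinct
// blobs -- a simple proxy for "too many accumulated overwrites" -- meaning a
// read+write compaction of the range is probably worthwhile.
bool should_flatten(const ExtentMap& m, uint64_t off, uint64_t len,
                    unsigned max_layers = 2) {
  std::set<int> blobs;
  auto it = m.upper_bound(off);
  if (it != m.begin()) --it;
  for (; it != m.end() && it->first < off + len; ++it) {
    if (it->first + it->second.length > off)
      blobs.insert(it->second.blob_id);
    if (blobs.size() > max_layers)
      return true;
  }
  return false;
}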

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx


> -----Original Message-----
> From: Somnath Roy
> Sent: Saturday, June 11, 2016 9:35 AM
> To: Sage Weil <sweil@xxxxxxxxxx>
> Cc: Igor Fedotov <ifedotov@xxxxxxxxxxxx>; Allen Samuels
> <Allen.Samuels@xxxxxxxxxxx>; Mark Nelson <mnelson@xxxxxxxxxx>;
> Manavalan Krishnan <Manavalan.Krishnan@xxxxxxxxxxx>; Ceph
> Development <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: RE: RocksDB tuning
> 
> +devl
> Yes Sage, makes sense.. I will try that and will also reduce the object size to
> 2MB as Allen suggested and see the effect.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Saturday, June 11, 2016 6:18 AM
> To: Somnath Roy
> Cc: Igor Fedotov; Allen Samuels; Mark Nelson; Manavalan Krishnan
> Subject: RE: RocksDB tuning
> 
> On Sat, 11 Jun 2016, Somnath Roy wrote:
> > Removing devl as I couldn't attach the graph..
> >
> >
> >
> > Please find the graph attached for 4K RW..
> >
> > I turned off crc but the onode size is still in the 6K-9K range (checked randomly)..
> 
> Here's a simple test... remove the kNoCompression option from the rocksdb
> options string and see if the compaction is more manageable if snappy has a
> go at it.
> 
> sage
> 
> 
>  >
> > Performance is similar..
> >
> >
> >
> >
> >
> > [graph attached: 4K RW performance]
> >
> >
> >
> >
> >
> > Ran 10 jobs, each at peak giving ~4K, so the aggregated output at peak
> > is ~40K… But see the choppiness..
> >
> >
> >
> > Thanks & Regards
> >
> > Somnath
> >
> >
> >
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Friday, June 10, 2016 2:12 PM
> > To: 'Sage Weil'; Igor Fedotov
> > Cc: Allen Samuels; Mark Nelson; Manavalan Krishnan; Ceph Development
> > Subject: RE: RocksDB tuning
> >
> >
> >
> > Sage,
> >
> > By default 'bluestore_compression' is set to none with latest code. I
> > will recreate the cluster with checksum off and see..
> >
> > BTW, do I really need to mkfs, or should creating a new image (after
> > restarting osds with checksum off) suffice, since onodes will be
> > created during image writes?
> >
> >
> >
> > Thanks & Regards
> >
> > Somnath
> >
> >
> >
> > -----Original Message-----
> >
> > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> >
> > Sent: Friday, June 10, 2016 11:19 AM
> >
> > To: Igor Fedotov
> >
> > Cc: Allen Samuels; Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph
> > Development
> >
> > Subject: Re: RocksDB tuning
> >
> >
> >
> > On Fri, 10 Jun 2016, Igor Fedotov wrote:
> >
> > > An update:
> >
> > >
> >
> > > I found that my previous results were invalid -
> > > SyntheticWorkloadState
> >
> > > had an odd swap for offset > len case... Made a brief fix.
> >
> > >
> >
> > > Now onode size with csum rises up to 38K; without csum, 28K.
> >
> > >
> >
> > > For the csum case there are 350 lextents and about 170 blobs
> >
> > >
> >
> > > For no csum - 343 lextents and about 170 blobs.
> >
> > >
> >
> > > (blob counting is very inaccurate!)
> >
> > >
> >
> > > Potentially we shouldn't have >64 blobs per 4M, so it looks like there are some
> >
> > > issues in the write path...
> >
> >
> >
> > Synthetic randomly twiddles alloc hints, which means some of those
> > blobs are probably getting compressed.  I suspect if you set
> > 'bluestore compression = none' it'll drop back down to 64.
> >
> >
> >
> > There is still a problem with compression, though.  I think the write
> > path should look at whether we are obscuring an existing blob with
> > more than N layers (where N is probably 2?) and if so do a read+write
> > 'compaction' to flatten it.  That (or something like it) should get us
> > a ~2x bound on the worst case lextent count (in this case ~128)...
> >
> >
> >
> > sage
> >
> >
> >
> > >
> >
> > > And the csum vs. no-csum difference looks pretty consistent: 170 blobs *
> >
> > > 4 bytes * 16 values = 10880.
> >
> > >
> >
> > > The branch at github has been updated with corresponding fixes.
> >
> > >
> >
> > > Thanks,
> >
> > > Igor.
> >
> > >
> >
> > > On 10.06.2016 19:06, Allen Samuels wrote:
> >
> > > > Let's see, 4MB is 2^22 bytes. If we store a checksum for each 2^12
> >
> > > > bytes, that's 2^10 checksums at 2^2 bytes each, i.e. 2^12 bytes.
> >
> > > >
> >
> > > > So with optimal encoding, the checksum baggage shouldn't be more
> >
> > > > than 4KB per oNode.
> >
> > > >
> >
> > > > But you're seeing 13K as the upper bound on the onode size.
> >
> > > >
> >
> > > > In the worst case, you'll need at least another block address (8 bytes
> >
> > > > currently) and a length (another 8 bytes) [though as I point out, the
> >
> > > > length is something that can be optimized out]. So worst case, this
> >
> > > > encoding would be an additional 16KB per onode.
> >
> > > >
> >
> > > > I suspect you're not at the worst-case yet :)
> >
> > > >
> >
> > > > Allen Samuels
> >
> > > > SanDisk |a Western Digital brand
> >
> > > > 2880 Junction Avenue, Milpitas, CA 95134
> >
> > > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx
> >
> > > >
> >
> > > >
> >
> > > > > -----Original Message-----
> >
> > > > > From: Igor Fedotov [mailto:ifedotov@xxxxxxxxxxxx]
> >
> > > > > Sent: Friday, June 10, 2016 8:58 AM
> >
> > > > > To: Sage Weil <sweil@xxxxxxxxxx>; Somnath Roy
> >
> > > > > <Somnath.Roy@xxxxxxxxxxx>
> >
> > > > > Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Mark Nelson
> >
> > > > > <mnelson@xxxxxxxxxx>; Manavalan Krishnan
> >
> > > > > <Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development <ceph-
> >
> > > > > devel@xxxxxxxxxxxxxxx>
> >
> > > > > Subject: Re: RocksDB tuning
> >
> > > > >
> >
> > > > > Just modified store_test synthetic test case to simulate many
> >
> > > > > random 4K writes to a 4M object.
> >
> > > > >
> >
> > > > > With default settings (crc32c + 4K block) onode size varies from
> >
> > > > > 2K to ~13K;
> >
> > > > > with crc disabled it's ~500 - 1300 bytes.
> >
> > > > >
> >
> > > > >
> >
> > > > > Hence the root cause seems to be in the csum array.
> >
> > > > >
> >
> > > > >
> >
> > > > > Here is the updated branch:
> >
> > > > >
> >
> > > > > https://github.com/ifed01/ceph/tree/wip-bluestore-test-size
> >
> > > > >
> >
> > > > >
> >
> > > > > Thanks,
> >
> > > > >
> >
> > > > > Igor
> >
> > > > >
> >
> > > > >
> >
> > > > > On 10.06.2016 18:40, Sage Weil wrote:
> >
> > > > > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> >
> > > > > > > Just turning off checksum with the below params is not helping;
> >
> > > > > > > I still need to see the onode size though by enabling debug.. Do I need to mkfs
> >
> > > > > > > (Sage?) as it is still holding checksums of the old data I wrote?
> >
> > > > > > Yeah.. you'll need to mkfs to blow away the old onodes and
> > > > > > blobs
> >
> > > > > > with csum data.
> >
> > > > > >
> >
> > > > > > As Allen pointed out, this is only part of the problem.. but
> > > > > > I'm
> >
> > > > > > curious how much!
> >
> > > > > >
> >
> > > > > > >           bluestore_csum = false
> >
> > > > > > >           bluestore_csum_type = none
> >
> > > > > > >
> >
> > > > > > > Here is the snippet of 'dstat'..
> >
> > > > > > >
> >
> > > > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging--
> > > > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out
> > > > > > >    41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0
> > > > > > >    42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0
> > > > > > >    40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0
> > > > > > >    40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0
> > > > > > >    42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0
> > > > > > >    35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0
> > > > > > >    31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0
> > > > > > >    39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0
> > > > > > >    40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0
> > > > > > >    40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0
> > > > > > >    42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0
> >
> > > > > > > For example, the last entry is saying that the system (with 8
> >
> > > > > > > osds) is
> >
> > > > > receiving 216M of data over network and in response to that it
> > > > > is
> >
> > > > > writing total of 852M of data and reading 143M of data. At this
> >
> > > > > time FIO on client side is reporting ~35K 4K RW iops.
> >
> > > > > > > Now, after a min or so, the throughput goes down to barely
> > > > > > > 1K
> >
> > > > > > > from FIO
> >
> > > > > (and very bumpy) and here is the 'dstat' snippet at that time..
> >
> > > > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging--
> > > > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out
> > > > > > >     2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0
> > > > > > >     2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0
> > > > > > >     3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0
> > > > > > >     2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0
> >
> > > > > > >
> >
> > > > > > > So, the system is barely receiving anything (~2M) but still
> >
> > > > > > > writing ~54M of data
> >
> > > > > and reading 226M of data from disk.
> >
> > > > > > > After killing fio script , here is the 'dstat' output..
> >
> > > > > > >
> >
> > > > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging--
> > > > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out
> > > > > > >     2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0
> > > > > > >     2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0
> > > > > > >     2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0
> > > > > > >     2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0
> >
> > > > > > >
> >
> > > > > > > Not receiving anything from the client, but still writing 78M of
> >
> > > > > > > data and reading 206M.
> >
> > > > > > > Clearly, it is an effect of rocksdb compaction that is stalling
> >
> > > > > > > IO, and even though we increased compaction threads (and other tuning),
> >
> > > > > > > compaction is not able to keep up with the incoming IO.
> >
> > > > > > > Thanks & Regards
> >
> > > > > > > Somnath
> >
> > > > > > >
> >
> > > > > > > -----Original Message-----
> >
> > > > > > > From: Allen Samuels
> >
> > > > > > > Sent: Friday, June 10, 2016 8:06 AM
> >
> > > > > > > To: Sage Weil
> >
> > > > > > > Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph
> >
> > > > > > > Development
> >
> > > > > > > Subject: RE: RocksDB tuning
> >
> > > > > > >
> >
> > > > > > > > -----Original Message-----
> >
> > > > > > > > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> >
> > > > > > > > Sent: Friday, June 10, 2016 7:55 AM
> >
> > > > > > > > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> >
> > > > > > > > Cc: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; Mark Nelson
> >
> > > > > > > > <mnelson@xxxxxxxxxx>; Manavalan Krishnan
> >
> > > > > > > > <Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development
> <ceph-
> >
> > > > > > > > devel@xxxxxxxxxxxxxxx>
> >
> > > > > > > > Subject: RE: RocksDB tuning
> >
> > > > > > > >
> >
> > > > > > > > On Fri, 10 Jun 2016, Allen Samuels wrote:
> >
> > > > > > > > > Checksums are definitely a part of the problem, but I
> >
> > > > > > > > > suspect the smaller part of the problem. This particular
> >
> > > > > > > > > use-case (random 4K overwrites without the WAL stuff) is
> >
> > > > > > > > > the worst-case from an encoding perspective and
> > > > > > > > > highlights
> >
> > > > > > > > > the inefficiency in the current
> >
> > > > > code.
> >
> > > > > > > > > As has been discussed earlier, a specialized
> > > > > > > > > encode/decode
> >
> > > > > > > > > implementation for these data structures is clearly
> > > > > > > > > called
> > for.
> >
> > > > > > > > >
> >
> > > > > > > > > IMO, you'll be able to cut the size of this by AT LEAST
> > > > > > > > > a
> >
> > > > > > > > > factor of
> >
> > > > > > > > > 3 or
> >
> > > > > > > > > 4 without a lot of effort. The price will be somewhat
> >
> > > > > > > > > increased CPU cost for the serialize/deserialize operation.
> >
> > > > > > > > >
> >
> > > > > > > > > If you think of this as an application-specific data
> >
> > > > > > > > > compression problem, here is a short list of potential
> >
> > > > > > > > > compression opportunities.
> >
> > > > > > > > >
> >
> > > > > > > > > (1) Encoded sizes and offsets are 8-byte values; converting these to
> >
> > > > > > > > > block values will drop 9 or 12 bits from each value. Also, the range
> >
> > > > > > > > > of these values is usually only 2^22 -- often much less -- meaning
> >
> > > > > > > > > there's 3-5 bytes of zeros at the top of each word that can be dropped.
> >
> > > > > > > > > (2) Encoded device addresses are often less than 2^32,
> >
> > > > > > > > > meaning there's 3-4
> >
> > > > > > > > bytes of zeros at the top of each word that can be dropped.
> >
> > > > > > > > >    (3) Encoded offsets and sizes are often exactly "1" block; clever
> >
> > > > > > > > > choices of formatting can eliminate these entirely.
> >
> > > > > > > > > IMO, an optimized encoded form of the extent table will
> > > > > > > > > be
> >
> > > > > > > > > around
> >
> > > > > > > > > 1/4 of the current encoding (for this use-case) and will
> >
> > > > > > > > > likely result in an Onode that's only 1/3 of the size
> > > > > > > > > that
> >
> > > > > > > > > Somnath is seeing.
> >
> > > > > > > > That will be true for the lextent and blob extent maps.
> > > > > > > > I'm
> >
> > > > > > > > guessing this is a small part of the ~5K Somnath saw.  If
> >
> > > > > > > > his objects are 4MB then 4KB of it
> >
> > > > > > > > (80%) is the csum_data vector, which is a flat vector of
> >
> > > > > > > > u32 values that are presumably not very compressible.
> >
> > > > > > > I don't think that's what Somnath is seeing (obviously some
> >
> > > > > > > data here will
> >
> > > > > sharpen up our speculations). But in his use case, I believe
> > > > > that
> >
> > > > > he has a separate blob and pextent for each 4K write (since it's
> >
> > > > > been subjected to random 4K overwrites), which means somewhere in
> >
> > > > > the data structures at least one address and one length for each
> >
> > > > > of the 4K blocks (and likely much more in the lextent and blob
> >
> > > > > maps as you alluded to above). The encoding of just this
> >
> > > > > information alone is larger than the checksum data.
> >
> > > > > > > > We could perhaps break these into a separate key or keyspace..
> >
> > > > > > > > That'll give rocksdb a bit more computation work to do
> > > > > > > > (for
> >
> > > > > > > > a custom merge operator, probably, to update just a piece
> > > > > > > > of
> >
> > > > > > > > the value) but for a 4KB value I'm not sure it's big
> > > > > > > > enough
> >
> > > > > > > > to really help.  Also we'd lose locality, would need a
> >
> > > > > > > > second get to load csum metadata on
> >
> > > > > read, etc.
> >
> > > > > > > > :/  I don't really have any good ideas here.
> >
> > > > > > > >
> >
> > > > > > > > sage
> >
> > > > > > > >
> >
> > > > > > > >
> >
> > > > > > > > > Allen Samuels
> >
> > > > > > > > > SanDisk |a Western Digital brand
> >
> > > > > > > > > 2880 Junction Avenue, Milpitas, CA 95134
> >
> > > > > > > > > T: +1 408 801 7030| M: +1 408 780 6416
> >
> > > > > > > > > allen.samuels@xxxxxxxxxxx
> >
> > > > > > > > >
> >
> > > > > > > > >
> >
> > > > > > > > > > -----Original Message-----
> >
> > > > > > > > > > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> >
> > > > > > > > > > Sent: Friday, June 10, 2016 2:35 AM
> >
> > > > > > > > > > To: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
> >
> > > > > > > > > > Cc: Mark Nelson <mnelson@xxxxxxxxxx>; Allen Samuels
> >
> > > > > > > > > > <Allen.Samuels@xxxxxxxxxxx>; Manavalan Krishnan
> >
> > > > > > > > > > <Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development
> >
> > > > > > > > > > <ceph- devel@xxxxxxxxxxxxxxx>
> >
> > > > > > > > > > Subject: RE: RocksDB tuning
> >
> > > > > > > > > >
> >
> > > > > > > > > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> >
> > > > > > > > > > > Sage/Mark,
> >
> > > > > > > > > > > I debugged the code and it seems there is no WAL write
> >
> > > > > > > > > > > going on and it is working as expected. But, in the process,
> >
> > > > > > > > > > > I found that the onode size it is writing in my environment
> >
> > > > > > > > > > > is ~7K !! See this debug print.
> >
> > > > > > > > > > > 2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0) onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> >
> > > > > > > > > > > This explains why so much data is going to rocksdb, I guess.
> >
> > > > > > > > > > > Once compaction kicks in, the iops I am getting are *30 times* slower.
> >
> > > > > > > > > > >
> >
> > > > > > > > > > > I have 15 osds on 8TB drives and I have created a 4TB rbd
> >
> > > > > > > > > > > image preconditioned with 1M. I was running a 4K RW test.
> >
> > > > > > > > > > The onode is big because of the csum metadata.  Try setting
> >
> > > > > > > > > > 'bluestore csum type = none' and see if that is the entire
> >
> > > > > > > > > > reason or if something else is going on.
> >
> > > > > > > > > >
> >
> > > > > > > > > > We may need to reconsider the way this is stored.
> >
> > > > > > > > > >
> >
> > > > > > > > > > s
> >
> > > > > > > > > >
> >
> > > > > > > > > >
> >
> > > > > > > > > >
> >
> > > > > > > > > >
> >
> > > > > > > > > > > Thanks & Regards
> >
> > > > > > > > > > > Somnath
> >
> > > > > > > > > > >
> >
> > > > > > > > > > > -----Original Message-----
> >
> > > > > > > > > > > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> >
> > > > > > > > > > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf
> > > > > > > > > > > Of
> >
> > > > > > > > > > > Somnath
> >
> > > > > > > > Roy
> >
> > > > > > > > > > > Sent: Thursday, June 09, 2016 8:23 AM
> >
> > > > > > > > > > > To: Mark Nelson; Allen Samuels; Manavalan Krishnan;
> >
> > > > > > > > > > > Ceph
> >
> > > > > > > > Development
> >
> > > > > > > > > > > Subject: RE: RocksDB tuning
> >
> > > > > > > > > > >
> >
> > > > > > > > > > > Mark,
> >
> > > > > > > > > > > As we discussed, it seems there is ~5X write amp on
> >
> > > > > > > > > > > the system with 4K
> >
> > > > > > > > > > RW. Considering the amount of data going into rocksdb
> >
> > > > > > > > > > (and thus kicking
> >
> > > > > > > > off
> >
> > > > > > > > > > compaction so fast and degrading performance
> >
> > > > > > > > > > drastically) , it seems it is
> >
> > > > > > > > still
> >
> > > > > > > > > > writing WAL (?)..I used the following rocksdb option
> > > > > > > > > > for
> >
> > > > > > > > > > faster
> >
> > > > > > > > background
> >
> > > > > > > > > > compaction as well hoping it can keep up with upcoming
> >
> > > > > > > > > > writes and
> >
> > > > > > > > writes
> >
> > > > > > > > > > won't be stalling. But, eventually, after a min or so,
> >
> > > > > > > > > > it is stalling io..
> >
> > > > > > > > > > > bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
> >
> > > > > > > > > > > I will try to debug what is going on there..
> >
> > > > > > > > > > >
> >
> > > > > > > > > > > Thanks & Regards
> >
> > > > > > > > > > > Somnath
> >
> > > > > > > > > > >
> >
> > > > > > > > > > > -----Original Message-----
> >
> > > > > > > > > > > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> >
> > > > > > > > > > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf
> > > > > > > > > > > Of
> >
> > > > > > > > > > > Mark Nelson
> >
> > > > > > > > > > > Sent: Thursday, June 09, 2016 6:46 AM
> >
> > > > > > > > > > > To: Allen Samuels; Manavalan Krishnan; Ceph
> >
> > > > > > > > > > > Development
> >
> > > > > > > > > > > Subject: Re: RocksDB tuning
> >
> > > > > > > > > > >
> >
> > > > > > > > > > > On 06/09/2016 08:37 AM, Mark Nelson wrote:
> >
> > > > > > > > > > > > Hi Allen,
> >
> > > > > > > > > > > >
> >
> > > > > > > > > > > > On a somewhat related note, I wanted to mention
> > > > > > > > > > > > that
> >
> > > > > > > > > > > > I had
> >
> > > > > > > > forgotten
> >
> > > > > > > > > > > > that chhabaremesh's min_alloc_size commit for
> >
> > > > > > > > > > > > different media types was committed into master:
> >
> > > > > > > > > > > >
> >
> > > > > > > > > > > >
> >
> > > > >
> > > > > > > > > > > > https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3efd187
> >
> > > > > > > > > > > >
> >
> > > > > > > > > > > >
> >
> > > > > > > > > > > > IE those tests appear to already have been using a
> >
> > > > > > > > > > > > 4K min alloc size due to non-rotational NVMe media.
> >
> > > > > > > > > > > > I went back and verified that explicitly changing
> >
> > > > > > > > > > > > the min_alloc size (in fact all of them to be
> >
> > > > > > > > > > > > sure) to 4k does not change the behavior from
> > > > > > > > > > > > graphs
> >
> > > > > > > > > > > > I showed yesterday.  The rocksdb compaction stalls
> >
> > > > > > > > > > > > due to excessive reads appear (at least on the
> >
> > > > > > > > > > > > surface) to be due to metadata traffic during
> > > > > > > > > > > > heavy
> >
> > > > > > > > > > > > small random
> >
> > > > > > > > writes.
> >
> > > > > > > > > > > Sorry, this was worded poorly.  Traffic due to
> >
> > > > > > > > > > > compaction of metadata
> >
> > > > > > > > (ie
> >
> > > > > > > > > > not leaked WAL data) during small random writes.
> >
> > > > > > > > > > > Mark
> >
> > > > > > > > > > >
> >
> > > > > > > > > > > > Mark
> >
> > > > > > > > > > > >
> >
> > > > > > > > > > > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> >
> > > > > > > > > > > > > Let's make a patch that creates actual Ceph
> >
> > > > > > > > > > > > > parameters for these things so that we don't
> > > > > > > > > > > > > have
> >
> > > > > > > > > > > > > to edit the source code in the
> >
> > > > > future.
> >
> > > > > > > > > > > > >
> >
> > > > > > > > > > > > > Allen Samuels
> >
> > > > > > > > > > > > > SanDisk |a Western Digital brand
> >
> > > > > > > > > > > > > 2880 Junction Avenue, San Jose, CA 95134
> >
> > > > > > > > > > > > > T: +1 408 801 7030| M: +1 408 780 6416
> >
> > > > > > > > > > > > > allen.samuels@xxxxxxxxxxx
> >
> > > > > > > > > > > > >
> >
> > > > > > > > > > > > >
> >
> > > > > > > > > > > > > > -----Original Message-----
> >
> > > > > > > > > > > > > > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> >
> > > > > > > > > > > > > > [mailto:ceph-devel- owner@xxxxxxxxxxxxxxx] On
> >
> > > > > > > > > > > > > > Behalf Of Manavalan Krishnan
> >
> > > > > > > > > > > > > > Sent: Wednesday, June 08, 2016 3:10 PM
> >
> > > > > > > > > > > > > > To: Mark Nelson <mnelson@xxxxxxxxxx>; Ceph
> >
> > > > > > > > > > > > > > Development
> >
> > > > > > > > <ceph-
> >
> > > > > > > > > > > > > > devel@xxxxxxxxxxxxxxx>
> >
> > > > > > > > > > > > > > Subject: RocksDB tuning
> >
> > > > > > > > > > > > > >
> >
> > > > > > > > > > > > > > Hi Mark
> >
> > > > > > > > > > > > > >
> >
> > > > > > > > > > > > > > Here are the tunings that we used to avoid the
> >
> > > > > > > > > > > > > > IOPs choppiness caused by rocksdb compaction.
> >
> > > > > > > > > > > > > >
> >
> > > > > > > > > > > > > > We need to add the following options in src/kv/RocksDBStore.cc
> >
> > > > > > > > > > > > > > before rocksdb::DB::Open in RocksDBStore::do_open:
> >
> > > > > > > > > > > > > >     opt.IncreaseParallelism(16);
> >
> > > > > > > > > > > > > >     opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> >
> > > > > > > > > > > > > >
> >
> > > > > > > > > > > > > >
> >
> > > > > > > > > > > > > >
> >
> > > > > > > > > > > > > > Thanks
> >
> > > > > > > > > > > > > > Mana
> >
> > > > > > > > > > > > > >
> >
> > > > > > > > > > > > > >
> >



