Re: RocksDB tuning

Igor Fedotov <ifedotov@xxxxxxxxxxxx> · Tue, 14 Jun 2016 14:11:40 +0300

I was talking about my local environment where I ran the test case. I
have a 64K minimum blob size here, hence I assume at most 64 blobs per
4M (4M / 64K = 64).

On 10.06.2016 20:13, Allen Samuels wrote:
What's the assumption that suggests a limit of 64 blobs / 4MB ? Are you assuming a 64K blobsize?? That certainly won't be the case for flash.

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

-----Original Message-----
From: Igor Fedotov [mailto:ifedotov@xxxxxxxxxxxx]
Sent: Friday, June 10, 2016 9:51 AM
To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil
<sweil@xxxxxxxxxx>; Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
Cc: Mark Nelson <mnelson@xxxxxxxxxx>; Manavalan Krishnan
<Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development
<ceph-devel@xxxxxxxxxxxxxxx>
Subject: Re: RocksDB tuning

An update:

I found that my previous results were invalid - SyntheticWorkloadState had
an odd swap for offset > len case... Made a brief fix.

Now the onode size rises to as much as 38K with csum, and 28K without.

For the csum case there are 350 lextents and about 170 blobs.

For no csum there are 343 lextents and about 170 blobs.

(The blob counting is very inaccurate!)

Potentially we shouldn't have >64 blobs per 4M, so it looks like there
are some issues in the write path...

And the CSum vs. NoCsum difference looks pretty consistent: 170 blobs *
4 bytes * 16 values = 10880 bytes.
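
The estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic in this message: the function name is made up for illustration, and reading "16 values" as a 64K blob split into 4K csum blocks is an assumption based on the settings mentioned elsewhere in the thread.

```python
def csum_overhead_bytes(num_blobs: int, csum_value_size: int, values_per_blob: int) -> int:
    """Bytes of checksum data carried by one onode, summed over its blobs."""
    return num_blobs * csum_value_size * values_per_blob


# Figures from the message: ~170 blobs, 4-byte csum values, 16 values per blob.
# Assumption: 16 values corresponds to a 64K blob with 4K csum blocks.
values_per_blob = (64 * 1024) // (4 * 1024)
overhead = csum_overhead_bytes(170, 4, values_per_blob)
print(overhead)  # 10880, close to the observed gap between 38K and 28K
```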

The branch on GitHub has been updated with the corresponding fixes.

Thanks,
Igor.

On 10.06.2016 19:06, Allen Samuels wrote:
Let's see, 4MB is 2^22 bytes. If we store a checksum for each 2^12 bytes,
that's 2^10 checksums at 2^2 bytes each, which is 2^12 bytes.
So with optimal encoding, the checksum baggage shouldn't be more than
4KB per oNode.
But you're seeing 13K as the upper bound on the onode size.
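
The powers-of-two reasoning above can be written out as a small sketch. The function name and parameters are hypothetical and only restate the arithmetic in this message (a 4MB object, one checksum per 4KB, 4 bytes per checksum).

```python
def csum_bytes_per_object(object_size: int, csum_block_size: int, csum_value_size: int) -> int:
    """Checksum bytes needed when every csum block of the object has one checksum."""
    num_checksums = object_size // csum_block_size
    return num_checksums * csum_value_size


# 4MB object, one checksum per 4KB block, 4-byte checksums.
print(csum_bytes_per_object(2**22, 2**12, 2**2))  # 4096, i.e. 4KB per onode
```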

In the worst case, you'll need at least another block address (8 bytes
currently) and length (another 8 bytes) for each 4K block [though as I
point out, the length is something that can be optimized out]. So in the
worst case, this encoding would be an additional 16KB per onode
(2^10 blocks * 16 bytes).
I suspect you're not at the worst case yet :)
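
A rough sketch of that worst-case figure follows. It assumes one address and one length per 4K block of a 4MB object, as described above; the helper name and its defaults are made up for illustration and do not reflect the actual BlueStore encoder.

```python
def worst_case_extent_bytes(object_size: int, block_size: int,
                            addr_size: int = 8, len_size: int = 8) -> int:
    """Bytes needed if every block carries its own address and length."""
    num_blocks = object_size // block_size
    return num_blocks * (addr_size + len_size)


# One 8-byte address plus one 8-byte length for each of the 2^10 blocks.
print(worst_case_extent_bytes(2**22, 2**12))              # 16384 (16KB)
# If the length is optimized out, only the addresses remain.
print(worst_case_extent_bytes(2**22, 2**12, len_size=0))  # 8192 (8KB)
```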

Allen Samuels

-----Original Message-----
From: Igor Fedotov [mailto:ifedotov@xxxxxxxxxxxx]
Sent: Friday, June 10, 2016 8:58 AM
To: Sage Weil <sweil@xxxxxxxxxx>; Somnath Roy
<Somnath.Roy@xxxxxxxxxxx>
Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Mark Nelson
<mnelson@xxxxxxxxxx>; Manavalan Krishnan
<Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development
<ceph-devel@xxxxxxxxxxxxxxx>
Subject: Re: RocksDB tuning

Just modified the store_test synthetic test case to simulate many random
4K writes to a 4M object.

With default settings (crc32c + 4K block) the onode size varies from 2K to
~13K; with crc disabled it's ~500 - 1300 bytes.

Hence the root cause seems to be in the csum array.

Here is the updated branch:

https://github.com/ifed01/ceph/tree/wip-bluestore-test-size

Thanks,

Igor

On 10.06.2016 18:40, Sage Weil wrote:
On Fri, 10 Jun 2016, Somnath Roy wrote:
Just turning off checksum with the below param is not helping. I still
need to check the onode size by enabling debug, though.. Do I need to
mkfs (Sage?), since it is still holding checksums of the old data I wrote?

Yeah.. you'll need to mkfs to blow away the old onodes and blobs with
csum data.

As Allen pointed out, this is only part of the problem.. but I'm
curious how much!

           bluestore_csum = false
           bluestore_csum_type = none

Here is the snippet of 'dstat'..

----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
 41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0 >
 42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0 >
 40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0 >
 40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0 >
 42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0 >
 35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0 >
 31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0 >
 39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0 >
 40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0 >
 40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0 >
 42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0 >
For example, the last entry says that the system (with 8 OSDs) is
receiving 216M of data over the network and, in response, is writing a
total of 852M of data and reading 143M of data. At this time FIO on the
client side is reporting ~35K 4K RW IOPS.
Now, after a minute or so, the throughput reported by FIO goes down to
barely 1K (and is very bumpy), and here is the 'dstat' snippet at that
time..
----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
  2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0 >
  2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0 >
  3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0 >
  2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0 >

So, the system is barely receiving anything (~2M) but is still writing
~54M of data and reading 226M of data from disk.
After killing the fio script, here is the 'dstat' output..

----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
  2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0 >
  2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0 >
  2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0 >
  2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0 >

It is not receiving anything from the client but is still writing 78M of
data and reading 206M.
Clearly, this is an effect of rocksdb compaction stalling IO: even after
we increased the compaction threads (and did other tuning), compaction is
not able to keep up with incoming IO.
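
As a rough illustration of the first dstat table, the ratio of disk writes to network receive can be computed per sample. This is only a sketch: the (recv, writ) pairs are copied from the table in MB, the helper name is made up, and the ratio is a crude proxy rather than a precise write-amplification figure, since the received traffic is not broken down by source.

```python
def write_to_recv_ratio(disk_write_mb: float, net_recv_mb: float) -> float:
    """Disk bytes written per network byte received over one dstat interval."""
    if net_recv_mb <= 0:
        raise ValueError("net_recv_mb must be positive")
    return disk_write_mb / net_recv_mb


# (net recv, disk writ) pairs from the first dstat table, in MB.
samples = [(212, 841), (213, 855), (209, 815), (194, 933), (220, 918), (194, 788),
           (151, 713), (246, 836), (204, 845), (210, 743), (216, 852)]

ratios = [write_to_recv_ratio(writ, recv) for recv, writ in samples]
print(round(ratios[-1], 2))  # 3.94 for the last entry (852M written, 216M received)
```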
Thanks & Regards
Somnath

-----Original Message-----
From: Allen Samuels
Sent: Friday, June 10, 2016 8:06 AM
To: Sage Weil
Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph Development
Subject: RE: RocksDB tuning

-----Original Message-----
From: Sage Weil [mailto:sweil@xxxxxxxxxx]
Sent: Friday, June 10, 2016 7:55 AM
To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
Cc: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; Mark Nelson
<mnelson@xxxxxxxxxx>; Manavalan Krishnan
<Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development
<ceph-devel@xxxxxxxxxxxxxxx>
Subject: RE: RocksDB tuning

On Fri, 10 Jun 2016, Allen Samuels wrote:
Checksums are definitely a part of the problem, but I suspect the
smaller part of the problem. This particular use-case (random 4K
overwrites without the WAL stuff) is the worst-case from an
encoding perspective and highlights the inefficiency in the current
code.
As has been discussed earlier, a specialized encode/decode
implementation for these data structures is clearly called for.

IMO, you'll be able to cut the size of this by AT LEAST a factor of 3 or
4 without a lot of effort. The price will be a somewhat increased CPU
cost for the serialize/deserialize operation.

If you think of this as an application-specific data compression
problem, here is a short list of potential compression opportunities.

(1) Encoded sizes and offsets are 8-byte values; converting these to
block values will drop 9 or 12 bits from each value. Also, the range
for these values is usually only 2^22 -- often much less. Meaning that
there's 3-5 bytes of zeros at the top of each word that can be dropped.
(2) Encoded device addresses are often less than 2^32, meaning there's
3-4 bytes of zeros at the top of each word that can be dropped.
(3) Encoded offsets and sizes are often exactly "1" block; clever
choices of formatting can eliminate these entirely.
IMO, an optimized encoded form of the extent table will be around
1/4 of the current encoding (for this use-case) and will likely
result in an Onode that's only 1/3 of the size that Somnath is seeing.
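
To make opportunity (1) concrete, here is a minimal sketch that converts byte offsets and lengths to block units and stores them with a LEB128-style variable-length integer, so the leading zero bytes are never written. This is an illustration only, not the encoding BlueStore uses or later adopted; the function names and the 4K block size are assumptions.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a little-endian base-128 varint."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # more bytes follow
        else:
            out.append(low_bits)         # last byte
            return bytes(out)


def encode_extent(offset_bytes: int, length_bytes: int, block_size: int = 4096) -> bytes:
    """Encode an extent in block units instead of two fixed 8-byte fields."""
    if offset_bytes % block_size or length_bytes % block_size:
        raise ValueError("offset and length must be block-aligned")
    return encode_varint(offset_bytes // block_size) + encode_varint(length_bytes // block_size)


# Last 4K block of a 4MB object: block 1023, length 1 block.
encoded = encode_extent(0x3FF000, 4096)
print(len(encoded))  # 3 bytes, versus 16 bytes for two fixed 8-byte values
```

Opportunity (3) would go further by not storing the common length of "1" block at all, which this sketch does not attempt.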
That will be true for the lextent and blob extent maps.  I'm
guessing this is a small part of the ~5K somnath saw.  If his
objects are 4MB then 4KB of it (80%) is the csum_data vector, which is
a flat vector of u32 values that are presumably not very compressible.
I don't think that's what Somnath is seeing (obviously some data here
will sharpen up our speculations). But in his use case, I believe that he
has a separate blob and pextent for each 4K write (since it's been
subjected to random 4K overwrites). That means somewhere in the data
structures there is at least one address and one length for each of the
4K blocks (and likely much more in the lextent and blob maps, as you
alluded to above). The encoding of just this information alone is larger
than the checksum data.
We could perhaps break these into a separate key or keyspace..
That'll give rocksdb a bit more computation work to do (for a custom
merge operator, probably, to update just a piece of the value) but
for a 4KB value I'm not sure it's big enough to really help.  Also
we'd lose locality, would need a second get to load csum metadata on
read, etc.
:/  I don't really have any good ideas here.

sage

Allen Samuels

-----Original Message-----
From: Sage Weil [mailto:sweil@xxxxxxxxxx]
Sent: Friday, June 10, 2016 2:35 AM
To: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
Cc: Mark Nelson <mnelson@xxxxxxxxxx>; Allen Samuels
<Allen.Samuels@xxxxxxxxxxx>; Manavalan Krishnan
<Manavalan.Krishnan@xxxxxxxxxxx>; Ceph Development
<ceph-devel@xxxxxxxxxxxxxxx>
Subject: RE: RocksDB tuning

On Fri, 10 Jun 2016, Somnath Roy wrote:
Sage/Mark,
I debugged the code and it seems there is no WAL write going on, so that
is working as expected. But, in the process, I found that the onode size
being written in my environment is ~7K !! See this debug print.

2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518

This explains why so much data is going to rocksdb, I guess. Once
compaction kicks in, the iops I am getting are *30 times* slower.

I have 15 osds on 8TB drives and I have created 4TB rbd image
preconditioned with 1M. I was running 4K RW test.
The onode is big because of the csum metadata.  Try setting 'bluestore
csum type = none' and see if that is the entire reason or if something
else is going on.

We may need to reconsider the way this is stored.

s

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx
[mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
Sent: Thursday, June 09, 2016 8:23 AM
To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
Subject: RE: RocksDB tuning

Mark,
As we discussed, it seems there is ~5X write amp on the system with 4K
RW. Considering the amount of data going into rocksdb (and thus kicking
off compaction so fast and degrading performance drastically), it seems
it is still writing WAL (?).. I also used the following rocksdb options
for faster background compaction, hoping it can keep up with incoming
writes so that writes won't stall. But, eventually, after a min or so,
it is stalling io..
bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
I will try to debug what is going on there..
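
For readers who want to see the byte-valued settings in more familiar units, a small sketch can split a comma-separated option string of this shape and print a few sizes in MiB. The parser below is a hypothetical helper written for this note, not how Ceph or RocksDB parses the string, and only a subset of the options above is included.

```python
MIB = 1024 * 1024


def parse_option_string(option_string: str) -> dict:
    """Split 'key=value,key=value' into a dict, converting numeric values to int."""
    options = {}
    for item in option_string.split(","):
        key, sep, value = item.strip().partition("=")
        if not key or not sep or not value:
            raise ValueError(f"malformed option: {item!r}")
        options[key] = int(value) if value.isdigit() else value
    return options


# A subset of the options quoted above.
opts = parse_option_string(
    "compression=kNoCompression,write_buffer_size=67108864,"
    "target_file_size_base=67108864,max_bytes_for_level_base=536870912,"
    "max_bytes_for_level_multiplier=8,num_levels=4"
)

for key in ("write_buffer_size", "target_file_size_base", "max_bytes_for_level_base"):
    print(key, opts[key] // MIB, "MiB")
# write_buffer_size 64 MiB
# target_file_size_base 64 MiB
# max_bytes_for_level_base 512 MiB
```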

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx
[mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
Sent: Thursday, June 09, 2016 6:46 AM
To: Allen Samuels; Manavalan Krishnan; Ceph Development
Subject: Re: RocksDB tuning

On 06/09/2016 08:37 AM, Mark Nelson wrote:
Hi Allen,

On a somewhat related note, I wanted to mention that I had forgotten
that chhabaremesh's min_alloc_size commit for different media types was
committed into master:

https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3efd187

IE those tests appear to already have been using a 4K min alloc size
due to non-rotational NVMe media.  I went back and verified that
explicitly changing the min_alloc size (in fact all of them, to be sure)
to 4k does not change the behavior from the graphs I showed yesterday.
The rocksdb compaction stalls due to excessive reads appear (at least on
the surface) to be due to metadata traffic during heavy small random
writes.

Sorry, this was worded poorly.  Traffic due to compaction of metadata
(ie not leaked WAL data) during small random writes.
Mark

Mark

On 06/08/2016 06:52 PM, Allen Samuels wrote:
Let's make a patch that creates actual Ceph parameters for these things
so that we don't have to edit the source code in the future.
Allen Samuels

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx
[mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Manavalan Krishnan
Sent: Wednesday, June 08, 2016 3:10 PM
To: Mark Nelson <mnelson@xxxxxxxxxx>; Ceph Development
<ceph-devel@xxxxxxxxxxxxxxx>
Subject: RocksDB tuning

Hi Mark

Here are the tunings that we used to avoid the IOPS choppiness caused
by rocksdb compaction.

We need to add the following options in src/kv/RocksDBStore.cc before
rocksdb::DB::Open in RocksDBStore::do_open:

     // use a total of 16 background threads for flush and compaction
     opt.IncreaseParallelism(16);
     // tune level-style compaction for a 512MB memtable memory budget
     opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);

Thanks
Mana

PLEASE NOTE: The information contained in this electronic mail message
is intended only for the use of the designated recipient(s) named above.
If the reader of this message is not the intended recipient, you are
hereby notified that you have received this message in error and that
any review, dissemination, distribution, or copying of this message is
strictly prohibited. If you have received this communication in error,
please notify the sender by telephone or e-mail (as shown above)
immediately and destroy any and all copies of this message in your
possession (whether hard copies or electronically stored copies).
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html