Re: Bluestore performance bottleneck

On 12/22/2016 04:36 PM, Sage Weil wrote:
On Thu, 22 Dec 2016, Mark Nelson wrote:
Hi Somnath,

Based on your testing, I went through and did some single OSD tests with
master (pre-extent patch) with different sharding target/max settings on one
of our NVMe nodes:

https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZb2ZWcVZHbzJRVHc

What I saw is that for 4k min_alloc/max_alloc/max_blob sizes, decreasing the
sharding target/max helped up to a point, after which it started hurting more
than it helped.  The peak is probably somewhere between 100/200 and 200/400,
though we may want to err on the side of higher values rather than lower.  RSS
memory usage of the OSD increased dramatically as the target/max sizes shrank.
CPU usage didn't change dramatically, though it was a little lower at the
extremes where performance was lowest.

For reference, 16k min_alloc pegs at around 20K IOPS in this test as well,
meaning that I think we may be hitting a common bottleneck holding us to 20K
write IOPS per OSD.
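
For anyone reproducing this, the knobs being swept are the bluestore extent
map sharding options alongside the allocation/blob sizes.  A minimal ceph.conf
sketch with illustrative values (not a recommendation; the min_alloc/max_blob
option names are assumed from the standard bluestore settings):

  [osd]
  bluestore_min_alloc_size = 4096
  bluestore_max_blob_size = 4096
  bluestore_extent_map_shard_target_size = 200
  bluestore_extent_map_shard_max_size = 400

The target/max pair is what varies between the 100/200 and 200/400 runs above.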

I noticed, however, that as the target/max size shrank, certain code paths
became more heavily exercised.  RocksDB generally took about a 2x larger
percentage of the CPU used, with a lot of it going toward CRC calculations.
We also spent a lot more time in BlueStore::ExtentMap::init_shards doing key
appends,

We can probably drop the precomputation of shard keys.  Or, keep the
std::string there, and do it as-needed.  Probably drop it entirely,
though, since it's just going to be the object key copy (usually
less than 100 bytes).
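
Purely as illustration of the "do it as-needed" option, and not the actual
change in the PR linked below: rebuild the shard key from the object key on
demand instead of caching a std::string per shard.  The helper name and the
suffix encoding here are assumptions:

  #include <cstdint>
  #include <cstdio>
  #include <string>

  // Hypothetical sketch: derive a shard's key from the onode key plus the
  // shard's logical offset only when the shard is loaded or written,
  // instead of precomputing and storing it in init_shards().
  static void generate_shard_key(const std::string& onode_key,
                                 uint32_t shard_offset,
                                 std::string* out)
  {
    out->clear();
    out->reserve(onode_key.size() + 9);  // onode key is usually < 100 bytes
    out->append(onode_key);
    char buf[9];
    snprintf(buf, sizeof(buf), "%08x", shard_offset);
    out->append(buf, 8);                 // suffix encodes the shard offset
  }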

Try this?
	https://github.com/ceph/ceph/pull/12634

Looks like this most likely reduces memory usage and increases performance quite a bit with smaller shard target/max values.  With 25/50 I'm seeing more like 2.6GB RSS memory usage and typically around 13K IOPS, with some (likely RocksDB) stalls.  I'll run through the tests again.

Mark


sage


and trimming the TwoQCache.  Given that IOPS dropped precipitously while
overall CPU usage remained high and memory usage increased dramatically, there
may be some opportunities to tune these areas of the code.  One example might
be to avoid doing string appends in the key encoding by switching to a
different data structure (a rough sketch follows).
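
One cheap variant, short of changing the data structure (a sketch under the
assumption that the append pattern itself is fine and the cost is mostly
reallocation and copies): keep a reusable, pre-reserved key buffer so each
shard key is built without allocating.

  #include <string>

  // Hypothetical scratch buffer reused across shard key builds; capacity is
  // reserved once so repeated assign/append calls never reallocate.
  struct KeyScratch {
    std::string buf;
    KeyScratch() { buf.reserve(256); }
    const std::string& rebuild(const std::string& base,
                               const char* suffix, size_t suffix_len) {
      buf.assign(base);              // keeps the existing capacity
      buf.append(suffix, suffix_len);
      return buf;
    }
  };

Whether that (or a flat char buffer, or some other structure) actually helps
would need to show up in the same perf profiles.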

FWIW, I did not notice any resharding during the steady state for any of these
tests.

Mark

On 12/21/2016 08:25 PM, Somnath Roy wrote:
<< How many blobs are in each shard, and how many shards are there?
Is there an easy way to find these out other than adding some logging?


-----Original Message-----
From: Sage Weil [mailto:sweil@xxxxxxxxxx]
Sent: Wednesday, December 21, 2016 5:30 PM
To: Somnath Roy
Cc: ceph-devel
Subject: RE: Bluestore performance bottleneck

How many blobs are in each shard, and how many shards are there?

If we go this route, I think we'll want a larger threshold for the inline
blobs (stored in the onode key) so that "normal" objects without a zillion
blobs still fit in one key...

sage

On Thu, 22 Dec 2016, Somnath Roy wrote:

Ok, a *205 byte* reduction per IO by removing extents.  Thanks!

2016-12-21 20:00:07.701845 7fcff8412700 30 submit_transaction Rocksdb transaction:
Put( Prefix = M key = 0x00000000000006fc'.0000000009.00000000000000001051' Value size = 182)
Put( Prefix = M key = 0x00000000000006fc'._fastinfo' Value size = 186)
Put( Prefix = O key = 0x7f8000000000000001b56f21ae217262'd_data.10046b8b4567.00000000000028c6!='0xfffffffffffffffeffffffffffffffff6f0005f000'x' Value size = 45)
Put( Prefix = O key = 0x7f8000000000000001b56f21ae217262'd_data.10046b8b4567.00000000000028c6!='0xfffffffffffffffeffffffffffffffff'o' Value size = 1300)
Merge( Prefix = b key = 0x0000000c8cd00000 Value size = 16)
Merge( Prefix = b key = 0x0000001067700000 Value size = 16)



-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx
[mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
Sent: Wednesday, December 21, 2016 4:39 PM
To: Sage Weil
Cc: ceph-devel
Subject: RE: Bluestore performance bottleneck

Yeah, makes sense, I missed it.  I will remove extents and see how much we
can save.
But why a 4K length/offset now touches 2 shards when the shards are
smaller is still unclear to me.

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@xxxxxxxxxx]
Sent: Wednesday, December 21, 2016 4:21 PM
To: Somnath Roy
Cc: ceph-devel
Subject: RE: Bluestore performance bottleneck

On Thu, 22 Dec 2016, Somnath Roy wrote:
Sage,
By reducing the shard size I am able to improve bluestore + rocksdb performance
by 80% for a 60G image. I will do a detailed analysis on bigger images.

Here is what I changed to reduce the decode_some() overhead. It now loops
5 times instead of the default 33.

bluestore_extent_map_shard_max_size = 50
bluestore_extent_map_shard_target_size = 45

Fio output:
------------

Default:
--------

rbd_iodepth32: (groupid=0, jobs=5): err= 0: pid=34420: Wed Dec 21 17:30:50 2016
  write: io=114208MB, bw=63615KB/s, iops=15903, runt=1838384msec
    slat (usec): min=3, max=3878, avg=15.63, stdev= 8.78
    clat (usec): min=618, max=176057, avg=20092.56, stdev=21017.34
     lat (usec): min=642, max=176063, avg=20108.20, stdev=21017.29
    clat percentiles (usec):
     |  1.00th=[ 1416],  5.00th=[ 2928], 10.00th=[ 4320], 20.00th=[ 6112],
     | 30.00th=[ 7648], 40.00th=[ 9152], 50.00th=[10944], 60.00th=[13760],
     | 70.00th=[18816], 80.00th=[32384], 90.00th=[55552], 95.00th=[68096],
     | 99.00th=[87552], 99.50th=[94720], 99.90th=[121344], 99.95th=[129536],
     | 99.99th=[142336]

Small shards:
-------------

rbd_iodepth32: (groupid=0, jobs=5): err= 0: pid=34767: Wed Dec 21 18:50:30 2016
  write: io=186917MB, bw=110585KB/s, iops=27646, runt=1730819msec
    slat (usec): min=2, max=1447, avg=14.95, stdev= 7.64
    clat (usec): min=531, max=541140, avg=11547.88, stdev=8131.72
     lat (usec): min=544, max=541156, avg=11562.83, stdev=8131.73
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    4], 10.00th=[    5], 20.00th=[    6],
     | 30.00th=[    7], 40.00th=[    9], 50.00th=[   10], 60.00th=[   12],
     | 70.00th=[   14], 80.00th=[   17], 90.00th=[   22], 95.00th=[   27],
     | 99.00th=[   37], 99.50th=[   40], 99.90th=[   51], 99.95th=[   75],
     | 99.99th=[  208]



*But* here is the overhead I am seeing which I don't quite understand.
The per-IO metadata overhead for the onode/shards with the smaller shard
size is ~30% higher.

Default:
----------

2016-12-21 17:22:31.693503 7f18aa7c1700 30 submit_transaction Rocksdb transaction:
Put( Prefix = M key = 0x000000000000048a'.0000000009.00000000000000019093' Value size = 182)
Put( Prefix = M key = 0x000000000000048a'._fastinfo' Value size = 186)
Put( Prefix = O key = 0x7f8000000000000001dc3f23'`!rbd_data.10046b8b4567.00000000000024b1!='0xfffffffffffffffeffffffffffffffff6f0014c000'x' Value size = 669)
Put( Prefix = O key = 0x7f8000000000000001dc3f23'`!rbd_data.10046b8b4567.00000000000024b1!='0xfffffffffffffffeffffffffffffffff'o' Value size = 462)
Merge( Prefix = b key = 0x000000069ae00000 Value size = 16)
Merge( Prefix = b key = 0x0000001330d00000 Value size = 16)


Smaller shard:
-----------------

2016-12-21 18:52:18.564423 7f3e5d167700 30 submit_transaction Rocksdb transaction:
Put( Prefix = M key = 0x0000000000000691'.0000000009.00000000000000057248' Value size = 182)
Put( Prefix = M key = 0x0000000000000691'._fastinfo' Value size = 186)
Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff6f00195000'x' Value size = 45)
Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff6f0019a000'x' Value size = 45)
Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff'o' Value size = 1505)
Merge( Prefix = b key = 0x00000006b9780000 Value size = 16)
Merge( Prefix = b key = 0x0000000c64f00000 Value size = 16)


*And* a lot of the time I am seeing 2 shards written compared to the default.
This will be a problem for ZS, though maybe not for Rocks.

Initially I thought blobs were spanning, but that seems not to be the case.
See the log snippet below; it seems the onode itself is bigger now.

2016-12-21 18:40:27.734044 7f3e43934700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:d1bc6a86:::rbd_data.10046b8b4567.00000000000036a2:head# is 1505 (1503 bytes onode + 2 bytes spanning blobs + 0 bytes inline extents)

Any idea what's going on ?

The onode has a list of the shards.  Since there are more, the onode is
bigger.  I wasn't really expecting the shard count to be that high.  The
structure is:

  struct shard_info {
    uint32_t offset = 0;  ///< logical offset for start of shard
    uint32_t bytes = 0;   ///< encoded bytes
    uint32_t extents = 0; ///< extents
    DENC(shard_info, v, p) {
      denc_varint(v.offset, p);
      denc_varint(v.bytes, p);
      denc_varint(v.extents, p);
    }
    void dump(Formatter *f) const;
  };
  vector<shard_info> extent_map_shards; ///< extent map shards (if any)

The offset is the important piece.  The byte and extent counts aren't that
important... they're mostly there so that a future reshard operation can
be more clever (merging or splitting adjacent shards instead of resharding
everything).  Well, the bytes field is currently used, but extents is not
used at all.  We could just drop that field now and add it (or something
else) back in later if/when we need it...
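
For illustration, dropping the extents counter would shrink the per-shard
record to two varints.  A sketch only; an actual change would also need an
encoding compat/versioning decision:

  struct shard_info {
    uint32_t offset = 0;  ///< logical offset for start of shard
    uint32_t bytes = 0;   ///< encoded bytes
    DENC(shard_info, v, p) {
      denc_varint(v.offset, p);
      denc_varint(v.bytes, p);
    }
    void dump(Formatter *f) const;
  };

With varint encoding the saving is only a byte or two per shard, but it adds
up when an onode carries many shards.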

sage



Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy
Sent: Friday, December 16, 2016 7:23 PM
To: Sage Weil (sweil@xxxxxxxxxx)
Cc: 'ceph-devel'
Subject: RE: Bluestore performance bottleneck

Sage,
Some update on this.  Without decode_some within fault_range() I am able
to drive Bluestore + rocksdb to close to ~38K IOPS, compared to ~20K IOPS
with decode_some.  I had to disable the data write because I am skipping
the decode, but on this device the data write is not a bottleneck; I have
seen that enabling/disabling the data write gives similar results.  So on
an NVMe device, if we can optimize decode_some() for performance, Bluestore
performance should bump up by ~2X.
I put some prints around decode_some() and it seems to take ~60-121
microseconds to finish, depending on how many bytes it has to decode.
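
For reference, the kind of ad-hoc timing print used here could look roughly
like the following.  The wrapper name is hypothetical and the decode_some()
call site is paraphrased, not the exact BlueStore code:

  #include <chrono>
  #include <cstdio>

  // Time a single call in microseconds; "work" stands in for the
  // decode_some() invocation being measured.
  template <typename Fn>
  static long long time_call_us(Fn&& work) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
  }

  // usage: auto us = time_call_us([&] { extent_map.decode_some(bl); });
  //        fprintf(stderr, "decode_some: %lld us\n", us);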

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Thursday, December 15, 2016 7:30 PM
To: Sage Weil (sweil@xxxxxxxxxx)
Cc: ceph-devel
Subject: Bluestore performance bottleneck

Sage,
Today morning I was talking about 2x performance drop for Bluestore
without data/db writes for 1G vs 60G volumes and it turn out the
decode_some() is the culprit for that. Presently, I am drilling down
that function to identify what exactly causing this issue, but, most
probably it is blob decode and le->blob->get_ref() combination. Will
confirm that soon. If we can fix that we should be able to considerably
bump up end-to-end pick performance with rocks/ZS on faster NVME. Slower
devices most likely we will not be able to see any benefits other than
saving some cpu cost.

Thanks & Regards
Somnath






