On Thu, 22 Dec 2016, Mark Nelson wrote:
> Hi Somnath,
> 
> Based on your testing, I went through and did some single OSD tests with
> master (pre-extent patch) with different sharding target/max settings on
> one of our NVMe nodes:
> 
> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZb2ZWcVZHbzJRVHc
> 
> What I saw is that for 4k min_alloc/max_alloc/max_blob sizes, decreasing
> the sharding target/max helped to a point where it started hurting more
> than it helped.  The peak is probably somewhere between 100/200 and
> 200/400, though we may want to err on the side of higher values rather
> than lower.  RSS memory usage of the OSD increased dramatically as the
> target/max sizes shrank.  CPU usage didn't change dramatically, though it
> was a little lower at the extremes where performance was lowest.
> 
> For reference, 16k min_alloc pegs at around 20K IOPS in this test as
> well, meaning that I think we may be hitting a common bottleneck holding
> us to 20K write IOPS per OSD.
> 
> I noticed that as the target/max size shrank, certain code paths became
> more heavily exercised, however.  RocksDB generally took about a 2x
> larger percentage of the used CPU, with a lot of it going toward CRC
> calculations.  We also spent a lot more time in
> BlueStore::ExtentMap::init_shards doing key appends,

We can probably drop the precomputation of shard keys.  Or, keep the
std::string there, and do it as-needed.  Probably drop it entirely,
though, since it's just going to be the object key copy (usually less
than 100 bytes).

Try this?

	https://github.com/ceph/ceph/pull/12634

sage

> and trimming the TwoQCache.  Given that the IOPS dropped precipitously,
> while overall CPU usage remained high and memory usage increased
> dramatically, there may be some opportunities to tune these areas of the
> code.  One example might be to avoid doing string appends in the key
> encoding by switching to a different data structure.
> 
> FWIW, I did not notice any resharding during the steady state for any of
> these tests.
> 
> Mark
> 
> On 12/21/2016 08:25 PM, Somnath Roy wrote:
> > << How many blobs are in each shard, and how many shards are there?
> > Is there any easy way to find these out other than adding some logging?
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > Sent: Wednesday, December 21, 2016 5:30 PM
> > To: Somnath Roy
> > Cc: ceph-devel
> > Subject: RE: Bluestore performance bottleneck
> > 
> > How many blobs are in each shard, and how many shards are there?
> > 
> > If we go this route, I think we'll want a larger threshold for the
> > inline blobs (stored in the onode key) so that "normal" objects without
> > a zillion blobs still fit in one key...
> > 
> > sage
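Sage's suggestion above (and the PR it points at) is to stop caching a
per-shard key string entirely.  The extent-shard keys visible in the
transaction dumps later in the thread are just the onode key followed by
the shard's 32-bit logical offset and an 'x' suffix, so a key can be
rebuilt on demand into a reusable scratch buffer.  A minimal sketch of
that idea, assuming that key layout; the helper name is illustrative
rather than the actual BlueStore function, and PR 12634 is the
authoritative change:

  // Sketch only: rebuild an extent-shard key on demand from the onode key
  // and the shard's logical offset, rather than caching a std::string per
  // shard.  Assumed key layout (taken from the dumps in this thread):
  //   <onode key, ending in 'o'> + 32-bit big-endian offset + 'x'
  #include <cstdint>
  #include <string>

  void append_extent_shard_key(const std::string& onode_key,
                               uint32_t shard_offset,
                               std::string* out)   // reusable scratch buffer
  {
    out->clear();
    out->reserve(onode_key.size() + sizeof(uint32_t) + 1);
    out->append(onode_key);
    for (int shift = 24; shift >= 0; shift -= 8)   // big-endian offset bytes
      out->push_back(static_cast<char>((shard_offset >> shift) & 0xff));
    out->push_back('x');                           // extent-shard suffix
  }

Reusing one scratch buffer while iterating shards keeps this to a single
append into preallocated memory, so the cached per-shard string (and the
init_shards key appends showing up in Mark's profiles) can be dropped for
a small amount of extra CPU per shard touched.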
> > On Thu, 22 Dec 2016, Somnath Roy wrote:
> > > Ok, *205 bytes* reduction per IO by removing extents.  Thanks!
> > > 
> > > 2016-12-21 20:00:07.701845 7fcff8412700 30 submit_transaction Rocksdb transaction:
> > > Put( Prefix = M key = 0x00000000000006fc'.0000000009.00000000000000001051' Value size = 182)
> > > Put( Prefix = M key = 0x00000000000006fc'._fastinfo' Value size = 186)
> > > Put( Prefix = O key = 0x7f8000000000000001b56f21ae217262'd_data.10046b8b4567.00000000000028c6!='0xfffffffffffffffeffffffffffffffff6f0005f000'x' Value size = 45)
> > > Put( Prefix = O key = 0x7f8000000000000001b56f21ae217262'd_data.10046b8b4567.00000000000028c6!='0xfffffffffffffffeffffffffffffffff'o' Value size = 1300)
> > > Merge( Prefix = b key = 0x0000000c8cd00000 Value size = 16)
> > > Merge( Prefix = b key = 0x0000001067700000 Value size = 16)
> > > 
> > > -----Original Message-----
> > > From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> > > Sent: Wednesday, December 21, 2016 4:39 PM
> > > To: Sage Weil
> > > Cc: ceph-devel
> > > Subject: RE: Bluestore performance bottleneck
> > > 
> > > Yeah, makes sense; I missed it.  I will remove extents and see how much
> > > we can save.
> > > But why a 4K length/offset now touches 2 shards when the shards are
> > > smaller is still unclear to me.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > > Sent: Wednesday, December 21, 2016 4:21 PM
> > > To: Somnath Roy
> > > Cc: ceph-devel
> > > Subject: RE: Bluestore performance bottleneck
> > > 
> > > On Thu, 22 Dec 2016, Somnath Roy wrote:
> > > > Sage,
> > > > By reducing the shard size I am able to improve bluestore + rocksdb
> > > > performance by 80% for a 60G image.  Will do detailed analysis on
> > > > bigger images.
> > > > 
> > > > Here is what I changed to reduce decode_some() overhead.  It is now
> > > > looping 5 times instead of the default 33.
> > > > 
> > > > bluestore_extent_map_shard_max_size = 50
> > > > bluestore_extent_map_shard_target_size = 45
> > > > 
> > > > Fio output:
> > > > -------------
> > > > Default:
> > > > ----------
> > > > 
> > > > rbd_iodepth32: (groupid=0, jobs=5): err= 0: pid=34420: Wed Dec 21 17:30:50 2016
> > > >   write: io=114208MB, bw=63615KB/s, iops=15903, runt=1838384msec
> > > >     slat (usec): min=3, max=3878, avg=15.63, stdev= 8.78
> > > >     clat (usec): min=618, max=176057, avg=20092.56, stdev=21017.34
> > > >      lat (usec): min=642, max=176063, avg=20108.20, stdev=21017.29
> > > >     clat percentiles (usec):
> > > >      |  1.00th=[ 1416],  5.00th=[ 2928], 10.00th=[ 4320], 20.00th=[ 6112],
> > > >      | 30.00th=[ 7648], 40.00th=[ 9152], 50.00th=[10944], 60.00th=[13760],
> > > >      | 70.00th=[18816], 80.00th=[32384], 90.00th=[55552], 95.00th=[68096],
> > > >      | 99.00th=[87552], 99.50th=[94720], 99.90th=[121344], 99.95th=[129536],
> > > >      | 99.99th=[142336]
> > > > 
> > > > Small shards:
> > > > ----------------
> > > > 
> > > > rbd_iodepth32: (groupid=0, jobs=5): err= 0: pid=34767: Wed Dec 21 18:50:30 2016
> > > >   write: io=186917MB, bw=110585KB/s, iops=27646, runt=1730819msec
> > > >     slat (usec): min=2, max=1447, avg=14.95, stdev= 7.64
> > > >     clat (usec): min=531, max=541140, avg=11547.88, stdev=8131.72
> > > >      lat (usec): min=544, max=541156, avg=11562.83, stdev=8131.73
> > > >     clat percentiles (msec):
> > > >      |  1.00th=[    3],  5.00th=[    4], 10.00th=[    5], 20.00th=[    6],
> > > >      | 30.00th=[    7], 40.00th=[    9], 50.00th=[   10], 60.00th=[   12],
> > > >      | 70.00th=[   14], 80.00th=[   17], 90.00th=[   22], 95.00th=[   27],
> > > >      | 99.00th=[   37], 99.50th=[   40], 99.90th=[   51], 99.95th=[   75],
> > > >      | 99.99th=[  208]
> > > > 
> > > > *But* here is the overhead I am seeing, which I don't quite understand.
> > > > The per-IO metadata overhead for the onode/shards is ~30% higher with
> > > > the smaller shards.
> > > > 
> > > > Default:
> > > > ----------
> > > > 
> > > > 2016-12-21 17:22:31.693503 7f18aa7c1700 30 submit_transaction Rocksdb transaction:
> > > > Put( Prefix = M key = 0x000000000000048a'.0000000009.00000000000000019093' Value size = 182)
> > > > Put( Prefix = M key = 0x000000000000048a'._fastinfo' Value size = 186)
> > > > Put( Prefix = O key = 0x7f8000000000000001dc3f23'`!rbd_data.10046b8b4567.00000000000024b1!='0xfffffffffffffffeffffffffffffffff6f0014c000'x' Value size = 669)
> > > > Put( Prefix = O key = 0x7f8000000000000001dc3f23'`!rbd_data.10046b8b4567.00000000000024b1!='0xfffffffffffffffeffffffffffffffff'o' Value size = 462)
> > > > Merge( Prefix = b key = 0x000000069ae00000 Value size = 16)
> > > > Merge( Prefix = b key = 0x0000001330d00000 Value size = 16)
> > > > 
> > > > Smaller shard:
> > > > -----------------
> > > > 
> > > > 2016-12-21 18:52:18.564423 7f3e5d167700 30 submit_transaction Rocksdb transaction:
> > > > Put( Prefix = M key = 0x0000000000000691'.0000000009.00000000000000057248' Value size = 182)
> > > > Put( Prefix = M key = 0x0000000000000691'._fastinfo' Value size = 186)
> > > > Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff6f00195000'x' Value size = 45)
> > > > Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff6f0019a000'x' Value size = 45)
> > > > Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff'o' Value size = 1505)
> > > > Merge( Prefix = b key = 0x00000006b9780000 Value size = 16)
> > > > Merge( Prefix = b key = 0x0000000c64f00000 Value size = 16)
> > > > 
> > > > *And* a lot of the time I am seeing 2 shards written, compared to the
> > > > default.  This will be a problem for ZS; it may not be for Rocks.
> > > > 
> > > > Initially I thought blobs were spanning, but it seems that is not the
> > > > case.  See the log snippet below; it seems the onode itself is bigger
> > > > now.
> > > > 
> > > > 2016-12-21 18:40:27.734044 7f3e43934700 20 bluestore(/var/lib/ceph/osd/ceph-0) onode
> > > > #1:d1bc6a86:::rbd_data.10046b8b4567.00000000000036a2:head# is 1505 (1503
> > > > bytes onode + 2 bytes spanning blobs + 0 bytes inline extents)
> > > > 
> > > > Any idea what's going on?
> > > 
> > > The onode has a list of the shards.  Since there are more, the onode is
> > > bigger.  I wasn't really expecting the shard count to be that high.  The
> > > structure is:
> > > 
> > >   struct shard_info {
> > >     uint32_t offset = 0;  ///< logical offset for start of shard
> > >     uint32_t bytes = 0;   ///< encoded bytes
> > >     uint32_t extents = 0; ///< extents
> > >     DENC(shard_info, v, p) {
> > >       denc_varint(v.offset, p);
> > >       denc_varint(v.bytes, p);
> > >       denc_varint(v.extents, p);
> > >     }
> > >     void dump(Formatter *f) const;
> > >   };
> > >   vector<shard_info> extent_map_shards; ///< extent map shards (if any)
> > > 
> > > The offset is the important piece.  The byte and extent counts aren't
> > > that important... they're mostly there so that a future reshard
> > > operation can be more clever (merging or splitting adjacent shards
> > > instead of resharding everything).  Well, the bytes field is currently
> > > used, but extents is not at all.  We could just drop that field now and
> > > add it (or something else) back in later if/when we need it...
> > > 
> > > sage
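For scale, assuming the default 4MB RBD object size: with a 4K min_alloc
size an object can carry on the order of a thousand extents, and with a
45-50 byte shard target that is very roughly 200 shards, each contributing
a few varint bytes of shard_info to the onode, which accounts for most of
the 1300-1505 byte onode values in the dumps above.  Dropping the unused
extents counter saves roughly one varint (typically one byte) per shard,
in line with the ~205-byte onode reduction Somnath reports at the top of
the thread.  A sketch of the trimmed struct, mirroring the code quoted
above (DENC and denc_varint are Ceph's denc.h macros; the encoding
version/compat handling a real change would need is omitted):

  // Sketch only: shard_info with the unused `extents` counter dropped, as
  // suggested above.  Mirrors the struct quoted in this mail; not a
  // standalone program (DENC/denc_varint come from include/denc.h).
  struct shard_info {
    uint32_t offset = 0;  ///< logical offset for start of shard
    uint32_t bytes = 0;   ///< encoded bytes (still used by reshard logic)
    DENC(shard_info, v, p) {
      denc_varint(v.offset, p);
      denc_varint(v.bytes, p);
    }
    void dump(Formatter *f) const;
  };
  vector<shard_info> extent_map_shards;  ///< extent map shards (if any)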
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Friday, December 16, 2016 7:23 PM
> > > > To: Sage Weil (sweil@xxxxxxxxxx)
> > > > Cc: 'ceph-devel'
> > > > Subject: RE: Bluestore performance bottleneck
> > > > 
> > > > Sage,
> > > > Some update on this.  Without decode_some() within fault_range() I am
> > > > able to drive Bluestore + rocksdb close to ~38K iops, compared to ~20K
> > > > iops with decode_some().  I had to disable the data write because I am
> > > > skipping the decode, but on this device the data write is not a
> > > > bottleneck; I have seen that enabling/disabling the data write gives
> > > > similar results.  So, on NVMe devices, if we can optimize
> > > > decode_some() for performance, Bluestore performance should bump up by
> > > > ~2X.
> > > > I did some prints around decode_some() and it seems to take ~60-121
> > > > microseconds to finish, depending on the number of bytes to decode.
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Thursday, December 15, 2016 7:30 PM
> > > > To: Sage Weil (sweil@xxxxxxxxxx)
> > > > Cc: ceph-devel
> > > > Subject: Bluestore performance bottleneck
> > > > 
> > > > Sage,
> > > > This morning I was talking about the 2x performance drop for Bluestore
> > > > without data/db writes for 1G vs 60G volumes, and it turns out
> > > > decode_some() is the culprit.  Presently I am drilling down into that
> > > > function to identify what exactly is causing this, but most probably
> > > > it is the combination of the blob decode and le->blob->get_ref().
> > > > Will confirm that soon.  If we can fix that, we should be able to
> > > > considerably bump up end-to-end peak performance with rocks/ZS on
> > > > faster NVMe.  On slower devices we most likely will not see any
> > > > benefit other than saving some CPU cost.
> > > > 
> > > > Thanks & Regards
> > > > Somnath
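As a reference point for the 60-121 microsecond figure, the measurement
described ("some prints around decode_some()") can be reproduced with
ordinary std::chrono instrumentation around the call in fault_range().
The wrapper below is a generic sketch, not the actual BlueStore code
(which would normally use its perf counters); the names in the usage
comment are illustrative:

  // Generic sketch of timing a hot call such as ExtentMap::decode_some(),
  // in the spirit of the ad-hoc prints described above.
  #include <chrono>
  #include <iostream>
  #include <utility>

  template <typename Fn>
  void time_call(const char* label, Fn&& fn)
  {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    std::forward<Fn>(fn)();                // the call under test
    const auto t1 = clock::now();
    const auto us =
      std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::cerr << label << " took " << us << " us" << std::endl;
  }

  // usage (names illustrative):
  //   time_call("decode_some", [&] { extent_map.decode_some(shard_bl); });

Paying that decode on every write that faults in a shard is consistent
with the roughly 2x gap between the ~20K and ~38K IOPS numbers reported
above.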
> > > > > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" > > > > in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo > > > > info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" > > > in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo > > > info at http://vger.kernel.org/majordomo-info.html > > > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html