It would be good to know if the same memory consumption deltas are visible in the various mempool pools. If not, we have some data structures that need to be mempool-ized.

Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> Sent: Thursday, December 22, 2016 3:13 PM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; Sage Weil <sweil@xxxxxxxxxx>
> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: Re: Bluestore performance bottleneck
>
> I'm compiling a new branch based on a couple of new PRs and will retest; that will probably alter the memory and CPU usage somewhat. If it's still there I'll track it down in massif and we'll see what we find.
>
> Mark
>
> On 12/22/2016 05:10 PM, Allen Samuels wrote:
> > Dramatic changes to the RSS usage due to changes in these parameters seem completely terrifying to me. It seems like something about the onode trimming logic isn't working correctly.
> >
> > Allen Samuels
> > SanDisk | a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030 | M: +1 408 780 6416
> > allen.samuels@xxxxxxxxxxx
> >
> >> -----Original Message-----
> >> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> >> Sent: Thursday, December 22, 2016 2:23 PM
> >> To: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; Sage Weil <sweil@xxxxxxxxxx>
> >> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> >> Subject: Re: Bluestore performance bottleneck
> >>
> >> Hi Somnath,
> >>
> >> Based on your testing, I went through and did some single-OSD tests with master (pre-extent patch) with different sharding target/max settings on one of our NVMe nodes:
> >>
> >> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZb2ZWcVZHbzJRVHc
> >>
> >> What I saw is that for 4k min_alloc/max_alloc/max_blob sizes, decreasing the sharding target/max helped to a point where it started hurting more than it helped. The peak is probably somewhere between 100/200 and 200/400, though we may want to err toward higher values rather than lower. RSS memory usage of the OSD increased dramatically as the target/max sizes shrank. CPU usage didn't change dramatically, though it was a little lower at the extremes where performance was lowest.
> >>
> >> For reference, 16k min_alloc pegs at around 20K IOPS in this test as well, meaning that I think we may be hitting a common bottleneck holding us to 20K write IOPS per OSD.
> >>
> >> I noticed that as the target/max size shrank, certain code paths became more heavily exercised, however. RocksDB generally took about a 2x larger percentage of the used CPU, with a lot of it going toward CRC calculations. We also spent a lot more time in BlueStore::ExtentMap::init_shards doing key appends, and in trimming the TwoQCache. Given that the IOPS dropped precipitously while overall CPU usage remained high and memory usage increased dramatically, there may be some opportunities to tune these areas of the code. One example might be to avoid doing string appends in the key encoding by switching to a different data structure.
> >>
> >> FWIW, I did not notice any resharding during the steady state for any of these tests.
> >>
> >> Mark
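As a concrete illustration of the key-encoding suggestion above, here is a minimal sketch of building a shard key with a single up-front reservation instead of growing a std::string through repeated appends. The helper names and key layout are hypothetical and only loosely mirror the O-prefix keys seen in the transaction dumps below; this is not BlueStore's actual key-encoding code.

    #include <cstdint>
    #include <string>

    // Hypothetical helper: append a fixed-width big-endian u64 to a key buffer.
    static inline void append_u64_be(std::string& key, uint64_t v) {
      char buf[8];
      for (int i = 7; i >= 0; --i) {
        buf[i] = static_cast<char>(v & 0xff);
        v >>= 8;
      }
      key.append(buf, sizeof(buf));
    }

    // Build a per-shard key with one allocation up front, so the hot path
    // does not reallocate and copy on each incremental append.
    std::string make_shard_key(const std::string& onode_key, uint64_t shard_offset) {
      std::string key;
      key.reserve(onode_key.size() + sizeof(uint64_t) + 1);
      key.append(onode_key);
      append_u64_be(key, shard_offset);
      key.push_back('x');  // shard-key suffix tag (illustrative)
      return key;
    }

Whether a reserve() call, a fixed stack buffer, or a different container wins in practice would need to be measured in init_shards itself; the point is only to take the repeated reallocation out of the per-shard loop.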
> >>
> >> On 12/21/2016 08:25 PM, Somnath Roy wrote:
> >>> << How many blobs are in each shard, and how many shards are there?
> >>> Is there any easy way to find these out other than adding some logging?
> >>>
> >>> -----Original Message-----
> >>> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> >>> Sent: Wednesday, December 21, 2016 5:30 PM
> >>> To: Somnath Roy
> >>> Cc: ceph-devel
> >>> Subject: RE: Bluestore performance bottleneck
> >>>
> >>> How many blobs are in each shard, and how many shards are there?
> >>>
> >>> If we go this route, I think we'll want a larger threshold for the inline blobs (stored in the onode key) so that "normal" objects without a zillion blobs still fit in one key...
> >>>
> >>> sage
> >>>
> >>> On Thu, 22 Dec 2016, Somnath Roy wrote:
> >>>
> >>>> Ok, *205 bytes* reduction per IO by removing extents.. Thanks!
> >>>>
> >>>> 2016-12-21 20:00:07.701845 7fcff8412700 30 submit_transaction Rocksdb transaction:
> >>>> Put( Prefix = M key = 0x00000000000006fc'.0000000009.00000000000000001051' Value size = 182)
> >>>> Put( Prefix = M key = 0x00000000000006fc'._fastinfo' Value size = 186)
> >>>> Put( Prefix = O key = 0x7f8000000000000001b56f21ae217262'd_data.10046b8b4567.00000000000028c6!='0xfffffffffffffffeffffffffffffffff6f0005f000'x' Value size = 45)
> >>>> Put( Prefix = O key = 0x7f8000000000000001b56f21ae217262'd_data.10046b8b4567.00000000000028c6!='0xfffffffffffffffeffffffffffffffff'o' Value size = 1300)
> >>>> Merge( Prefix = b key = 0x0000000c8cd00000 Value size = 16)
> >>>> Merge( Prefix = b key = 0x0000001067700000 Value size = 16)
> >>>>
> >>>> -----Original Message-----
> >>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> >>>> Sent: Wednesday, December 21, 2016 4:39 PM
> >>>> To: Sage Weil
> >>>> Cc: ceph-devel
> >>>> Subject: RE: Bluestore performance bottleneck
> >>>>
> >>>> Yeah, makes sense; I missed it. I will remove extents and see how much we can save.
> >>>> But why a 4K length/offset write has started touching 2 shards now that shards are smaller is still unclear to me.
> >>>>
> >>>> Thanks & Regards
> >>>> Somnath
> >>>>
> >>>> -----Original Message-----
> >>>> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> >>>> Sent: Wednesday, December 21, 2016 4:21 PM
> >>>> To: Somnath Roy
> >>>> Cc: ceph-devel
> >>>> Subject: RE: Bluestore performance bottleneck
> >>>>
> >>>> On Thu, 22 Dec 2016, Somnath Roy wrote:
> >>>>> Sage,
> >>>>> By reducing the shard size I am able to improve BlueStore + RocksDB performance by 80% for a 60G image. Will do detailed analysis on bigger images.
> >>>>>
> >>>>> Here is what I changed to reduce the decode_some() overhead. It is now looping 5 times instead of the default 33.
> >>>>>
> >>>>> bluestore_extent_map_shard_max_size = 50
> >>>>> bluestore_extent_map_shard_target_size = 45
> >>>>>
> >>>>> Fio output:
> >>>>> -----------
> >>>>> Default:
> >>>>> --------
> >>>>>
> >>>>> rbd_iodepth32: (groupid=0, jobs=5): err= 0: pid=34420: Wed Dec 21 17:30:50 2016
> >>>>>   write: io=114208MB, bw=63615KB/s, iops=15903, runt=1838384msec
> >>>>>     slat (usec): min=3, max=3878, avg=15.63, stdev= 8.78
> >>>>>     clat (usec): min=618, max=176057, avg=20092.56, stdev=21017.34
> >>>>>      lat (usec): min=642, max=176063, avg=20108.20, stdev=21017.29
> >>>>>     clat percentiles (usec):
> >>>>>      |  1.00th=[ 1416],  5.00th=[ 2928], 10.00th=[ 4320], 20.00th=[ 6112],
> >>>>>      | 30.00th=[ 7648], 40.00th=[ 9152], 50.00th=[10944], 60.00th=[13760],
> >>>>>      | 70.00th=[18816], 80.00th=[32384], 90.00th=[55552], 95.00th=[68096],
> >>>>>      | 99.00th=[87552], 99.50th=[94720], 99.90th=[121344], 99.95th=[129536],
> >>>>>      | 99.99th=[142336]
> >>>>>
> >>>>> Small shards:
> >>>>> -------------
> >>>>>
> >>>>> rbd_iodepth32: (groupid=0, jobs=5): err= 0: pid=34767: Wed Dec 21 18:50:30 2016
> >>>>>   write: io=186917MB, bw=110585KB/s, iops=27646, runt=1730819msec
> >>>>>     slat (usec): min=2, max=1447, avg=14.95, stdev= 7.64
> >>>>>     clat (usec): min=531, max=541140, avg=11547.88, stdev=8131.72
> >>>>>      lat (usec): min=544, max=541156, avg=11562.83, stdev=8131.73
> >>>>>     clat percentiles (msec):
> >>>>>      |  1.00th=[    3],  5.00th=[    4], 10.00th=[    5], 20.00th=[    6],
> >>>>>      | 30.00th=[    7], 40.00th=[    9], 50.00th=[   10], 60.00th=[   12],
> >>>>>      | 70.00th=[   14], 80.00th=[   17], 90.00th=[   22], 95.00th=[   27],
> >>>>>      | 99.00th=[   37], 99.50th=[   40], 99.90th=[   51], 99.95th=[   75],
> >>>>>      | 99.99th=[  208]
> >>>>>
> >>>>> *But* here is an overhead I am seeing that I don't quite understand: the per-IO metadata overhead for the onode/shards is ~30% higher with smaller shards.
> >>>>>
> >>>>> Default:
> >>>>> --------
> >>>>>
> >>>>> 2016-12-21 17:22:31.693503 7f18aa7c1700 30 submit_transaction Rocksdb transaction:
> >>>>> Put( Prefix = M key = 0x000000000000048a'.0000000009.00000000000000019093' Value size = 182)
> >>>>> Put( Prefix = M key = 0x000000000000048a'._fastinfo' Value size = 186)
> >>>>> Put( Prefix = O key = 0x7f8000000000000001dc3f23'`!rbd_data.10046b8b4567.00000000000024b1!='0xfffffffffffffffeffffffffffffffff6f0014c000'x' Value size = 669)
> >>>>> Put( Prefix = O key = 0x7f8000000000000001dc3f23'`!rbd_data.10046b8b4567.00000000000024b1!='0xfffffffffffffffeffffffffffffffff'o' Value size = 462)
> >>>>> Merge( Prefix = b key = 0x000000069ae00000 Value size = 16)
> >>>>> Merge( Prefix = b key = 0x0000001330d00000 Value size = 16)
> >>>>>
> >>>>> Smaller shard:
> >>>>> --------------
> >>>>>
> >>>>> 2016-12-21 18:52:18.564423 7f3e5d167700 30 submit_transaction Rocksdb transaction:
> >>>>> Put( Prefix = M key = 0x0000000000000691'.0000000009.00000000000000057248' Value size = 182)
> >>>>> Put( Prefix = M key = 0x0000000000000691'._fastinfo' Value size = 186)
> >>>>> Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff6f00195000'x' Value size = 45)
> >>>>> Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff6f0019a000'x' Value size = 45)
> >>>>> Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff'o' Value size = 1505)
> >>>>> Merge( Prefix = b key = 0x00000006b9780000 Value size = 16)
> >>>>> Merge( Prefix = b key = 0x0000000c64f00000 Value size = 16)
> >>>>>
> >>>>> *And* a lot of the time I am seeing 2 shards written, compared to the default configuration. This will be a problem for ZS; it may not be for RocksDB.
> >>>>>
> >>>>> Initially I thought blobs were spanning, but it seems that is not the case. See the log snippet below; it seems the onode itself is bigger now.
> >>>>>
> >>>>> 2016-12-21 18:40:27.734044 7f3e43934700 20 bluestore(/var/lib/ceph/osd/ceph-0) onode #1:d1bc6a86:::rbd_data.10046b8b4567.00000000000036a2:head# is 1505 (1503 bytes onode + 2 bytes spanning blobs + 0 bytes inline extents)
> >>>>>
> >>>>> Any idea what's going on?
> >>>>
> >>>> The onode has a list of the shards. Since there are more of them, the onode is bigger. I wasn't really expecting the shard count to be that high. The structure is:
> >>>>
> >>>>   struct shard_info {
> >>>>     uint32_t offset = 0;   ///< logical offset for start of shard
> >>>>     uint32_t bytes = 0;    ///< encoded bytes
> >>>>     uint32_t extents = 0;  ///< extents
> >>>>     DENC(shard_info, v, p) {
> >>>>       denc_varint(v.offset, p);
> >>>>       denc_varint(v.bytes, p);
> >>>>       denc_varint(v.extents, p);
> >>>>     }
> >>>>     void dump(Formatter *f) const;
> >>>>   };
> >>>>   vector<shard_info> extent_map_shards;  ///< extent map shards (if any)
> >>>>
> >>>> The offset is the important piece. The byte and extent counts aren't that important; they're mostly there so that a future reshard operation can be more clever (merging or splitting adjacent shards instead of resharding everything). Well, the bytes field is currently used, but extents is not used at all. We could just drop that field now and add it (or something else) back in later if/when we need it...
> >>>>
> >>>> sage
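For a rough sense of why dropping the extents field saves what it does, here is a generic LEB128-style varint encoder as a stand-in for denc_varint (illustrative only, not the actual denc code), with back-of-the-envelope byte counts in the comments; those counts are assumptions, not measurements.

    #include <cstdint>
    #include <string>

    // 7 data bits per byte, high bit set on every byte except the last.
    // Values < 128 cost 1 byte, values < 16384 cost 2 bytes, and an offset
    // anywhere within a 4 MB object costs at most 4 bytes.
    static void put_varint(std::string& out, uint64_t v) {
      while (v >= 0x80) {
        out.push_back(static_cast<char>((v & 0x7f) | 0x80));
        v >>= 7;
      }
      out.push_back(static_cast<char>(v));
    }

    // Rough per-shard_info cost under such an encoding:
    //   offset  (0 .. 4 MB)                  -> 1-4 bytes
    //   bytes   (tens to a few hundred)      -> 1-2 bytes
    //   extents (a handful per small shard)  -> 1 byte
    // Dropping `extents` therefore saves on the order of one byte per shard.
    // With roughly 200 shards per onode (a 4 MB object, 4K extents, ~5
    // extents per shard), that is consistent with the ~205-byte onode
    // reduction (1505 -> 1300 bytes) reported earlier in the thread.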
> >>>>
> >>>>> Thanks & Regards
> >>>>> Somnath
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Somnath Roy
> >>>>> Sent: Friday, December 16, 2016 7:23 PM
> >>>>> To: Sage Weil (sweil@xxxxxxxxxx)
> >>>>> Cc: 'ceph-devel'
> >>>>> Subject: RE: Bluestore performance bottleneck
> >>>>>
> >>>>> Sage,
> >>>>> Some update on this. Without decode_some() within fault_range() I am able to drive BlueStore + RocksDB to close to ~38K IOPS, compared to ~20K IOPS with decode_some(). I had to disable the data write because I am skipping decode, but on this device the data write is not a bottleneck; I have seen that enabling/disabling the data write gives similar results. So, on NVMe devices, if we can optimize decode_some() for performance, BlueStore performance should bump up by ~2x.
> >>>>> I added some prints around decode_some() and it seems to take ~60-121 microseconds to finish, depending on how many bytes it has to decode.
> >>>>>
> >>>>> Thanks & Regards
> >>>>> Somnath
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Somnath Roy
> >>>>> Sent: Thursday, December 15, 2016 7:30 PM
> >>>>> To: Sage Weil (sweil@xxxxxxxxxx)
> >>>>> Cc: ceph-devel
> >>>>> Subject: Bluestore performance bottleneck
> >>>>>
> >>>>> Sage,
> >>>>> This morning I was talking about the 2x performance drop for BlueStore (without data/db writes) between 1G and 60G volumes, and it turns out decode_some() is the culprit. I am presently drilling down into that function to identify what exactly is causing this, but most probably it is the blob decode and le->blob->get_ref() combination. Will confirm that soon. If we can fix that, we should be able to considerably bump up end-to-end peak performance with rocks/ZS on faster NVMe; on slower devices we most likely will not see any benefit other than saving some CPU cost.
> >>>>>
> >>>>> Thanks & Regards
> >>>>> Somnath
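For anyone wanting to reproduce the kind of per-call measurement described above, here is a minimal sketch of timing a decode path with std::chrono; the wrapper is a hypothetical helper, and the decode_some() call in the usage comment is only referenced, not implemented here.

    #include <chrono>
    #include <cstddef>
    #include <iostream>

    // Time one call to any decode-like function in microseconds, the same
    // kind of per-call measurement described above (~60-121 us per
    // decode_some() call, depending on how many bytes are decoded).
    template <typename DecodeFn>
    void timed_decode(DecodeFn&& decode, std::size_t encoded_bytes) {
      auto start = std::chrono::steady_clock::now();
      decode();
      auto end = std::chrono::steady_clock::now();
      auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
      std::cout << "decoded " << encoded_bytes << " bytes in " << us << " us\n";
    }

    // Usage (hypothetical): timed_decode([&] { extent_map.decode_some(bl); }, bl.length());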
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html