RE: Bluestore performance bottleneck


 



It would be good to know whether the same memory consumption deltas are visible in the various mempools. If not, we have some data structures that still need to be mempool-ized.
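
As a rough illustration of what "mempool-ized" means here (a sketch of the idea only, not the actual Ceph mempool API; the pool and type names below are hypothetical): the container's allocations get routed through a pool-tagged allocator, so its bytes show up in the per-pool accounting instead of only in RSS.

    // Sketch only -- not the real Ceph mempool API.
    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <new>
    #include <utility>

    struct pool_t {
      // Running byte total for this pool, i.e. what a per-pool memory dump
      // would report.
      std::atomic<std::ptrdiff_t> allocated_bytes{0};
    };

    static pool_t bluestore_cache_other;  // hypothetical pool instance

    // Minimal allocator that charges/credits its pool on every allocation.
    template <typename T>
    struct pool_allocator {
      using value_type = T;
      pool_t* pool;
      explicit pool_allocator(pool_t* p) : pool(p) {}
      template <typename U>
      pool_allocator(const pool_allocator<U>& o) : pool(o.pool) {}
      T* allocate(std::size_t n) {
        pool->allocated_bytes += static_cast<std::ptrdiff_t>(n * sizeof(T));
        return static_cast<T*>(::operator new(n * sizeof(T)));
      }
      void deallocate(T* p, std::size_t n) {
        pool->allocated_bytes -= static_cast<std::ptrdiff_t>(n * sizeof(T));
        ::operator delete(p);
      }
      template <typename U>
      bool operator==(const pool_allocator<U>& o) const { return pool == o.pool; }
      template <typename U>
      bool operator!=(const pool_allocator<U>& o) const { return pool != o.pool; }
    };

    // A map whose memory is now visible in the pool's counter rather than
    // being untracked RSS:
    //   std::map<uint32_t, uint32_t, std::less<uint32_t>,
    //            pool_allocator<std::pair<const uint32_t, uint32_t>>>
    //     m{pool_allocator<std::pair<const uint32_t, uint32_t>>(&bluestore_cache_other)};

If a structure is not allocated that way, its growth never appears in any pool's counter, which is exactly the mismatch to look for.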


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> Sent: Thursday, December 22, 2016 3:13 PM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Somnath Roy
> <Somnath.Roy@xxxxxxxxxxx>; Sage Weil <sweil@xxxxxxxxxx>
> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: Re: Bluestore performance bottleneck
> 
> I'm compiling a new branch based on a couple of new PRs that will probably
> alter the memory and CPU usage somewhat, and will retest.  If it's still
> there I'll track it down in massif and we'll see what we find.
> 
> Mark
> 
> On 12/22/2016 05:10 PM, Allen Samuels wrote:
> > Dramatic changes to the RSS usage due to changes in these parameters seem
> > completely terrifying to me. It seems like something about the onode
> > trimming logic isn't working correctly.
> >
> > Allen Samuels
> > SanDisk | a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030 | M: +1 408 780 6416
> > allen.samuels@xxxxxxxxxxx
> >
> >
> >> -----Original Message-----
> >> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> >> owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> >> Sent: Thursday, December 22, 2016 2:23 PM
> >> To: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; Sage Weil
> >> <sweil@xxxxxxxxxx>
> >> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> >> Subject: Re: Bluestore performance bottleneck
> >>
> >> Hi Somnath,
> >>
> >> Based on your testing, I went through and did some single OSD tests with
> >> master (pre-extent patch) with different sharding target/max settings on
> >> one of our NVMe nodes:
> >>
> >>
> >> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZb2ZWcVZHbzJRVHc
> >>
> >> What I saw is that for 4k min_alloc/max_alloc/max_blob sizes, decreasing
> >> the sharding target/max helped up to a point, beyond which it started
> >> hurting more than it helped.  The peak is probably somewhere between
> >> 100/200 and 200/400, though we may want to err on the side of higher
> >> values rather than lower.  RSS memory usage of the OSD increased
> >> dramatically as the target/max sizes shrank.  CPU usage didn't change
> >> dramatically, though it was a little lower at the extremes where
> >> performance was lowest.
> >>
> >> For reference, 16k min_alloc pegs at around 20K IOPS in this test as well,
> >> meaning that I think we may be hitting a common bottleneck holding us to
> >> 20K write IOPS per OSD.
> >>
> >> I noticed, however, that as the target/max size shrank, certain code paths
> >> became more heavily exercised.  RocksDB generally took about a 2x larger
> >> percentage of the used CPU, with a lot of it going toward CRC calculations.
> >> We also spent a lot more time in BlueStore::ExtentMap::init_shards doing
> >> key appends, and trimming the TwoQCache.  Given that the IOPS dropped
> >> precipitously while overall CPU usage remained high and memory usage
> >> increased dramatically, there may be some opportunities to tune these
> >> areas of the code.  One example might be to avoid doing string appends in
> >> the key encoding by switching to a different data structure.
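> >>
> >> As a rough sketch of that idea (the helper below is hypothetical, not an
> >> existing BlueStore function): size the key buffer once up front instead of
> >> growing it through repeated small appends.
> >>
> >>   #include <cstdint>
> >>   #include <cstdio>
> >>   #include <string>
> >>
> >>   // Hypothetical helper; append_shard_key() is an illustrative name.
> >>   void append_shard_key(std::string& out,
> >>                         const std::string& onode_key_prefix,
> >>                         uint32_t shard_offset) {
> >>     char suffix[16];
> >>     int n = std::snprintf(suffix, sizeof(suffix), "%08x",
> >>                           static_cast<unsigned>(shard_offset));
> >>     // Reserve the final size once so the string reallocates at most one
> >>     // time instead of on every small append.
> >>     out.reserve(out.size() + onode_key_prefix.size() + n);
> >>     out.append(onode_key_prefix);
> >>     out.append(suffix, n);
> >>   }
> >>
> >> The same effect could come from encoding into a fixed, reused
> >> per-transaction buffer rather than a fresh std::string.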
> >>
> >> FWIW, I did not notice any resharding during the steady state for any of
> >> these tests.
> >>
> >> Mark
> >>
> >> On 12/21/2016 08:25 PM, Somnath Roy wrote:
> >>> << How many blobs are in each shard, and how many shards are there?
> >>> Is there any easy way to find these out other than adding some logging?
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> >>> Sent: Wednesday, December 21, 2016 5:30 PM
> >>> To: Somnath Roy
> >>> Cc: ceph-devel
> >>> Subject: RE: Bluestore performance bottleneck
> >>>
> >>> How many blobs are in each shard, and how many shards are there?
> >>>
> >>> If we go this route, I think we'll want a larger threshold for the
> >>> inline blobs (stored in the onode key) so that "normal" objects without a
> >>> zillion blobs still fit in one key...
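> >>>
> >>> A minimal sketch of the layout decision being described (the names,
> >>> parameters, and threshold below are illustrative, not the actual
> >>> BlueStore code):
> >>>
> >>>   #include <cstddef>
> >>>
> >>>   struct shard_plan {
> >>>     bool inline_in_onode;    // whole extent map lives in the onode key
> >>>     std::size_t num_shards;  // 0 when inline
> >>>   };
> >>>
> >>>   // Hypothetical helper: keep small extent maps inline in the onode key
> >>>   // so "normal" objects stay a single KV pair; otherwise cut the encoded
> >>>   // map into roughly shard_target_size pieces, each its own shard key.
> >>>   shard_plan plan_extent_map(std::size_t encoded_bytes,
> >>>                              std::size_t inline_threshold,
> >>>                              std::size_t shard_target_size) {
> >>>     if (encoded_bytes <= inline_threshold) {
> >>>       return {true, 0};
> >>>     }
> >>>     std::size_t n =
> >>>         (encoded_bytes + shard_target_size - 1) / shard_target_size;
> >>>     return {false, n};
> >>>   }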
> >>>
> >>> sage
> >>>
> >>> On Thu, 22 Dec 2016, Somnath Roy wrote:
> >>>
> >>>> OK, a *205-byte* reduction per IO by removing extents. Thanks!
> >>>>
> >>>> 2016-12-21 20:00:07.701845 7fcff8412700 30 submit_transaction Rocksdb transaction:
> >>>> Put( Prefix = M key = 0x00000000000006fc'.0000000009.00000000000000001051' Value size = 182)
> >>>> Put( Prefix = M key = 0x00000000000006fc'._fastinfo' Value size = 186)
> >>>> Put( Prefix = O key = 0x7f8000000000000001b56f21ae217262'd_data.10046b8b4567.00000000000028c6!='0xfffffffffffffffeffffffffffffffff6f0005f000'x' Value size = 45)
> >>>> Put( Prefix = O key = 0x7f8000000000000001b56f21ae217262'd_data.10046b8b4567.00000000000028c6!='0xfffffffffffffffeffffffffffffffff'o' Value size = 1300)
> >>>> Merge( Prefix = b key = 0x0000000c8cd00000 Value size = 16)
> >>>> Merge( Prefix = b key = 0x0000001067700000 Value size = 16)
> >>>>
> >>>>
> >>>>
> >>>> -----Original Message-----
> >>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx
> >>>> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> >>>> Sent: Wednesday, December 21, 2016 4:39 PM
> >>>> To: Sage Weil
> >>>> Cc: ceph-devel
> >>>> Subject: RE: Bluestore performance bottleneck
> >>>>
> >>>> Yeah, makes sense; I missed it. I will remove extents and see how much
> >>>> we can save.
> >>>> But it is still unclear to me why a 4K length/offset has started touching
> >>>> 2 shards now that the shards are smaller.
> >>>>
> >>>> Thanks & Regards
> >>>> Somnath
> >>>>
> >>>> -----Original Message-----
> >>>> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> >>>> Sent: Wednesday, December 21, 2016 4:21 PM
> >>>> To: Somnath Roy
> >>>> Cc: ceph-devel
> >>>> Subject: RE: Bluestore performance bottleneck
> >>>>
> >>>> On Thu, 22 Dec 2016, Somnath Roy wrote:
> >>>>> Sage,
> >>>>> By reducing the shard size I am able to improve bluestore + rocksdb
> >>>>> performance by 80% for a 60G image. I will do a detailed analysis on
> >>>>> bigger images.
> >>>>>
> >>>>> Here is what I changed to reduce the decode_some() overhead. It now
> >>>>> loops 5 times instead of the default 33.
> >>>>>
> >>>>> bluestore_extent_map_shard_max_size = 50
> >>>>> bluestore_extent_map_shard_target_size = 45
> >>>>>
> >>>>> Fio output :
> >>>>> -------------
> >>>>>         Default:
> >>>>>         ----------
> >>>>>
> >>>>>         rbd_iodepth32: (groupid=0, jobs=5): err= 0: pid=34420: Wed Dec 21 17:30:50 2016
> >>>>>   write: io=114208MB, bw=63615KB/s, iops=15903, runt=1838384msec
> >>>>>     slat (usec): min=3, max=3878, avg=15.63, stdev= 8.78
> >>>>>     clat (usec): min=618, max=176057, avg=20092.56, stdev=21017.34
> >>>>>      lat (usec): min=642, max=176063, avg=20108.20, stdev=21017.29
> >>>>>     clat percentiles (usec):
> >>>>>      |  1.00th=[ 1416],  5.00th=[ 2928], 10.00th=[ 4320], 20.00th=[ 6112],
> >>>>>      | 30.00th=[ 7648], 40.00th=[ 9152], 50.00th=[10944], 60.00th=[13760],
> >>>>>      | 70.00th=[18816], 80.00th=[32384], 90.00th=[55552], 95.00th=[68096],
> >>>>>      | 99.00th=[87552], 99.50th=[94720], 99.90th=[121344], 99.95th=[129536],
> >>>>>      | 99.99th=[142336]
> >>>>>
> >>>>>              Small shards:
> >>>>>               ----------------
> >>>>>
> >>>>> rbd_iodepth32: (groupid=0, jobs=5): err= 0: pid=34767: Wed Dec 21 18:50:30 2016
> >>>>>   write: io=186917MB, bw=110585KB/s, iops=27646, runt=1730819msec
> >>>>>     slat (usec): min=2, max=1447, avg=14.95, stdev= 7.64
> >>>>>     clat (usec): min=531, max=541140, avg=11547.88, stdev=8131.72
> >>>>>      lat (usec): min=544, max=541156, avg=11562.83, stdev=8131.73
> >>>>>     clat percentiles (msec):
> >>>>>      |  1.00th=[    3],  5.00th=[    4], 10.00th=[    5], 20.00th=[    6],
> >>>>>      | 30.00th=[    7], 40.00th=[    9], 50.00th=[   10], 60.00th=[   12],
> >>>>>      | 70.00th=[   14], 80.00th=[   17], 90.00th=[   22], 95.00th=[   27],
> >>>>>      | 99.00th=[   37], 99.50th=[   40], 99.90th=[   51], 99.95th=[   75],
> >>>>>      | 99.99th=[  208]
> >>>>>
> >>>>>
> >>>>>
> >>>>> *But* here is an overhead I am seeing which I don't quite understand:
> >>>>> the per-IO metadata overhead for the onode/shards is ~30% higher with
> >>>>> the smaller shards. See below.
> >>>>>
> >>>>> Default:
> >>>>> ----------
> >>>>>
> >>>>> 2016-12-21 17:22:31.693503 7f18aa7c1700 30 submit_transaction Rocksdb transaction:
> >>>>> Put( Prefix = M key = 0x000000000000048a'.0000000009.00000000000000019093' Value size = 182)
> >>>>> Put( Prefix = M key = 0x000000000000048a'._fastinfo' Value size = 186)
> >>>>> Put( Prefix = O key = 0x7f8000000000000001dc3f23'`!rbd_data.10046b8b4567.00000000000024b1!='0xfffffffffffffffeffffffffffffffff6f0014c000'x' Value size = 669)
> >>>>> Put( Prefix = O key = 0x7f8000000000000001dc3f23'`!rbd_data.10046b8b4567.00000000000024b1!='0xfffffffffffffffeffffffffffffffff'o' Value size = 462)
> >>>>> Merge( Prefix = b key = 0x000000069ae00000 Value size = 16)
> >>>>> Merge( Prefix = b key = 0x0000001330d00000 Value size = 16)
> >>>>>
> >>>>>
> >>>>> Smaller shard:
> >>>>> -----------------
> >>>>>
> >>>>> 2016-12-21 18:52:18.564423 7f3e5d167700 30 submit_transaction Rocksdb transaction:
> >>>>> Put( Prefix = M key = 0x0000000000000691'.0000000009.00000000000000057248' Value size = 182)
> >>>>> Put( Prefix = M key = 0x0000000000000691'._fastinfo' Value size = 186)
> >>>>> Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff6f00195000'x' Value size = 45)
> >>>>> Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff6f0019a000'x' Value size = 45)
> >>>>> Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff'o' Value size = 1505)
> >>>>> Merge( Prefix = b key = 0x00000006b9780000 Value size = 16)
> >>>>> Merge( Prefix = b key = 0x0000000c64f00000 Value size = 16)
> >>>>>
> >>>>>
> >>>>> *And* a lot of the time I am seeing 2 shards written, compared to the
> >>>>> default. This will be a problem for ZS; it may not be for Rocks.
> >>>>>
> >>>>> Initially I thought blobs were spanning, but it seems that is not the
> >>>>> case. See the log snippet below; it seems the onode itself is bigger now.
> >>>>>
> >>>>> 2016-12-21 18:40:27.734044 7f3e43934700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:d1bc6a86:::rbd_data.10046b8b4567.00000000000036a2:head# is 1505 (1503 bytes onode + 2 bytes spanning blobs + 0 bytes inline extents)
> >>>>>
> >>>>> Any idea what's going on ?
> >>>>
> >>>> The onode has a list of the shards.  Since there are more, the onode is
> >>>> bigger.  I wasn't really expecting the shard count to be that high.  The
> >>>> structure is:
> >>>>
> >>>>   struct shard_info {
> >>>>     uint32_t offset = 0;  ///< logical offset for start of shard
> >>>>     uint32_t bytes = 0;   ///< encoded bytes
> >>>>     uint32_t extents = 0; ///< extents
> >>>>     DENC(shard_info, v, p) {
> >>>>       denc_varint(v.offset, p);
> >>>>       denc_varint(v.bytes, p);
> >>>>       denc_varint(v.extents, p);
> >>>>     }
> >>>>     void dump(Formatter *f) const;
> >>>>   };
> >>>>   vector<shard_info> extent_map_shards; ///< extent map shards (if any)
> >>>>
> >>>> The offset is the important piece.  The byte and extent counts aren't
> >>>> that important... they're mostly there so that a future reshard operation
> >>>> can be more clever (merging or splitting adjacent shards instead of
> >>>> resharding everything).  Well, the bytes field is currently used, but
> >>>> extents is not at all.  We could just drop that field now and add it (or
> >>>> something else) back in later if/when we need it...
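> >>>>
> >>>> To make the size effect concrete, here is a back-of-the-envelope sketch
> >>>> (the varint math is an approximation of the denc encoding above, not the
> >>>> exact on-disk format):
> >>>>
> >>>>   #include <cstddef>
> >>>>   #include <cstdint>
> >>>>
> >>>>   // Bytes for a simple 7-bits-per-byte varint (approximation).
> >>>>   std::size_t varint_size(uint64_t v) {
> >>>>     std::size_t n = 1;
> >>>>     while (v >= 0x80) { v >>= 7; ++n; }
> >>>>     return n;
> >>>>   }
> >>>>
> >>>>   // Rough onode-side cost of the shard table: one shard_info
> >>>>   // (offset, bytes, extents) per shard.  With many small shards this
> >>>>   // adds up quickly, which is consistent with the larger 'o' value seen
> >>>>   // in the logs; dropping the unused 'extents' field saves one varint
> >>>>   // per shard.
> >>>>   std::size_t shard_table_bytes(std::size_t num_shards,
> >>>>                                 uint32_t typical_offset,
> >>>>                                 uint32_t typical_bytes,
> >>>>                                 uint32_t typical_extents) {
> >>>>     return num_shards * (varint_size(typical_offset) +
> >>>>                          varint_size(typical_bytes) +
> >>>>                          varint_size(typical_extents));
> >>>>   }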
> >>>>
> >>>> sage
> >>>>
> >>>>
> >>>>>
> >>>>> Thanks & Regards
> >>>>> Somnath
> >>>>>
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Somnath Roy
> >>>>> Sent: Friday, December 16, 2016 7:23 PM
> >>>>> To: Sage Weil (sweil@xxxxxxxxxx)
> >>>>> Cc: 'ceph-devel'
> >>>>> Subject: RE: Bluestore performance bottleneck
> >>>>>
> >>>>> Sage,
> >>>>> Some update on this. Without decode_some() inside fault_range() I am
> >>>>> able to drive Bluestore + rocksdb to close to ~38K iops, compared to
> >>>>> ~20K iops with decode_some(). I had to disable the data write because I
> >>>>> am skipping the decode, but on this device the data write is not a
> >>>>> bottleneck; I have seen that enabling/disabling the data write gives
> >>>>> similar results. So, on an NVMe device, if we can optimize decode_some()
> >>>>> for performance, Bluestore performance should bump up by ~2X.
> >>>>> I added some timing prints around decode_some() and it seems to take
> >>>>> ~60-121 microseconds to finish, depending on how many bytes it has to
> >>>>> decode.
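> >>>>>
> >>>>> (A minimal sketch of that kind of timing print, assuming a std::chrono
> >>>>> scope timer; illustrative only, not the exact instrumentation used:)
> >>>>>
> >>>>>   #include <chrono>
> >>>>>   #include <iostream>
> >>>>>
> >>>>>   // Prints elapsed microseconds when it goes out of scope; wrap the
> >>>>>   // decode_some() call site in a block containing one of these.
> >>>>>   struct scope_timer_us {
> >>>>>     const char* label;
> >>>>>     std::chrono::steady_clock::time_point start =
> >>>>>         std::chrono::steady_clock::now();
> >>>>>     explicit scope_timer_us(const char* l) : label(l) {}
> >>>>>     ~scope_timer_us() {
> >>>>>       auto us = std::chrono::duration_cast<std::chrono::microseconds>(
> >>>>>           std::chrono::steady_clock::now() - start).count();
> >>>>>       std::cerr << label << " took " << us << " us" << std::endl;
> >>>>>     }
> >>>>>   };
> >>>>>
> >>>>>   // Usage (hypothetical call site):
> >>>>>   //   {
> >>>>>   //     scope_timer_us t("decode_some");
> >>>>>   //     decode_some(bl);
> >>>>>   //   }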
> >>>>>
> >>>>> Thanks & Regards
> >>>>> Somnath
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Somnath Roy
> >>>>> Sent: Thursday, December 15, 2016 7:30 PM
> >>>>> To: Sage Weil (sweil@xxxxxxxxxx)
> >>>>> Cc: ceph-devel
> >>>>> Subject: Bluestore performance bottleneck
> >>>>>
> >>>>> Sage,
> >>>>> This morning I was talking about the 2x performance drop for Bluestore
> >>>>> without data/db writes on 1G vs 60G volumes, and it turns out
> >>>>> decode_some() is the culprit. I am presently drilling down into that
> >>>>> function to identify exactly what is causing this, but most probably it
> >>>>> is the combination of the blob decode and le->blob->get_ref(). Will
> >>>>> confirm that soon. If we can fix that we should be able to considerably
> >>>>> bump up end-to-end peak performance with rocks/ZS on faster NVMe. On
> >>>>> slower devices we will most likely not see any benefit other than saving
> >>>>> some CPU cost.
> >>>>>
> >>>>> Thanks & Regards
> >>>>> Somnath
> >>>>>
> >>>>>
> >>>>>
> >>>>>



