It would be good to know if the same memory consumption deltas are visible in the various mempool pools. If not, we have some data structures that need to be mempool-ized.

Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> Sent: Thursday, December 22, 2016 3:13 PM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; Sage Weil <sweil@xxxxxxxxxx>
> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: Re: Bluestore performance bottleneck
>
> I'm compiling a new branch based on a couple of new PRs and will retest; that will probably alter the memory and CPU usage somewhat. If it's still there I'll track it down in massif and we'll see what we find.
>
> Mark
>
> On 12/22/2016 05:10 PM, Allen Samuels wrote:
> > Dramatic changes to the RSS usage due to changes in these parameters seem completely terrifying to me. It seems like something about the onode trimming logic isn't working correctly.
> >
> > Allen Samuels
> > SanDisk | a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030 | M: +1 408 780 6416
> > allen.samuels@xxxxxxxxxxx
> >
> >> -----Original Message-----
> >> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> >> Sent: Thursday, December 22, 2016 2:23 PM
> >> To: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; Sage Weil <sweil@xxxxxxxxxx>
> >> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> >> Subject: Re: Bluestore performance bottleneck
> >>
> >> Hi Somnath,
> >>
> >> Based on your testing, I went through and did some single-OSD tests with master (pre-extent patch) with different sharding target/max settings on one of our NVMe nodes:
> >>
> >> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZb2ZWcVZHbzJRVHc
> >>
> >> What I saw is that for 4k min_alloc/max_alloc/max_blob sizes, decreasing the sharding target/max helped to a point where it started hurting more than it helped. The peak is probably somewhere between 100/200 and 200/400, though we may want to err toward higher values rather than lower. RSS memory usage of the OSD increased dramatically as the target/max sizes shrank. CPU usage didn't change dramatically, though it was a little lower at the extremes where performance was lowest.
> >>
> >> For reference, 16k min_alloc pegs at around 20K IOPS in this test as well, meaning that I think we may be hitting a common bottleneck holding us to 20K write IOPS per OSD.
> >>
> >> I noticed that as the target/max size shrank, certain code paths became more heavily exercised, however. RocksDB generally took about a 2x larger percentage of the used CPU, with a lot of it going toward CRC calculations. We also spent a lot more time in BlueStore::ExtentMap::init_shards doing key appends, and in trimming the TwoQCache. Given that the IOPS dropped precipitously while overall CPU usage remained high and memory usage increased dramatically, there may be some opportunities to tune these areas of the code. One example might be to avoid doing string appends in the key encoding by switching to a different data structure.
> >>
> >> FWIW, I did not notice any resharding during the steady state for any of these tests.
> >>
> >> Mark
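As a concrete illustration of the key-encoding suggestion above, here is a minimal sketch of building a shard key with a single up-front reservation instead of growing a std::string through repeated appends. The helper names and key layout are hypothetical and only loosely mirror the O-prefix keys seen in the transaction dumps below; this is not BlueStore's actual key-encoding code.

    #include <cstdint>
    #include <string>

    // Hypothetical helper: append a fixed-width big-endian u64 to a key buffer.
    static inline void append_u64_be(std::string& key, uint64_t v) {
      char buf[8];
      for (int i = 7; i >= 0; --i) {
        buf[i] = static_cast<char>(v & 0xff);
        v >>= 8;
      }
      key.append(buf, sizeof(buf));
    }

    // Build a per-shard key with one allocation up front, so the hot path
    // does not reallocate and copy on each incremental append.
    std::string make_shard_key(const std::string& onode_key, uint64_t shard_offset) {
      std::string key;
      key.reserve(onode_key.size() + sizeof(uint64_t) + 1);
      key.append(onode_key);
      append_u64_be(key, shard_offset);
      key.push_back('x');  // shard-key suffix tag (illustrative)
      return key;
    }

Whether a reserve() call, a fixed stack buffer, or a different container wins in practice would need to be measured in init_shards itself; the point is only to take the repeated reallocation out of the per-shard loop.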
> >>
> >> On 12/21/2016 08:25 PM, Somnath Roy wrote:
> >>> << How many blobs are in each shard, and how many shards are there?
> >>> Is there any easy way to find these out other than adding some logging?
> >>>
> >>> -----Original Message-----
> >>> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> >>> Sent: Wednesday, December 21, 2016 5:30 PM
> >>> To: Somnath Roy
> >>> Cc: ceph-devel
> >>> Subject: RE: Bluestore performance bottleneck
> >>>
> >>> How many blobs are in each shard, and how many shards are there?
> >>>
> >>> If we go this route, I think we'll want a larger threshold for the inline blobs (stored in the onode key) so that "normal" objects without a zillion blobs still fit in one key...
> >>>
> >>> sage
> >>>
> >>> On Thu, 22 Dec 2016, Somnath Roy wrote:
> >>>
> >>>> Ok, *205 bytes* reduction per IO by removing extents.. Thanks!
> >>>>
> >>>> 2016-12-21 20:00:07.701845 7fcff8412700 30 submit_transaction Rocksdb transaction:
> >>>> Put( Prefix = M key = 0x00000000000006fc'.0000000009.00000000000000001051' Value size = 182)
> >>>> Put( Prefix = M key = 0x00000000000006fc'._fastinfo' Value size = 186)
> >>>> Put( Prefix = O key = 0x7f8000000000000001b56f21ae217262'd_data.10046b8b4567.00000000000028c6!='0xfffffffffffffffeffffffffffffffff6f0005f000'x' Value size = 45)
> >>>> Put( Prefix = O key = 0x7f8000000000000001b56f21ae217262'd_data.10046b8b4567.00000000000028c6!='0xfffffffffffffffeffffffffffffffff'o' Value size = 1300)
> >>>> Merge( Prefix = b key = 0x0000000c8cd00000 Value size = 16)
> >>>> Merge( Prefix = b key = 0x0000001067700000 Value size = 16)
> >>>>
> >>>> -----Original Message-----
> >>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> >>>> Sent: Wednesday, December 21, 2016 4:39 PM
> >>>> To: Sage Weil
> >>>> Cc: ceph-devel
> >>>> Subject: RE: Bluestore performance bottleneck
> >>>>
> >>>> Yeah, makes sense; I missed it. I will remove extents and see how much we can save.
> >>>> But why a 4K length/offset write has started touching 2 shards now that shards are smaller is still unclear to me.
> >>>>
> >>>> Thanks & Regards
> >>>> Somnath
> >>>>
> >>>> -----Original Message-----
> >>>> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> >>>> Sent: Wednesday, December 21, 2016 4:21 PM
> >>>> To: Somnath Roy
> >>>> Cc: ceph-devel
> >>>> Subject: RE: Bluestore performance bottleneck
> >>>>
> >>>> On Thu, 22 Dec 2016, Somnath Roy wrote:
> >>>>> Sage,
> >>>>> By reducing the shard size I am able to improve BlueStore + RocksDB performance by 80% for a 60G image. Will do detailed analysis on bigger images.
> >>>>>
> >>>>> Here is what I changed to reduce the decode_some() overhead. It is now looping 5 times instead of the default 33.
> >>>>>
> >>>>> bluestore_extent_map_shard_max_size = 50
> >>>>> bluestore_extent_map_shard_target_size = 45
> >>>>>
> >>>>> Fio output:
> >>>>> -----------
> >>>>> Default:
> >>>>> --------
> >>>>>
> >>>>> rbd_iodepth32: (groupid=0, jobs=5): err= 0: pid=34420: Wed Dec 21 17:30:50 2016
> >>>>>   write: io=114208MB, bw=63615KB/s, iops=15903, runt=1838384msec
> >>>>>     slat (usec): min=3, max=3878, avg=15.63, stdev= 8.78
> >>>>>     clat (usec): min=618, max=176057, avg=20092.56, stdev=21017.34
> >>>>>      lat (usec): min=642, max=176063, avg=20108.20, stdev=21017.29
> >>>>>     clat percentiles (usec):
> >>>>>      |  1.00th=[ 1416],  5.00th=[ 2928], 10.00th=[ 4320], 20.00th=[ 6112],
> >>>>>      | 30.00th=[ 7648], 40.00th=[ 9152], 50.00th=[10944], 60.00th=[13760],
> >>>>>      | 70.00th=[18816], 80.00th=[32384], 90.00th=[55552], 95.00th=[68096],
> >>>>>      | 99.00th=[87552], 99.50th=[94720], 99.90th=[121344], 99.95th=[129536],
> >>>>>      | 99.99th=[142336]
> >>>>>
> >>>>> Small shards:
> >>>>> -------------
> >>>>>
> >>>>> rbd_iodepth32: (groupid=0, jobs=5): err= 0: pid=34767: Wed Dec 21 18:50:30 2016
> >>>>>   write: io=186917MB, bw=110585KB/s, iops=27646, runt=1730819msec
> >>>>>     slat (usec): min=2, max=1447, avg=14.95, stdev= 7.64
> >>>>>     clat (usec): min=531, max=541140, avg=11547.88, stdev=8131.72
> >>>>>      lat (usec): min=544, max=541156, avg=11562.83, stdev=8131.73
> >>>>>     clat percentiles (msec):
> >>>>>      |  1.00th=[    3],  5.00th=[    4], 10.00th=[    5], 20.00th=[    6],
> >>>>>      | 30.00th=[    7], 40.00th=[    9], 50.00th=[   10], 60.00th=[   12],
> >>>>>      | 70.00th=[   14], 80.00th=[   17], 90.00th=[   22], 95.00th=[   27],
> >>>>>      | 99.00th=[   37], 99.50th=[   40], 99.90th=[   51], 99.95th=[   75],
> >>>>>      | 99.99th=[  208]
> >>>>>
> >>>>> *But* here is an overhead I am seeing that I don't quite understand: the per-IO metadata overhead for the onode/shards is ~30% higher with smaller shards.
> >>>>>
> >>>>> Default:
> >>>>> --------
> >>>>>
> >>>>> 2016-12-21 17:22:31.693503 7f18aa7c1700 30 submit_transaction Rocksdb transaction:
> >>>>> Put( Prefix = M key = 0x000000000000048a'.0000000009.00000000000000019093' Value size = 182)
> >>>>> Put( Prefix = M key = 0x000000000000048a'._fastinfo' Value size = 186)
> >>>>> Put( Prefix = O key = 0x7f8000000000000001dc3f23'`!rbd_data.10046b8b4567.00000000000024b1!='0xfffffffffffffffeffffffffffffffff6f0014c000'x' Value size = 669)
> >>>>> Put( Prefix = O key = 0x7f8000000000000001dc3f23'`!rbd_data.10046b8b4567.00000000000024b1!='0xfffffffffffffffeffffffffffffffff'o' Value size = 462)
> >>>>> Merge( Prefix = b key = 0x000000069ae00000 Value size = 16)
> >>>>> Merge( Prefix = b key = 0x0000001330d00000 Value size = 16)
> >>>>>
> >>>>> Smaller shard:
> >>>>> --------------
> >>>>>
> >>>>> 2016-12-21 18:52:18.564423 7f3e5d167700 30 submit_transaction Rocksdb transaction:
> >>>>> Put( Prefix = M key = 0x0000000000000691'.0000000009.00000000000000057248' Value size = 182)
> >>>>> Put( Prefix = M key = 0x0000000000000691'._fastinfo' Value size = 186)
> >>>>> Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff6f00195000'x' Value size = 45)
> >>>>> Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff6f0019a000'x' Value size = 45)
> >>>>> Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff'o' Value size = 1505)
> >>>>> Merge( Prefix = b key = 0x00000006b9780000 Value size = 16)
> >>>>> Merge( Prefix = b key = 0x0000000c64f00000 Value size = 16)
> >>>>>
> >>>>> *And* a lot of the time I am seeing 2 shards written, compared to the default configuration. This will be a problem for ZS; it may not be for RocksDB.
> >>>>>
> >>>>> Initially I thought blobs were spanning, but it seems that is not the case. See the log snippet below; it seems the onode itself is bigger now.
> >>>>>
> >>>>> 2016-12-21 18:40:27.734044 7f3e43934700 20 bluestore(/var/lib/ceph/osd/ceph-0) onode #1:d1bc6a86:::rbd_data.10046b8b4567.00000000000036a2:head# is 1505 (1503 bytes onode + 2 bytes spanning blobs + 0 bytes inline extents)
> >>>>>
> >>>>> Any idea what's going on?
> >>>>
> >>>> The onode has a list of the shards. Since there are more of them, the onode is bigger. I wasn't really expecting the shard count to be that high. The structure is:
> >>>>
> >>>>   struct shard_info {
> >>>>     uint32_t offset = 0;   ///< logical offset for start of shard
> >>>>     uint32_t bytes = 0;    ///< encoded bytes
> >>>>     uint32_t extents = 0;  ///< extents
> >>>>     DENC(shard_info, v, p) {
> >>>>       denc_varint(v.offset, p);
> >>>>       denc_varint(v.bytes, p);
> >>>>       denc_varint(v.extents, p);
> >>>>     }
> >>>>     void dump(Formatter *f) const;
> >>>>   };
> >>>>   vector<shard_info> extent_map_shards;  ///< extent map shards (if any)
> >>>>
> >>>> The offset is the important piece. The byte and extent counts aren't that important; they're mostly there so that a future reshard operation can be more clever (merging or splitting adjacent shards instead of resharding everything). Well, the bytes field is currently used, but extents is not used at all. We could just drop that field now and add it (or something else) back in later if/when we need it...
> >>>>
> >>>> sage
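For a rough sense of why dropping the extents field saves what it does, here is a generic LEB128-style varint encoder as a stand-in for denc_varint (illustrative only, not the actual denc code), with back-of-the-envelope byte counts in the comments; those counts are assumptions, not measurements.

    #include <cstdint>
    #include <string>

    // 7 data bits per byte, high bit set on every byte except the last.
    // Values < 128 cost 1 byte, values < 16384 cost 2 bytes, and an offset
    // anywhere within a 4 MB object costs at most 4 bytes.
    static void put_varint(std::string& out, uint64_t v) {
      while (v >= 0x80) {
        out.push_back(static_cast<char>((v & 0x7f) | 0x80));
        v >>= 7;
      }
      out.push_back(static_cast<char>(v));
    }

    // Rough per-shard_info cost under such an encoding:
    //   offset  (0 .. 4 MB)                  -> 1-4 bytes
    //   bytes   (tens to a few hundred)      -> 1-2 bytes
    //   extents (a handful per small shard)  -> 1 byte
    // Dropping `extents` therefore saves on the order of one byte per shard.
    // With roughly 200 shards per onode (a 4 MB object, 4K extents, ~5
    // extents per shard), that is consistent with the ~205-byte onode
    // reduction (1505 -> 1300 bytes) reported earlier in the thread.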
> >>>>
> >>>>> Thanks & Regards
> >>>>> Somnath
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Somnath Roy
> >>>>> Sent: Friday, December 16, 2016 7:23 PM
> >>>>> To: Sage Weil (sweil@xxxxxxxxxx)
> >>>>> Cc: 'ceph-devel'
> >>>>> Subject: RE: Bluestore performance bottleneck
> >>>>>
> >>>>> Sage,
> >>>>> Some update on this. Without decode_some() within fault_range() I am able to drive BlueStore + RocksDB to close to ~38K IOPS, compared to ~20K IOPS with decode_some(). I had to disable the data write because I am skipping decode, but on this device the data write is not a bottleneck; I have seen that enabling/disabling the data write gives similar results. So, on NVMe devices, if we can optimize decode_some() for performance, BlueStore performance should bump up by ~2x.
> >>>>> I added some prints around decode_some() and it seems to take ~60-121 microseconds to finish, depending on how many bytes it has to decode.
> >>>>>
> >>>>> Thanks & Regards
> >>>>> Somnath
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Somnath Roy
> >>>>> Sent: Thursday, December 15, 2016 7:30 PM
> >>>>> To: Sage Weil (sweil@xxxxxxxxxx)
> >>>>> Cc: ceph-devel
> >>>>> Subject: Bluestore performance bottleneck
> >>>>>
> >>>>> Sage,
> >>>>> This morning I was talking about the 2x performance drop for BlueStore (without data/db writes) between 1G and 60G volumes, and it turns out decode_some() is the culprit. I am presently drilling down into that function to identify what exactly is causing this, but most probably it is the blob decode and le->blob->get_ref() combination. Will confirm that soon. If we can fix that, we should be able to considerably bump up end-to-end peak performance with rocks/ZS on faster NVMe; on slower devices we most likely will not see any benefit other than saving some CPU cost.
> >>>>>
> >>>>> Thanks & Regards
> >>>>> Somnath
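For anyone wanting to reproduce the kind of per-call measurement described above, here is a minimal sketch of timing a decode path with std::chrono; the wrapper is a hypothetical helper, and the decode_some() call in the usage comment is only referenced, not implemented here.

    #include <chrono>
    #include <cstddef>
    #include <iostream>

    // Time one call to any decode-like function in microseconds, the same
    // kind of per-call measurement described above (~60-121 us per
    // decode_some() call, depending on how many bytes are decoded).
    template <typename DecodeFn>
    void timed_decode(DecodeFn&& decode, std::size_t encoded_bytes) {
      auto start = std::chrono::steady_clock::now();
      decode();
      auto end = std::chrono::steady_clock::now();
      auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
      std::cout << "decoded " << encoded_bytes << " bytes in " << us << " us\n";
    }

    // Usage (hypothetical): timed_decode([&] { extent_map.decode_some(bl); }, bl.length());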
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html