Decode_some won't care how many shards there are, it will only care about the size of a shard. But if the number of shards is small enough to be inlined in the oNode that will definitely explain what we're seeing. Do we know what the shard-counts are for 16KMinAlloc? Allen Samuels SanDisk |a Western Digital brand 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx > -----Original Message----- > From: Somnath Roy > Sent: Sunday, December 25, 2016 10:38 AM > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Mark Nelson > <mnelson@xxxxxxxxxx>; Sage Weil <sweil@xxxxxxxxxx> > Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx> > Subject: RE: Bluestore performance bottleneck REVISITED > > I think one of the reason 16K is doing better is the very less decode_some() > overhead in case of 16K because in case of 16K min_alloc there is hardly any > shard. > Also, what I am seeing (and digging more) that read overhead (onode, shard) > during write is now having an impact and it is even more while compaction is > on. For 16K, there will be one read (only onode) vs 4K min_alloc where there > will be two reads (onode, shard). > > Thanks & Regards > Somnath > > > -----Original Message----- > From: Allen Samuels > Sent: Saturday, December 24, 2016 11:45 PM > To: Mark Nelson; Sage Weil > Cc: Somnath Roy; ceph-devel > Subject: RE: Bluestore performance bottleneck REVISITED > > (This is the first time, I've dug into the details of the RocksDB stats, so it's > possible I'm misinterpreting) > > Even more interesting. > > The 4K and 16K runs both have the same total amount of compaction traffic. I > look at the "sum" row and the Read(GB) and Write(GB) columns and I note > that they're almost exactly the same for the 4K and 16K runs. > > If this is correct, then the hypothesis that the 16K Minalloc is faster due to > the reduced cost of the compaction because of smaller metadata size is > simply false. > > It is true that the static size of the metadata is much smaller for the 4K run > (8.3 GB vs. 2.5GB), but this simply doesn't explain the observed results. > > When you factor into the equation the higher ingest for the 16K case > [ingesting the data and the extra transaction to undo it] (91GB vs 190GB) > with the same amount of compaction (~145GB) traffic, then we need to start > thinking about another explanation for the observed data. > > I'm still digging into the numbers, but I do notice that the # of writes / sync > are radically different (4.5 writes/sync vs. 31.2 writes/sync), I assume this is > effectively the same as the I/O size. I believe that somehow the system is > doing a much better job of batching the transactions for 16K vs 4K and what > we're really seeing as the performance delta between them is related to this. > > Mark --> Can we look at the Linux iostat data for this, in particular can you > look at the average write I/O size? And the # of interrupts / front-end I/O > operation? > > > Allen Samuels > SanDisk |a Western Digital brand > 2880 Junction Avenue, San Jose, CA 95134 > T: +1 408 801 7030| M: +1 408 780 6416 > allen.samuels@xxxxxxxxxxx > > > > -----Original Message----- > > From: Mark Nelson [mailto:mnelson@xxxxxxxxxx] > > Sent: Friday, December 23, 2016 12:05 PM > > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil > > <sweil@xxxxxxxxxx> > > Cc: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; ceph-devel <ceph- > > devel@xxxxxxxxxxxxxxx> > > Subject: Re: Bluestore performance bottleneck > > > > > > > > On 12/23/2016 01:33 PM, Allen Samuels wrote: > > > The two data points you mention (4K / 16K MinAlloc) yield > > > interesting > > numbers. For 4K, you're seeing 22.5K IOPS at 1300% CPU or 1.7K IOPS / > core. > > Yet for 16K you're seeing 25K IOPS at 1000% CPU or 2.5 K IOPS/Core. > > Yet, we know that in the main I/O path that the 16K is doing more work > > (since it's double-writing the data), but is yielding better CPU usage > > overall. We do know that there will be a reduction of compaction for > > the 16K case which will save SOME CPU, but I wouldn't have thought > > that this would be substantial since the data is all processed > > sequentially in rather large blocks (i.e., the CPU cost of compaction seems > to be larger than expected). > > > > > > Do we know that you're actually capturing a few compaction cycles > > > with the > > 16K test? If not, that might explain some of the difference. > > > > I believe so. Here is a comparison of the 25/50 tests for example. > > Interesting that there's so much more data compacted in the 4K > > min_alloc tests. > > > > 4k min_alloc: > > > > > 2016-12-22 19:33:49.722025 7fb188f21700 3 rocksdb: ------- DUMPING > > > STATS ------- > > > 2016-12-22 19:33:49.722029 7fb188f21700 3 rocksdb: > > > ** Compaction Stats [default] ** > > > Level Files Size(MB} Score Read(GB} Rn(GB} Rnp1(GB} Write(GB} > > Wnew(GB} Moved(GB} W-Amp Rd(MB/s} Wr(MB/s} Comp(sec} Comp(cnt} > > Avg(sec} KeyIn KeyDrop > > > -------------------------------------------------------------------- > > > ----------------------- > > --------------------------------------------------------------- > > > L0 8/0 1440.37 2.0 0.0 0.0 0.0 63.1 63.1 0.0 0.0 0.0 > > 183.8 352 400 0.880 0 0 > > > L1 15/0 881.27 3.4 82.2 61.7 20.5 30.8 10.4 0.0 0.5 > 128.5 > > 48.2 655 38 17.235 462M 25M > > > L2 95/0 5538.28 2.2 55.4 9.0 46.4 51.8 5.3 0.5 5.8 54.3 > > 50.7 1045 136 7.683 1238M 33M > > > L3 7/0 458.47 0.0 0.5 0.4 0.1 0.5 0.4 0.0 1.1 59.5 > > 59.5 9 7 1.259 12M 1 > > > Sum 125/0 8318.40 0.0 138.1 71.2 67.0 146.3 79.3 0.5 2.3 > > 68.6 72.7 2061 581 3.547 1712M 58M > > > Int 0/0 0.00 0.0 66.2 38.5 27.6 65.6 38.0 0.0 1.9 87.5 > > 86.8 774 257 3.013 556M 39M > > > Uptime(secs): 1953.4 total, 1953.4 interval > > > Flush(GB): cumulative 63.137, interval 35.154 > > > AddFile(GB): cumulative 0.000, interval 0.000 AddFile(Total Files): > > > cumulative 0, interval 0 > > > AddFile(L0 Files): cumulative 0, interval 0 > > > AddFile(Keys): cumulative 0, interval 0 Cumulative compaction: > > > 146.26 GB write, 76.67 MB/s write, 138.13 GB read, 72.41 MB/s read, > > > 2060.5 seconds Interval compaction: 65.64 GB write, 34.41 MB/s > > > write, 66.16 GB read, 34.68 MB/s read, 774.4 seconds > > > Stalls(count): 11 level0_slowdown, 11 > > > level0_slowdown_with_compaction, > > > 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for > > > pending_compaction_bytes, 0 slowdown for > pending_compaction_bytes, > > 0 > > > memtable_compaction, 0 memtable_slowdown, interval 7 total count > > > > > > ** DB Stats ** > > > Uptime(secs): 1953.4 total, 603.3 interval Cumulative writes: 18M > > > writes, 251M keys, 18M commit groups, 1.0 writes per commit group, > > > ingest: 91.83 GB, 48.14 MB/s Cumulative WAL: 18M writes, 4111K > > > syncs, > > > 4.52 writes per sync, written: 91.83 GB, 48.14 MB/s Cumulative stall: > > > 00:01:7.797 H:M:S, 3.5 percent Interval writes: 10M writes, 69M > > > keys, 10M commit groups, 1.0 writes per commit group, ingest: > > > 48121.62 MB, > > > 79.77 MB/s Interval WAL: 10M writes, 2170K syncs, 4.99 writes per > > > sync, written: 46.99 MB, 79.77 MB/s Interval stall: 00:00:20.024 > > > H:M:S, 3.3 percent > > > > > > 16k min_alloc: > > > > > 2016-12-23 10:20:03.926747 7fef2993d700 3 rocksdb: ------- DUMPING > > > STATS ------- > > > 2016-12-23 10:20:03.926754 7fef2993d700 3 rocksdb: > > > ** Compaction Stats [default] ** > > > Level Files Size(MB} Score Read(GB} Rn(GB} Rnp1(GB} Write(GB} > > Wnew(GB} Moved(GB} W-Amp Rd(MB/s} Wr(MB/s} Comp(sec} Comp(cnt} > > Avg(sec} KeyIn KeyDrop > > > -------------------------------------------------------------------- > > > ----------------------- > > --------------------------------------------------------------- > > > L0 3/0 186.38 0.8 0.0 0.0 0.0 49.4 49.4 0.0 0.0 0.0 > > 179.0 283 805 0.351 0 0 > > > L1 13/0 336.75 1.4 80.5 49.2 31.3 41.4 10.2 0.0 0.8 > 139.7 > > 71.9 590 135 4.371 399M 53M > > > L2 33/0 1933.96 0.8 62.4 9.4 53.0 54.5 1.4 0.4 5.8 72.0 > > 62.8 887 145 6.120 1039M 70M > > > Sum 49/0 2457.09 0.0 142.9 58.6 84.3 145.3 61.0 0.4 2.9 > > 83.1 84.5 1760 1085 1.622 1438M 123M > > > Int 0/0 0.00 0.0 61.6 25.1 36.5 61.5 25.0 0.0 2.9 87.6 > > 87.4 720 466 1.545 586M 56M > > > Uptime(secs): 1951.3 total, 1951.3 interval > > > Flush(GB): cumulative 49.411, interval 21.131 > > > AddFile(GB): cumulative 0.000, interval 0.000 AddFile(Total Files): > > > cumulative 0, interval 0 > > > AddFile(L0 Files): cumulative 0, interval 0 > > > AddFile(Keys): cumulative 0, interval 0 Cumulative compaction: > > > 145.30 GB write, 76.25 MB/s write, 142.90 GB read, 74.99 MB/s read, > > > 1760.2 seconds Interval compaction: 61.47 GB write, 32.26 MB/s > > > write, 61.59 GB read, 32.32 MB/s read, 720.0 seconds > > > Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, > > > 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for > > > pending_compaction_bytes, 0 slowdown for > pending_compaction_bytes, > > 0 > > > memtable_compaction, 0 memtable_slowdown, interval 0 total count > > > > > > ** DB Stats ** > > > Uptime(secs): 1951.3 total, 604.4 interval Cumulative writes: 32M > > > writes, 260M keys, 32M commit groups, 1.0 writes per commit group, > > > ingest: 190.02 GB, 99.72 MB/s Cumulative WAL: 32M writes, 1032K > > > syncs, > > > 31.20 writes per sync, written: 190.02 GB, 99.72 MB/s Cumulative > > > stall: 00:00:0.000 H:M:S, 0.0 percent Interval writes: 14M writes, > > > 99M keys, 14M commit groups, 1.0 writes per commit group, ingest: > > > 84136.97 MB, 139.20 MB/s Interval WAL: 14M writes, 268K syncs, 52.14 > > > writes per sync, written: 82.17 MB, 139.20 MB/s Interval stall: > > > 00:00:0.000 H:M:S, 0.0 percent > > > > Mark > > > > > > > > > > > Allen Samuels > > > SanDisk |a Western Digital brand > > > 2880 Junction Avenue, San Jose, CA 95134 > > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx > > > > > > > > >> -----Original Message----- > > >> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel- > > >> owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson > > >> Sent: Friday, December 23, 2016 9:09 AM > > >> To: Sage Weil <sweil@xxxxxxxxxx> > > >> Cc: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; ceph-devel <ceph- > > >> devel@xxxxxxxxxxxxxxx> > > >> Subject: Re: Bluestore performance bottleneck > > >> > > >>>> Try this? > > >>>> https://github.com/ceph/ceph/pull/12634 > > >>> > > >>> Looks like this is most likely reducing the memory usage and > > >>> increasing performance quite a bit with smaller shard target/max > > >>> values. With > > >>> 25/50 I'm seeing more like 2.6GB RSS memory usage and around 13K > > >>> iops typically with some (likely rocksdb) stalls. I'll run > > >>> through the tests again. > > >>> > > >>> Mark > > >>> > > >> > > >> Ok, Ran through tests with both 4k and 16k min_alloc/max_alloc/blob > > sizes > > >> using master+12629+12634: > > >> > > >> > > > https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZQzdRU3B > > >> 1SGZUbDQ > > >> > > >> Performance is up in all tests and memory consumption is down > > (especially in > > >> the smaller target/max tests). It looks like 100/200 is probably > > >> the current optimal configuration on my test setup. 4K min_alloc > > >> tests hover around 22.5K IOPS with ~1300% CPU usage, and 16K > > >> min_alloc tests hover around 25K IOPs with ~1000% CPU usage. I > > >> think it will be worth spending some > > time > > >> looking at locking in the bitmap allocator given the perf traces. > > >> Beyond > > that, > > >> I'm seeing rocksdb show up quite a bit in the top CPU consuming > > functions > > >> now, especially CRC32. > > >> > > >> Mark > > >> > > >> > > >> -- > > >> To unsubscribe from this list: send the line "unsubscribe > > >> ceph-devel" in > > the > > >> body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo > info > > at > > >> http://vger.kernel.org/majordomo-info.html > > > -- > > > To unsubscribe from this list: send the line "unsubscribe > > > ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html