RE: Bluestore performance bottleneck REVISITED


 



decode_some() won't care how many shards there are; it only cares about the size of a shard. But if the number of shards is small enough to be inlined in the onode, that would definitely explain what we're seeing.

Do we know what the shard counts are for 16K MinAlloc?
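
As a rough back-of-the-envelope illustration of the kind of estimate I'm after (the 4 MB object size and the per-extent encoding cost below are assumptions for illustration only, not BlueStore's actual numbers):

    # Rough arithmetic only: how large could a fully fragmented extent map get?
    OBJECT_SIZE = 4 * 1024 * 1024          # assumed RBD object size
    BYTES_PER_EXTENT = 10                  # assumed encoded cost per extent

    for min_alloc in (4 * 1024, 16 * 1024):
        extents = OBJECT_SIZE // min_alloc
        encoded = extents * BYTES_PER_EXTENT
        print(f"{min_alloc // 1024:>2} KB min_alloc: {extents:5d} extents, "
              f"~{encoded} bytes of extent map before sharding")

At 16K the map is a quarter the size, so it needs a quarter (or fewer) of the shards and has a much better chance of fitting inline.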


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx


> -----Original Message-----
> From: Somnath Roy
> Sent: Sunday, December 25, 2016 10:38 AM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Mark Nelson
> <mnelson@xxxxxxxxxx>; Sage Weil <sweil@xxxxxxxxxx>
> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: RE: Bluestore performance bottleneck REVISITED
> 
> I think one of the reasons 16K is doing better is the much lower decode_some()
> overhead in the 16K case, because with 16K min_alloc there are hardly any
> shards.
> Also, what I am seeing (and digging into further) is that the read overhead
> (onode, shard) during writes is now having an impact, and it is even larger
> while compaction is on. For 16K there will be one read (onode only), vs. 4K
> min_alloc where there will be two reads (onode and shard).
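>
> To put a rough number on that read overhead (assuming, purely for
> illustration, that every front-end write really pays those KV reads, at the
> IOPS levels Mark reported):
>
>     # Illustrative only: assumes every front-end write incurs the KV reads above.
>     for label, iops, kv_reads_per_write in (("4K min_alloc", 22_500, 2),   # onode + shard
>                                             ("16K min_alloc", 25_000, 1)): # onode only
>         print(f"{label}: ~{iops * kv_reads_per_write:,} KV reads/s just to service writes")
>
> That would be roughly 45K KV reads/s for 4K vs. 25K/s for 16K, before any
> compaction interference.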
> 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Allen Samuels
> Sent: Saturday, December 24, 2016 11:45 PM
> To: Mark Nelson; Sage Weil
> Cc: Somnath Roy; ceph-devel
> Subject: RE: Bluestore performance bottleneck REVISITED
> 
> (This is the first time I've dug into the details of the RocksDB stats, so
> it's possible I'm misinterpreting.)
> 
> Even more interesting.
> 
> The 4K and 16K runs both have the same total amount of compaction traffic.
> Looking at the "Sum" row, the Read(GB) and Write(GB) columns are almost
> exactly the same for the 4K and 16K runs.
> 
> If this is correct, then the hypothesis that 16K MinAlloc is faster because
> smaller metadata reduces the cost of compaction is simply false.
> 
> It is true that the static size of the metadata is much larger for the 4K run
> (8.3 GB vs. 2.5 GB), but this by itself doesn't explain the observed results.
> 
> When you factor in the higher ingest for the 16K case [ingesting the data and
> the extra transaction to undo it] (91 GB for 4K vs. 190 GB for 16K), together
> with the same amount of compaction traffic (~145 GB), then we need to start
> thinking about another explanation for the observed data.
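>
> A quick sanity check of that, using only the "Cumulative compaction" and
> "Cumulative writes ... ingest" lines from the two dumps quoted below:
>
>     # Figures copied from the RocksDB dumps further down in this thread.
>     runs = {
>         "4K min_alloc":  {"ingest_gb": 91.83,  "compaction_write_gb": 146.26},
>         "16K min_alloc": {"ingest_gb": 190.02, "compaction_write_gb": 145.30},
>     }
>     for name, r in runs.items():
>         print(f"{name}: {r['compaction_write_gb'] / r['ingest_gb']:.2f} "
>               "GB compacted per GB ingested")
>
> The absolute compaction write traffic is essentially identical (146 GB vs.
> 145 GB) even though the 16K run ingested roughly twice as much.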
> 
> I'm still digging into the numbers, but I do notice that the writes per sync
> are radically different (4.5 writes/sync vs. 31.2 writes/sync); I assume this
> is effectively a proxy for the I/O size. I believe that somehow the system is
> doing a much better job of batching the transactions for 16K vs. 4K, and that
> the performance delta we're really seeing between them is related to this.
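>
> The same dumps also let you turn writes/sync into an approximate
> bytes-per-sync (treating WAL "written" divided by "syncs" as the effective
> flush size):
>
>     # Figures from the "Cumulative WAL" lines in the dumps quoted below.
>     runs = {
>         "4K min_alloc":  {"wal_gb": 91.83,  "syncs": 4_111_000},
>         "16K min_alloc": {"wal_gb": 190.02, "syncs": 1_032_000},
>     }
>     for name, r in runs.items():
>         kb_per_sync = r["wal_gb"] * 1024 * 1024 / r["syncs"]
>         print(f"{name}: ~{kb_per_sync:.0f} KB written per sync")
>
> That works out to roughly 23 KB per sync for 4K vs. roughly 193 KB per sync
> for 16K, which is consistent with the batching hypothesis.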
> 
> Mark --> Can we look at the Linux iostat data for this? In particular, can
> you look at the average write I/O size, and at the number of interrupts per
> front-end I/O operation?
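>
> One quick way to get the average write I/O size without a full iostat run is
> to sample /proc/diskstats directly (the device name and interval below are
> placeholders; sectors in that file are 512 bytes):
>
>     import time
>
>     DEVICE = "nvme0n1"        # placeholder: whichever device backs the OSD
>     INTERVAL = 10             # seconds between samples
>
>     def snapshot(dev):
>         """Return (writes completed, sectors written) for dev."""
>         with open("/proc/diskstats") as f:
>             for line in f:
>                 fields = line.split()
>                 if fields[2] == dev:
>                     return int(fields[7]), int(fields[9])
>         raise ValueError(f"device {dev} not found")
>
>     w0, s0 = snapshot(DEVICE)
>     time.sleep(INTERVAL)
>     w1, s1 = snapshot(DEVICE)
>
>     writes = w1 - w0
>     if writes:
>         avg_kb = (s1 - s0) * 512 / writes / 1024
>         print(f"{writes} writes in {INTERVAL}s, average write size ~{avg_kb:.1f} KB")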
> 
> 
> Allen Samuels
> SanDisk | a Western Digital brand
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030 | M: +1 408 780 6416
> allen.samuels@xxxxxxxxxxx
> 
> 
> > -----Original Message-----
> > From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> > Sent: Friday, December 23, 2016 12:05 PM
> > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil
> > <sweil@xxxxxxxxxx>
> > Cc: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; ceph-devel <ceph-
> > devel@xxxxxxxxxxxxxxx>
> > Subject: Re: Bluestore performance bottleneck
> >
> >
> >
> > On 12/23/2016 01:33 PM, Allen Samuels wrote:
> > > The two data points you mention (4K / 16K MinAlloc) yield interesting
> > > numbers. For 4K, you're seeing 22.5K IOPS at 1300% CPU, or 1.7K IOPS per
> > > core. Yet for 16K you're seeing 25K IOPS at 1000% CPU, or 2.5K IOPS per
> > > core. We know that in the main I/O path the 16K case is doing more work
> > > (since it's double-writing the data), yet it yields better CPU usage
> > > overall. We do know that there will be a reduction of compaction for the
> > > 16K case which will save SOME CPU, but I wouldn't have thought that this
> > > would be substantial, since the data is all processed sequentially in
> > > rather large blocks (i.e., the CPU cost of compaction seems to be larger
> > > than expected).
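> > >
> > > (The per-core figures fall straight out of the numbers above; CPU% is
> > > relative to a single core, so 1300% is 13 cores' worth:
> > >
> > >     for label, iops, cpu_pct in (("4K min_alloc", 22_500, 1300),
> > >                                  ("16K min_alloc", 25_000, 1000)):
> > >         print(f"{label}: {iops / (cpu_pct / 100) / 1000:.1f}K IOPS per core")
> > >
> > > which prints 1.7K and 2.5K respectively.)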
> > >
> > > Do we know that you're actually capturing a few compaction cycles with
> > > the 16K test? If not, that might explain some of the difference.
> >
> > I believe so.  Here is a comparison of the 25/50 tests for example.
> > Interesting that there's so much more data compacted in the 4K
> > min_alloc tests.
> >
> > 4k min_alloc:
> >
> > > 2016-12-22 19:33:49.722025 7fb188f21700  3 rocksdb: ------- DUMPING STATS -------
> > > 2016-12-22 19:33:49.722029 7fb188f21700  3 rocksdb:
> > > ** Compaction Stats [default] **
> > > Level    Files   Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
> > > ----------------------------------------------------------------------------------------------------------------------------------------------------------
> > >   L0      8/0    1440.37   2.0      0.0     0.0      0.0      63.1     63.1       0.0   0.0      0.0    183.8       352       400    0.880       0      0
> > >   L1     15/0     881.27   3.4     82.2    61.7     20.5      30.8     10.4       0.0   0.5    128.5     48.2       655        38   17.235    462M    25M
> > >   L2     95/0    5538.28   2.2     55.4     9.0     46.4      51.8      5.3       0.5   5.8     54.3     50.7      1045       136    7.683   1238M    33M
> > >   L3      7/0     458.47   0.0      0.5     0.4      0.1       0.5      0.4       0.0   1.1     59.5     59.5         9         7    1.259     12M      1
> > >  Sum    125/0    8318.40   0.0    138.1    71.2     67.0     146.3     79.3       0.5   2.3     68.6     72.7      2061       581    3.547   1712M    58M
> > >  Int      0/0       0.00   0.0     66.2    38.5     27.6      65.6     38.0       0.0   1.9     87.5     86.8       774       257    3.013    556M    39M
> > > Uptime(secs): 1953.4 total, 1953.4 interval
> > > Flush(GB): cumulative 63.137, interval 35.154
> > > AddFile(GB): cumulative 0.000, interval 0.000
> > > AddFile(Total Files): cumulative 0, interval 0
> > > AddFile(L0 Files): cumulative 0, interval 0
> > > AddFile(Keys): cumulative 0, interval 0
> > > Cumulative compaction: 146.26 GB write, 76.67 MB/s write, 138.13 GB read, 72.41 MB/s read, 2060.5 seconds
> > > Interval compaction: 65.64 GB write, 34.41 MB/s write, 66.16 GB read, 34.68 MB/s read, 774.4 seconds
> > > Stalls(count): 11 level0_slowdown, 11 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 7 total count
> > >
> > > ** DB Stats **
> > > Uptime(secs): 1953.4 total, 603.3 interval
> > > Cumulative writes: 18M writes, 251M keys, 18M commit groups, 1.0 writes per commit group, ingest: 91.83 GB, 48.14 MB/s
> > > Cumulative WAL: 18M writes, 4111K syncs, 4.52 writes per sync, written: 91.83 GB, 48.14 MB/s
> > > Cumulative stall: 00:01:7.797 H:M:S, 3.5 percent
> > > Interval writes: 10M writes, 69M keys, 10M commit groups, 1.0 writes per commit group, ingest: 48121.62 MB, 79.77 MB/s
> > > Interval WAL: 10M writes, 2170K syncs, 4.99 writes per sync, written: 46.99 MB, 79.77 MB/s
> > > Interval stall: 00:00:20.024 H:M:S, 3.3 percent
> >
> >
> > 16k min_alloc:
> >
> > > 2016-12-23 10:20:03.926747 7fef2993d700  3 rocksdb: ------- DUMPING STATS -------
> > > 2016-12-23 10:20:03.926754 7fef2993d700  3 rocksdb:
> > > ** Compaction Stats [default] **
> > > Level    Files   Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
> > > ----------------------------------------------------------------------------------------------------------------------------------------------------------
> > >   L0      3/0     186.38   0.8      0.0     0.0      0.0      49.4     49.4       0.0   0.0      0.0    179.0       283       805    0.351       0      0
> > >   L1     13/0     336.75   1.4     80.5    49.2     31.3      41.4     10.2       0.0   0.8    139.7     71.9       590       135    4.371    399M    53M
> > >   L2     33/0    1933.96   0.8     62.4     9.4     53.0      54.5      1.4       0.4   5.8     72.0     62.8       887       145    6.120   1039M    70M
> > >  Sum     49/0    2457.09   0.0    142.9    58.6     84.3     145.3     61.0       0.4   2.9     83.1     84.5      1760      1085    1.622   1438M   123M
> > >  Int      0/0       0.00   0.0     61.6    25.1     36.5      61.5     25.0       0.0   2.9     87.6     87.4       720       466    1.545    586M    56M
> > > Uptime(secs): 1951.3 total, 1951.3 interval
> > > Flush(GB): cumulative 49.411, interval 21.131
> > > AddFile(GB): cumulative 0.000, interval 0.000
> > > AddFile(Total Files): cumulative 0, interval 0
> > > AddFile(L0 Files): cumulative 0, interval 0
> > > AddFile(Keys): cumulative 0, interval 0
> > > Cumulative compaction: 145.30 GB write, 76.25 MB/s write, 142.90 GB read, 74.99 MB/s read, 1760.2 seconds
> > > Interval compaction: 61.47 GB write, 32.26 MB/s write, 61.59 GB read, 32.32 MB/s read, 720.0 seconds
> > > Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
> > >
> > > ** DB Stats **
> > > Uptime(secs): 1951.3 total, 604.4 interval
> > > Cumulative writes: 32M writes, 260M keys, 32M commit groups, 1.0 writes per commit group, ingest: 190.02 GB, 99.72 MB/s
> > > Cumulative WAL: 32M writes, 1032K syncs, 31.20 writes per sync, written: 190.02 GB, 99.72 MB/s
> > > Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
> > > Interval writes: 14M writes, 99M keys, 14M commit groups, 1.0 writes per commit group, ingest: 84136.97 MB, 139.20 MB/s
> > > Interval WAL: 14M writes, 268K syncs, 52.14 writes per sync, written: 82.17 MB, 139.20 MB/s
> > > Interval stall: 00:00:0.000 H:M:S, 0.0 percent
> >
> > Mark
> >
> > >
> > >
> > > Allen Samuels
> > > SanDisk | a Western Digital brand
> > > 2880 Junction Avenue, San Jose, CA 95134
> > > T: +1 408 801 7030 | M: +1 408 780 6416
> > > allen.samuels@xxxxxxxxxxx
> > >
> > >
> > >> -----Original Message-----
> > >> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> > >> owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> > >> Sent: Friday, December 23, 2016 9:09 AM
> > >> To: Sage Weil <sweil@xxxxxxxxxx>
> > >> Cc: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; ceph-devel <ceph-
> > >> devel@xxxxxxxxxxxxxxx>
> > >> Subject: Re: Bluestore performance bottleneck
> > >>
> > >>>> Try this?
> > >>>>     https://github.com/ceph/ceph/pull/12634
> > >>>
> > >>> Looks like this is most likely reducing the memory usage and
> > >>> increasing performance quite a bit with smaller shard target/max
> > >>> values.  With
> > >>> 25/50 I'm seeing more like 2.6GB RSS memory usage and around 13K
> > >>> iops typically with some (likely rocksdb) stalls.  I'll run
> > >>> through the tests again.
> > >>>
> > >>> Mark
> > >>>
> > >>
> > >> Ok, ran through tests with both 4k and 16k min_alloc/max_alloc/blob
> > >> sizes using master+12629+12634:
> > >>
> > >> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZQzdRU3B1SGZUbDQ
> > >>
> > >> Performance is up in all tests and memory consumption is down
> > >> (especially in the smaller target/max tests).  It looks like 100/200 is
> > >> probably the current optimal configuration on my test setup.  4K
> > >> min_alloc tests hover around 22.5K IOPS with ~1300% CPU usage, and 16K
> > >> min_alloc tests hover around 25K IOPS with ~1000% CPU usage.  I think
> > >> it will be worth spending some time looking at locking in the bitmap
> > >> allocator given the perf traces.  Beyond that, I'm seeing rocksdb show
> > >> up quite a bit in the top CPU consuming functions now, especially CRC32.
> > >>
> > >> Mark
> > >>
> > >>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



