RE: Bluestore performance bottleneck REVISITED

Allen Samuels <Allen.Samuels@xxxxxxxxxxx> · Sun, 25 Dec 2016 07:45:01 +0000

(This is the first time, I've dug into the details of the RocksDB stats, so it's possible I'm misinterpreting) 

Even more interesting.

The 4K and 16K runs both have the same total amount of compaction traffic. I look at the "sum" row and the Read(GB) and Write(GB) columns and I note that they're almost exactly the same for the 4K and 16K runs.

If this is correct, then the hypothesis that the 16K Minalloc is faster due to the reduced cost of the compaction because of smaller metadata size is simply false.

It is true that the static size of the metadata is much smaller for the 4K run (8.3 GB vs. 2.5GB), but this simply doesn't explain the observed results.

When you factor into the equation the higher ingest for the 16K case [ingesting the data and the extra transaction to undo it]  (91GB vs 190GB) with the same amount of compaction (~145GB) traffic, then we need to start thinking about another explanation for the observed data.

I'm still digging into the numbers, but I do notice that the # of writes / sync are radically different (4.5 writes/sync vs. 31.2 writes/sync), I assume this is effectively the same as the I/O size. I believe that somehow the system is doing a much better job of batching the transactions for 16K vs 4K and what we're really seeing as the performance delta between them is related to this.

Mark --> Can we look at the Linux iostat data for this, in particular can you look at the average write I/O size? And the # of interrupts / front-end I/O operation?

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> Sent: Friday, December 23, 2016 12:05 PM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil
> <sweil@xxxxxxxxxx>
> Cc: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; ceph-devel <ceph-
> devel@xxxxxxxxxxxxxxx>
> Subject: Re: Bluestore performance bottleneck
> 
> 
> 
> On 12/23/2016 01:33 PM, Allen Samuels wrote:
> > The two data points you mention (4K / 16K MinAlloc) yield interesting
> numbers. For 4K, you're seeing 22.5K IOPS at 1300% CPU or 1.7K IOPS / core.
> Yet for 16K you're seeing 25K IOPS at 1000% CPU or 2.5 K IOPS/Core.  Yet, we
> know that in the main I/O path that the 16K is doing more work (since it's
> double-writing the data), but is yielding better CPU usage overall. We do
> know that there will be a reduction of compaction for the 16K case which will
> save SOME CPU, but I wouldn't have thought that this would be substantial
> since the data is all processed sequentially in rather large blocks (i.e., the CPU
> cost of compaction seems to be larger than expected).
> >
> > Do we know that you're actually capturing a few compaction cycles with the
> 16K test? If not, that might explain some of the difference.
> 
> I believe so.  Here is a comparison of the 25/50 tests for example.
> Interesting that there's so much more data compacted in the 4K min_alloc
> tests.
> 
> 4k min_alloc:
> 
> > 2016-12-22 19:33:49.722025 7fb188f21700  3 rocksdb: ------- DUMPING
> > STATS -------
> > 2016-12-22 19:33:49.722029 7fb188f21700  3 rocksdb:
> > ** Compaction Stats [default] **
> > Level    Files   Size(MB} Score Read(GB}  Rn(GB} Rnp1(GB} Write(GB}
> Wnew(GB} Moved(GB} W-Amp Rd(MB/s} Wr(MB/s} Comp(sec} Comp(cnt}
> Avg(sec} KeyIn KeyDrop
> > -------------------------------------------------------------------------------------------
> ---------------------------------------------------------------
> >   L0      8/0    1440.37   2.0      0.0     0.0      0.0      63.1     63.1       0.0   0.0      0.0
> 183.8       352       400    0.880       0      0
> >   L1     15/0     881.27   3.4     82.2    61.7     20.5      30.8     10.4       0.0   0.5    128.5
> 48.2       655        38   17.235    462M    25M
> >   L2     95/0    5538.28   2.2     55.4     9.0     46.4      51.8      5.3       0.5   5.8     54.3
> 50.7      1045       136    7.683   1238M    33M
> >   L3      7/0     458.47   0.0      0.5     0.4      0.1       0.5      0.4       0.0   1.1     59.5
> 59.5         9         7    1.259     12M      1
> >  Sum    125/0    8318.40   0.0    138.1    71.2     67.0     146.3     79.3       0.5   2.3
> 68.6     72.7      2061       581    3.547   1712M    58M
> >  Int      0/0       0.00   0.0     66.2    38.5     27.6      65.6     38.0       0.0   1.9     87.5
> 86.8       774       257    3.013    556M    39M
> > Uptime(secs): 1953.4 total, 1953.4 interval
> > Flush(GB): cumulative 63.137, interval 35.154
> > AddFile(GB): cumulative 0.000, interval 0.000 AddFile(Total Files):
> > cumulative 0, interval 0
> > AddFile(L0 Files): cumulative 0, interval 0
> > AddFile(Keys): cumulative 0, interval 0 Cumulative compaction: 146.26
> > GB write, 76.67 MB/s write, 138.13 GB read, 72.41 MB/s read, 2060.5
> > seconds Interval compaction: 65.64 GB write, 34.41 MB/s write, 66.16
> > GB read, 34.68 MB/s read, 774.4 seconds
> > Stalls(count): 11 level0_slowdown, 11 level0_slowdown_with_compaction,
> > 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for
> > pending_compaction_bytes, 0 slowdown for pending_compaction_bytes,
> 0
> > memtable_compaction, 0 memtable_slowdown, interval 7 total count
> >
> > ** DB Stats **
> > Uptime(secs): 1953.4 total, 603.3 interval Cumulative writes: 18M
> > writes, 251M keys, 18M commit groups, 1.0 writes per commit group,
> > ingest: 91.83 GB, 48.14 MB/s Cumulative WAL: 18M writes, 4111K syncs,
> > 4.52 writes per sync, written: 91.83 GB, 48.14 MB/s Cumulative stall:
> > 00:01:7.797 H:M:S, 3.5 percent Interval writes: 10M writes, 69M keys,
> > 10M commit groups, 1.0 writes per commit group, ingest: 48121.62 MB,
> > 79.77 MB/s Interval WAL: 10M writes, 2170K syncs, 4.99 writes per
> > sync, written: 46.99 MB, 79.77 MB/s Interval stall: 00:00:20.024
> > H:M:S, 3.3 percent
> 
> 
> 16k min_alloc:
> 
> > 2016-12-23 10:20:03.926747 7fef2993d700  3 rocksdb: ------- DUMPING
> > STATS -------
> > 2016-12-23 10:20:03.926754 7fef2993d700  3 rocksdb:
> > ** Compaction Stats [default] **
> > Level    Files   Size(MB} Score Read(GB}  Rn(GB} Rnp1(GB} Write(GB}
> Wnew(GB} Moved(GB} W-Amp Rd(MB/s} Wr(MB/s} Comp(sec} Comp(cnt}
> Avg(sec} KeyIn KeyDrop
> > -------------------------------------------------------------------------------------------
> ---------------------------------------------------------------
> >   L0      3/0     186.38   0.8      0.0     0.0      0.0      49.4     49.4       0.0   0.0      0.0
> 179.0       283       805    0.351       0      0
> >   L1     13/0     336.75   1.4     80.5    49.2     31.3      41.4     10.2       0.0   0.8    139.7
> 71.9       590       135    4.371    399M    53M
> >   L2     33/0    1933.96   0.8     62.4     9.4     53.0      54.5      1.4       0.4   5.8     72.0
> 62.8       887       145    6.120   1039M    70M
> >  Sum     49/0    2457.09   0.0    142.9    58.6     84.3     145.3     61.0       0.4   2.9
> 83.1     84.5      1760      1085    1.622   1438M   123M
> >  Int      0/0       0.00   0.0     61.6    25.1     36.5      61.5     25.0       0.0   2.9     87.6
> 87.4       720       466    1.545    586M    56M
> > Uptime(secs): 1951.3 total, 1951.3 interval
> > Flush(GB): cumulative 49.411, interval 21.131
> > AddFile(GB): cumulative 0.000, interval 0.000 AddFile(Total Files):
> > cumulative 0, interval 0
> > AddFile(L0 Files): cumulative 0, interval 0
> > AddFile(Keys): cumulative 0, interval 0 Cumulative compaction: 145.30
> > GB write, 76.25 MB/s write, 142.90 GB read, 74.99 MB/s read, 1760.2
> > seconds Interval compaction: 61.47 GB write, 32.26 MB/s write, 61.59
> > GB read, 32.32 MB/s read, 720.0 seconds
> > Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0
> > level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for
> > pending_compaction_bytes, 0 slowdown for pending_compaction_bytes,
> 0
> > memtable_compaction, 0 memtable_slowdown, interval 0 total count
> >
> > ** DB Stats **
> > Uptime(secs): 1951.3 total, 604.4 interval Cumulative writes: 32M
> > writes, 260M keys, 32M commit groups, 1.0 writes per commit group,
> > ingest: 190.02 GB, 99.72 MB/s Cumulative WAL: 32M writes, 1032K syncs,
> > 31.20 writes per sync, written: 190.02 GB, 99.72 MB/s Cumulative
> > stall: 00:00:0.000 H:M:S, 0.0 percent Interval writes: 14M writes, 99M
> > keys, 14M commit groups, 1.0 writes per commit group, ingest: 84136.97
> > MB, 139.20 MB/s Interval WAL: 14M writes, 268K syncs, 52.14 writes per
> > sync, written: 82.17 MB, 139.20 MB/s Interval stall: 00:00:0.000
> > H:M:S, 0.0 percent
> 
> Mark
> 
> >
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx
> >
> >
> >> -----Original Message-----
> >> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> >> owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> >> Sent: Friday, December 23, 2016 9:09 AM
> >> To: Sage Weil <sweil@xxxxxxxxxx>
> >> Cc: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; ceph-devel <ceph-
> >> devel@xxxxxxxxxxxxxxx>
> >> Subject: Re: Bluestore performance bottleneck
> >>
> >>>> Try this?
> >>>>     https://github.com/ceph/ceph/pull/12634
> >>>
> >>> Looks like this is most likely reducing the memory usage and
> >>> increasing performance quite a bit with smaller shard target/max
> >>> values.  With
> >>> 25/50 I'm seeing more like 2.6GB RSS memory usage and around 13K iops
> >>> typically with some (likely rocksdb) stalls.  I'll run through the
> >>> tests again.
> >>>
> >>> Mark
> >>>
> >>
> >> Ok, Ran through tests with both 4k and 16k min_alloc/max_alloc/blob
> sizes
> >> using master+12629+12634:
> >>
> >>
> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZQzdRU3B
> >> 1SGZUbDQ
> >>
> >> Performance is up in all tests and memory consumption is down
> (especially in
> >> the smaller target/max tests).  It looks like 100/200 is probably the current
> >> optimal configuration on my test setup.  4K min_alloc tests hover around
> >> 22.5K IOPS with ~1300% CPU usage, and 16K min_alloc tests hover around
> >> 25K IOPs with ~1000% CPU usage.  I think it will be worth spending some
> time
> >> looking at locking in the bitmap allocator given the perf traces.  Beyond
> that,
> >> I'm seeing rocksdb show up quite a bit in the top CPU consuming
> functions
> >> now, especially CRC32.
> >>
> >> Mark
> >>
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the
> >> body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info
> at
> >> http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html