On 12/23/2016 01:33 PM, Allen Samuels wrote:
The two data points you mention (4K / 16K min_alloc) yield interesting numbers. For 4K you're seeing 22.5K IOPS at 1300% CPU, i.e. 1.7K IOPS/core; for 16K you're seeing 25K IOPS at 1000% CPU, i.e. 2.5K IOPS/core. Yet we know that in the main I/O path the 16K case is doing more work (since it's double-writing the data), but it yields better CPU efficiency overall. We do know that there will be a reduction of compaction in the 16K case, which will save SOME CPU, but I wouldn't have thought the savings would be substantial, since the data is all processed sequentially in rather large blocks (i.e., the CPU cost of compaction seems to be larger than expected).
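A quick sanity check of that arithmetic (a Python sketch, assuming "N% CPU" means N/100 fully-busy cores):

# Back-of-envelope IOPS-per-core from the numbers above.
def iops_per_core(iops, cpu_percent):
    """Divide IOPS by the number of fully-busy cores (CPU% / 100)."""
    return iops / (cpu_percent / 100.0)

print(iops_per_core(22500, 1300))  # 4K min_alloc:  ~1731 IOPS/core
print(iops_per_core(25000, 1000))  # 16K min_alloc:  2500 IOPS/core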
Do we know that you're actually capturing a few compaction cycles with the 16K test? If not, that might explain some of the difference.
I believe so. Here is a comparison of the 25/50 tests, for example. Interesting that there's so much more data compacted in the 4K min_alloc tests.
4k min_alloc:
2016-12-22 19:33:49.722025 7fb188f21700 3 rocksdb: ------- DUMPING STATS -------
2016-12-22 19:33:49.722029 7fb188f21700 3 rocksdb:
** Compaction Stats [default] **
Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
L0 8/0 1440.37 2.0 0.0 0.0 0.0 63.1 63.1 0.0 0.0 0.0 183.8 352 400 0.880 0 0
L1 15/0 881.27 3.4 82.2 61.7 20.5 30.8 10.4 0.0 0.5 128.5 48.2 655 38 17.235 462M 25M
L2 95/0 5538.28 2.2 55.4 9.0 46.4 51.8 5.3 0.5 5.8 54.3 50.7 1045 136 7.683 1238M 33M
L3 7/0 458.47 0.0 0.5 0.4 0.1 0.5 0.4 0.0 1.1 59.5 59.5 9 7 1.259 12M 1
Sum 125/0 8318.40 0.0 138.1 71.2 67.0 146.3 79.3 0.5 2.3 68.6 72.7 2061 581 3.547 1712M 58M
Int 0/0 0.00 0.0 66.2 38.5 27.6 65.6 38.0 0.0 1.9 87.5 86.8 774 257 3.013 556M 39M
Uptime(secs): 1953.4 total, 1953.4 interval
Flush(GB): cumulative 63.137, interval 35.154
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 146.26 GB write, 76.67 MB/s write, 138.13 GB read, 72.41 MB/s read, 2060.5 seconds
Interval compaction: 65.64 GB write, 34.41 MB/s write, 66.16 GB read, 34.68 MB/s read, 774.4 seconds
Stalls(count): 11 level0_slowdown, 11 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 7 total count
** DB Stats **
Uptime(secs): 1953.4 total, 603.3 interval
Cumulative writes: 18M writes, 251M keys, 18M commit groups, 1.0 writes per commit group, ingest: 91.83 GB, 48.14 MB/s
Cumulative WAL: 18M writes, 4111K syncs, 4.52 writes per sync, written: 91.83 GB, 48.14 MB/s
Cumulative stall: 00:01:7.797 H:M:S, 3.5 percent
Interval writes: 10M writes, 69M keys, 10M commit groups, 1.0 writes per commit group, ingest: 48121.62 MB, 79.77 MB/s
Interval WAL: 10M writes, 2170K syncs, 4.99 writes per sync, written: 46.99 GB, 79.77 MB/s
Interval stall: 00:00:20.024 H:M:S, 3.3 percent
16k min_alloc:
2016-12-23 10:20:03.926747 7fef2993d700 3 rocksdb: ------- DUMPING STATS -------
2016-12-23 10:20:03.926754 7fef2993d700 3 rocksdb:
** Compaction Stats [default] **
Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
L0 3/0 186.38 0.8 0.0 0.0 0.0 49.4 49.4 0.0 0.0 0.0 179.0 283 805 0.351 0 0
L1 13/0 336.75 1.4 80.5 49.2 31.3 41.4 10.2 0.0 0.8 139.7 71.9 590 135 4.371 399M 53M
L2 33/0 1933.96 0.8 62.4 9.4 53.0 54.5 1.4 0.4 5.8 72.0 62.8 887 145 6.120 1039M 70M
Sum 49/0 2457.09 0.0 142.9 58.6 84.3 145.3 61.0 0.4 2.9 83.1 84.5 1760 1085 1.622 1438M 123M
Int 0/0 0.00 0.0 61.6 25.1 36.5 61.5 25.0 0.0 2.9 87.6 87.4 720 466 1.545 586M 56M
Uptime(secs): 1951.3 total, 1951.3 interval
Flush(GB): cumulative 49.411, interval 21.131
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 145.30 GB write, 76.25 MB/s write, 142.90 GB read, 74.99 MB/s read, 1760.2 seconds
Interval compaction: 61.47 GB write, 32.26 MB/s write, 61.59 GB read, 32.32 MB/s read, 720.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
** DB Stats **
Uptime(secs): 1951.3 total, 604.4 interval
Cumulative writes: 32M writes, 260M keys, 32M commit groups, 1.0 writes per commit group, ingest: 190.02 GB, 99.72 MB/s
Cumulative WAL: 32M writes, 1032K syncs, 31.20 writes per sync, written: 190.02 GB, 99.72 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 14M writes, 99M keys, 14M commit groups, 1.0 writes per commit group, ingest: 84136.97 MB, 139.20 MB/s
Interval WAL: 14M writes, 268K syncs, 52.14 writes per sync, written: 82.17 GB, 139.20 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
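To put the two dumps side by side: the 16K run ingests roughly twice as much data into rocksdb (presumably the double-written deferred data) yet spends less time compacting and never stalls. A small sketch of that comparison, using the cumulative figures transcribed from the dumps above:

# Cumulative figures copied from the two stats dumps above.
runs = {
    "4k min_alloc":  {"ingest_gb": 91.83,  "compact_write_gb": 146.26,
                      "compact_sec": 2060.5, "stall_pct": 3.5},
    "16k min_alloc": {"ingest_gb": 190.02, "compact_write_gb": 145.30,
                      "compact_sec": 1760.2, "stall_pct": 0.0},
}
for name, r in runs.items():
    # Compaction write-amplification relative to what rocksdb ingested.
    wamp = r["compact_write_gb"] / r["ingest_gb"]
    print("%s: write-amp %.2f, %.0fs compacting, %.1f%% stalled"
          % (name, wamp, r["compact_sec"], r["stall_pct"]))
# 4k min_alloc:  write-amp 1.59, 2061s compacting, 3.5% stalled
# 16k min_alloc: write-amp 0.76, 1760s compacting, 0.0% stalled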
Mark
Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx
-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
Sent: Friday, December 23, 2016 9:09 AM
To: Sage Weil <sweil@xxxxxxxxxx>
Cc: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
Subject: Re: Bluestore performance bottleneck
Try this?
https://github.com/ceph/ceph/pull/12634
Looks like this is most likely reducing the memory usage and increasing performance quite a bit with smaller shard target/max values. With 25/50 I'm seeing more like 2.6GB RSS memory usage and around 13K IOPS typically, with some (likely rocksdb) stalls. I'll run through the tests again.
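For anyone reproducing this: "25/50" refers to the extent map shard target/max sizes. A hypothetical ceph.conf fragment; the option names below are my guess at the tunables from the sharding work, not copied from PR 12634:

[osd]
# Assumed option names for the extent map sharding tunables;
# verify against the PR before using.
bluestore_extent_map_shard_target_size = 25
bluestore_extent_map_shard_max_size = 50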
Mark
OK, ran through tests with both 4k and 16k min_alloc/max_alloc/blob sizes using master+12629+12634:
https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZQzdRU3B1SGZUbDQ
Performance is up in all tests and memory consumption is down (especially in the smaller target/max tests). It looks like 100/200 is probably the current optimal configuration on my test setup: 4K min_alloc tests hover around 22.5K IOPS with ~1300% CPU usage, and 16K min_alloc tests hover around 25K IOPS with ~1000% CPU usage. I think it will be worth spending some time looking at locking in the bitmap allocator, given the perf traces. Beyond that, I'm seeing rocksdb show up quite a bit in the top CPU-consuming functions now, especially CRC32.
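A hypothetical ceph.conf fragment for the best-performing combination above (100/200 shard target/max with 16K min_alloc); again, the option names are assumed rather than quoted from the thread:

[osd]
# Assumed option names; the blob-size tunable in particular has been
# renamed across releases, so treat this as a sketch only.
bluestore_min_alloc_size = 16384
bluestore_max_alloc_size = 16384
bluestore_extent_map_shard_target_size = 100
bluestore_extent_map_shard_max_size = 200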
Mark