On 09/13/2012 02:24 AM, Dieter Kasper wrote:
On Thu, Sep 13, 2012 at 12:25:36AM +0200, Mark Nelson wrote:
On 09/12/2012 03:08 PM, Dieter Kasper wrote:
On Mon, Sep 10, 2012 at 10:39:58PM +0200, Mark Nelson wrote:
On 09/10/2012 03:15 PM, Mike Ryan wrote:
*Disclaimer*: these results are an investigation into potential
bottlenecks in RADOS.
I appreciate this investigation very much!
The test setup is wholly unrealistic, and these
numbers SHOULD NOT be used as an indication of the performance of OSDs,
messaging, RADOS, or ceph in general.
Executive summary: rados bench has some internal bottleneck. Once that's
cleared up, we're still having some issues saturating a single
connection to an OSD. Having 2-3 connections in parallel alleviates that
(either by having > 1 OSD or by having multiple bencher clients).
I've run three separate tests: msbench, smalliobench, and rados bench.
In all cases I was trying to determine where bottleneck(s) exist. All
the tests were run on a machine with 192 GB of RAM. The backing stores
for all OSDs and journals are RAMdisks. The stores are running XFS.
smalliobench: I ran tests varying the number of OSDs and bencher
clients. In all cases, the number of PGs per OSD was 100.
OSDs  Benchers  Throughput (mbyte/sec)
  1      1         510
  1      2         800
  1      3         850
  2      1         640
  2      2         660
  2      3         670
  3      1         780
  3      2         820
  3      3         870
  4      1         850
  4      2         970
  4      3         990
Note: these numbers are fairly fuzzy. I eyeballed them and they're only
really accurate to about 10 mbyte/sec. The small IO bencher was run with
100 ops in flight, 4 mbyte IOs, and 4 mbyte files.
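For anyone who wants to eyeball the scaling, here is a minimal Python sketch that
recomputes per-bencher throughput and scaling efficiency straight from the table
above (the table values are its only inputs):

# Per-bencher throughput and scaling from the smalliobench table above.
# Keys are (osds, benchers); values are aggregate MB/s as measured.
results = {
    (1, 1): 510, (1, 2): 800, (1, 3): 850,
    (2, 1): 640, (2, 2): 660, (2, 3): 670,
    (3, 1): 780, (3, 2): 820, (3, 3): 870,
    (4, 1): 850, (4, 2): 970, (4, 3): 990,
}

for (osds, benchers), mb_s in sorted(results.items()):
    per_bencher = mb_s / benchers
    # Scaling relative to the single-bencher case for the same OSD count.
    base = results[(osds, 1)]
    efficiency = mb_s / (base * benchers)
    print(f"{osds} OSD(s), {benchers} bencher(s): "
          f"{mb_s} MB/s aggregate, {per_bencher:.0f} MB/s per bencher, "
          f"{efficiency:.0%} of linear scaling")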
msbench: I ran tests to determine the max throughput of the raw messaging
layer. I varied the number of concurrently connected msbench clients and
measured aggregate throughput. Take-away: a messaging client can very
consistently push 400-500 mbyte/sec through a single socket.
Clients  Throughput (mbyte/sec)
   1         520
   2         880
   3        1300
   4        1900
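This isn't msbench itself, but if anyone wants to reproduce the single-socket
ceiling independently, a bare TCP push test along these lines should do; the
localhost address, port, buffer size, and duration are arbitrary assumptions:

# Rough stand-in for a single-socket throughput test (not the actual msbench):
# one thread accepts a TCP connection and drains it, the main thread pushes
# fixed-size buffers for a few seconds and reports MB/s.
import socket, threading, time

HOST, PORT = "127.0.0.1", 5555   # assumed; adjust to taste
BUF = b"\0" * (4 * 1024 * 1024)  # 4 MB writes, mirroring the bench I/O size
SECONDS = 10

def drain():
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((HOST, PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    while conn.recv(1 << 20):
        pass

threading.Thread(target=drain, daemon=True).start()
time.sleep(0.2)                  # give the listener a moment to bind

cli = socket.create_connection((HOST, PORT))
sent, start = 0, time.time()
while time.time() - start < SECONDS:
    cli.sendall(BUF)
    sent += len(BUF)
cli.close()
print(f"{sent / (1024 * 1024) / (time.time() - start):.0f} MB/s over one socket")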
Finally, rados bench, which seems to have its own bottleneck. Running
varying numbers of these, each client gets about 250 mbyte/sec until the
aggregate rate reaches around 1000 mbyte/sec (approximately line speed as
measured by iperf). These were run on a pool with 100 PGs/OSD.
Clients  Throughput (mbyte/sec)
   1         250
   2         500
   3         750
   4        1000  (very fuzzy, probably 1000 +/- 75)
   5        1000, seems to level out here
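To reproduce the multi-client numbers, something like this rough sketch can
fan out several rados bench processes and report each client's summary
bandwidth (the pool name and runtime are assumptions; the aggregate is just
the sum of the per-client figures):

# Launch several "rados bench" clients in parallel against the same pool.
# Each client prints its own "Bandwidth (MB/sec):" summary line; the
# aggregate rate is the sum of those figures.
import subprocess

POOL = "pbench"     # assumed pool name
SECONDS = 20
CLIENTS = 4

procs = [
    subprocess.Popen(
        ["rados", "bench", "-p", POOL, str(SECONDS), "write"],
        stdout=subprocess.PIPE, text=True)
    for _ in range(CLIENTS)
]

for i, p in enumerate(procs):
    out, _ = p.communicate()
    for line in out.splitlines():
        if line.startswith("Bandwidth (MB/sec):"):
            print(f"client {i}: {line.split(':', 1)[1].strip()} MB/s")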
Hi guys,
Some background on all of this:
We've been doing some performance testing at Inktank and noticed that
performance with a single rados bench instance was plateauing between
600 and 700 MB/s.
4 nodes with 10GbE interconnect; journals in RAM-Disk; replica=2
# rados bench -p pbench 20 write
Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 288 272 1087.81 1088 0.051123 0.0571643
2 16 579 563 1125.85 1164 0.045729 0.0561784
3 16 863 847 1129.19 1136 0.042012 0.0560869
4 16 1150 1134 1133.87 1148 0.05466 0.0559281
5 16 1441 1425 1139.87 1164 0.036852 0.0556809
6 16 1733 1717 1144.54 1168 0.054594 0.0556124
7 16 2007 1991 1137.59 1096 0.04454 0.0556698
8 16 2290 2274 1136.88 1132 0.046777 0.0560103
9 16 2580 2564 1139.44 1160 0.073328 0.0559353
10 16 2871 2855 1141.88 1164 0.034091 0.0558576
11 16 3158 3142 1142.43 1148 0.250688 0.0558404
12 16 3445 3429 1142.88 1148 0.046941 0.0558071
13 16 3726 3710 1141.42 1124 0.054092 0.0559
14 16 4014 3998 1142.17 1152 0.03531 0.0558533
15 16 4298 4282 1141.75 1136 0.040005 0.0559383
16 16 4582 4566 1141.39 1136 0.048431 0.0559162
17 16 4859 4843 1139.42 1108 0.045805 0.0559891
18 16 5145 5129 1139.66 1144 0.046805 0.0560177
19 16 5422 5406 1137.99 1108 0.037295 0.0561341
2012-09-08 14:36:32.460311 min lat: 0.029503 max lat: 0.47757 avg lat: 0.0561424
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 16 5701 5685 1136.89 1116 0.041493 0.0561424
Total time run: 20.197129
Total writes made: 5702
Write size: 4194304
Bandwidth (MB/sec): 1129.269
Stddev Bandwidth: 23.7487
Max bandwidth (MB/sec): 1168
Min bandwidth (MB/sec): 1088
Average Latency: 0.0564675
Stddev Latency: 0.0327582
Max latency: 0.47757
Min latency: 0.029503
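As a quick sanity check, the summary bandwidth falls straight out of the
totals above (5702 writes of 4194304 bytes in 20.197129 seconds):

# Sanity check: bandwidth = total data written / total runtime.
writes = 5702
write_size_mb = 4194304 / (1024 * 1024)    # 4 MB per write
runtime = 20.197129
print(writes * write_size_mb / runtime)    # ~1129.27 MB/s, matching the summary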
Best Regards,
-Dieter
Well look at that! :) Now I've gotta figure out what the difference is.
How fast are the CPUs in your rados bench machine there?
One CPU socket in each node:
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logical CPUs: 12
MemTotal: 32856332 kB
I'm using 2x E5-2630L at 2.0GHz. So yours are slightly faster, but not
significantly so. I am running the tests on localhost though, so
perhaps that is having a negative effect rather than a positive one.
Soon I will be testing on 10GbE and bonded 10GbE.
Also, I should mention that at these speeds, we noticed that crc32c
calculations were actually having a pretty big effect.
perf report
Events: 39K cycles
+ 26.29% ceph-osd ceph-osd [.] 0x45e60b
+ 4.74% ceph-osd [kernel.kallsyms] [k] copy_user_generic_string
+ 3.37% ceph-mon ceph-mon [.] MHeartbeat::decode_payload()
+ 2.88% ceph-osd [kernel.kallsyms] [k] futex_wake
+ 2.61% swapper [kernel.kallsyms] [k] intel_idle
+ 2.34% ceph-osd [kernel.kallsyms] [k] __memcpy
+ 1.71% ceph-osd libc-2.11.3.so [.] memcpy
+ 1.70% ceph-osd [kernel.kallsyms] [k] __copy_user_nocache
+ 1.66% ceph-osd [kernel.kallsyms] [k] futex_requeue
+ 1.33% ceph-mon ceph-mon [.] MOSDOpReply::~MOSDOpReply()
+ 1.18% ceph-mon libc-2.11.3.so [.] memcpy
+ 1.16% ceph-mon ceph-mon [.] MOSDPGInfo::decode_payload()
+ 0.97% ceph-osd [kernel.kallsyms] [k] futex_wake_op
+ 0.86% ceph-mon ceph-mon [.] MExportDirDiscoverAck::print(std::ostream&) const
+ 0.79% ceph-osd [kernel.kallsyms] [k] _raw_spin_lock
+ 0.74% ceph-mon ceph-mon [.] MOSDPing::decode_payload()
+ 0.52% ceph-osd libtcmalloc.so.0.3.0 [.] operator new(unsigned long)
+ 0.51% ceph-mon ceph-mon [.] MDiscover::print(std::ostream&) const
+ 0.48% ceph-osd [xfs] [k] xfs_bmap_add_extent
+ 0.43% ceph-mon [kernel.kallsyms] [k] copy_user_generic_string
+ 0.39% ceph-osd [kernel.kallsyms] [k] iov_iter_fault_in_readable
Looks like you are having the same issues I do with user symbols in
ceph-osd not showing up in perf. They show up fine in sysprof for me.
I bet a good chunk of the 26.29% at the top is crc32c calculation.
Regards,
-Dieter
Turning the crc32c calculations off gave us a 10% performance boost.
We're looking at faster implementations now.
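If anyone wants a rough feel for what a software checksum costs at these
message sizes, a sketch like this gives a MB/s figure to compare against the
~1100 MB/s of wire traffic; note it uses zlib.crc32 as a stand-in, since
Python has no built-in crc32c:

# Rough software-checksum throughput test.  zlib.crc32 is a stand-in here
# (Ceph's messenger uses crc32c, a different polynomial), but it gives a
# feel for checksum cost relative to the observed wire rate.
import time, zlib

buf = b"\0" * (4 * 1024 * 1024)    # 4 MB, matching the bench write size
rounds = 256                       # 1 GB total

start = time.time()
crc = 0
for _ in range(rounds):
    crc = zlib.crc32(buf, crc)
elapsed = time.time() - start
print(f"{rounds * 4 / elapsed:.0f} MB/s of crc32 (stand-in for crc32c)")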
Mark