Re: messaging/IO/radosbench results

On Thu, Sep 13, 2012 at 12:25:36AM +0200, Mark Nelson wrote:
> On 09/12/2012 03:08 PM, Dieter Kasper wrote:
> > On Mon, Sep 10, 2012 at 10:39:58PM +0200, Mark Nelson wrote:
> >> On 09/10/2012 03:15 PM, Mike Ryan wrote:
> >>> *Disclaimer*: these results are an investigation into potential
> >>> bottlenecks in RADOS.
> > I appreciate this investigation very much !
> >
> >>> The test setup is wholly unrealistic, and these
> >>> numbers SHOULD NOT be used as an indication of the performance of OSDs,
> >>> messaging, RADOS, or ceph in general.
> >>>
> >>>
> >>> Executive summary: rados bench has some internal bottleneck. Once that's
> >>> cleared up, we're still having some issues saturating a single
> >>> connection to an OSD. Having 2-3 connections in parallel alleviates that
> >>> (either by having > 1 OSD or having multiple bencher clients).
> >>>
> >>>
> >>> I've run three separate tests: msbench, smalliobench, and rados bench.
> >>> In all cases I was trying to determine where bottleneck(s) exist. All
> >>> the tests were run on a machine with 192 GB of RAM. The backing stores
> >>> for all OSDs and journals are RAMdisks. The stores are running XFS.
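
(For anyone trying to reproduce this setup, XFS-formatted RAM disks can be
created roughly as below. The brd module parameters, sizes, and mount path are
assumptions, not the exact values used for these tests.)

    # create 4 block-device RAM disks of 16 GiB each (rd_size is in KB)
    modprobe brd rd_nr=4 rd_size=16777216
    # format one of them with XFS and mount it as an OSD backing store
    mkfs.xfs -f /dev/ram0
    mount /dev/ram0 /var/lib/ceph/osd/ceph-0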
> >>>
> >>> smalliobench: I ran tests varying the number of OSDs and bencher
> >>> clients. In all cases, the number of PGs per OSD is 100.
> >>>
> >>> OSD     Bencher     Throughput (mbyte/sec)
> >>> 1       1           510
> >>> 1       2           800
> >>> 1       3           850
> >>> 2       1           640
> >>> 2       2           660
> >>> 2       3           670
> >>> 3       1           780
> >>> 3       2           820
> >>> 3       3           870
> >>> 4       1           850
> >>> 4       2           970
> >>> 4       3           990
> >>>
> >>> Note: these numbers are fairly fuzzy. I eyeballed them and they're only
> >>> accurate to within about 10 mbyte/sec. The small IO bencher was run with
> >>> 100 ops in flight, 4 mbyte IOs, and 4 mbyte files.
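
(As a rough sketch of the pool sizing mentioned above, assuming 4 OSDs and
100 PGs per OSD; the pool name here is just a placeholder.)

    # 4 OSDs * 100 PGs = 400 placement groups
    ceph osd pool create smallio_test 400 400
    ceph osd pool get smallio_test pg_num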
> >>>
> >>> msbench: ran tests trying to determine the max throughput of the raw
> >>> messaging layer. Varied the number of concurrently connected msbench
> >>> clients and measured aggregate throughput. Take-away: a messaging client
> >>> can very consistently push 400-500 mbyte/sec through a single socket.
> >>>
> >>> Clients     Throughput (mbyte/sec)
> >>> 1           520
> >>> 2           880
> >>> 3           1300
> >>> 4           1900
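
(For comparison with the per-socket numbers above, raw TCP throughput on the
same link can be sanity-checked with iperf; hostnames below are placeholders.)

    # on the receiving node
    iperf -s
    # on the sending node: 60 second run, report in MByte/s
    iperf -c osd-node -t 60 -f M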
> >>>
> >>> Finally, rados bench, which seems to have its own bottleneck. Running
> >>> varying numbers of these, each client seems to get 250 mbyte/sec until
> >>> the aggregate rate is around 1000 mbyte/sec (approx. line speed as measured
> >>> by iperf). These were run on a pool with 100 PGs/OSD.
> >>>
> >>> Clients     Throughput (mbyte/sec)
> >>> 1           250
> >>> 2           500
> >>> 3           750
> >>> 4           1000 (very fuzzy, probably 1000 +/- 75)
> >>> 5           1000, seems to level out here
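
(One quick way to check whether the ~250 mbyte/sec ceiling is per-client is to
run several rados bench instances against the same pool and sum the reported
bandwidths; a sketch only, with pool name and sizes as assumptions.)

    # four concurrent bench clients, 4 MB writes, 16 ops in flight each
    for i in 1 2 3 4; do
        rados bench -p pbench 20 write -b 4194304 -t 16 &
    done
    wait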
> >>
> >> Hi guys,
> >>
> >> Some background on all of this:
> >>
> >> We've been doing some performance testing at Inktank and noticed that
> >> performance with a single rados bench instance was plateauing between
> >> 600 and 700 MB/s.
> >
> > 4 nodes with 10GbE interconnect; journals on RAM disk; replica=2
> >
> > # rados bench -p pbench 20 write
> >   Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
> >     sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >       0       0         0         0         0         0         -         0
> >       1      16       288       272   1087.81      1088  0.051123 0.0571643
> >       2      16       579       563   1125.85      1164  0.045729 0.0561784
> >       3      16       863       847   1129.19      1136  0.042012 0.0560869
> >       4      16      1150      1134   1133.87      1148   0.05466 0.0559281
> >       5      16      1441      1425   1139.87      1164  0.036852 0.0556809
> >       6      16      1733      1717   1144.54      1168  0.054594 0.0556124
> >       7      16      2007      1991   1137.59      1096   0.04454 0.0556698
> >       8      16      2290      2274   1136.88      1132  0.046777 0.0560103
> >       9      16      2580      2564   1139.44      1160  0.073328 0.0559353
> >      10      16      2871      2855   1141.88      1164  0.034091 0.0558576
> >      11      16      3158      3142   1142.43      1148  0.250688 0.0558404
> >      12      16      3445      3429   1142.88      1148  0.046941 0.0558071
> >      13      16      3726      3710   1141.42      1124  0.054092    0.0559
> >      14      16      4014      3998   1142.17      1152   0.03531 0.0558533
> >      15      16      4298      4282   1141.75      1136  0.040005 0.0559383
> >      16      16      4582      4566   1141.39      1136  0.048431 0.0559162
> >      17      16      4859      4843   1139.42      1108  0.045805 0.0559891
> >      18      16      5145      5129   1139.66      1144  0.046805 0.0560177
> >      19      16      5422      5406   1137.99      1108  0.037295 0.0561341
> > 2012-09-08 14:36:32.460311 min lat: 0.029503 max lat: 0.47757 avg lat: 0.0561424
> >     sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >      20      16      5701      5685   1136.89      1116  0.041493 0.0561424
> >   Total time run:         20.197129
> > Total writes made:      5702
> > Write size:             4194304
> > Bandwidth (MB/sec):     1129.269
> >
> > Stddev Bandwidth:       23.7487
> > Max bandwidth (MB/sec): 1168
> > Min bandwidth (MB/sec): 1088
> > Average Latency:        0.0564675
> > Stddev Latency:         0.0327582
> > Max latency:            0.47757
> > Min latency:            0.029503
> >
> >
> > Best Regards,
> > -Dieter
> >
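
(The setup above, with journals on a RAM disk and 2x replication, could be
approximated along these lines; paths, sizes, and the pool name are
assumptions rather than the exact configuration used.)

    # ceph.conf: put each OSD journal on tmpfs, 1 GB per OSD
    cat >> /etc/ceph/ceph.conf <<'EOF'
    [osd]
        osd journal = /dev/shm/ceph-osd.$id.journal
        osd journal size = 1024
    EOF
    # 2 replicas on the benchmark pool
    ceph osd pool set pbench size 2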
> 
> Well look at that! :)  Now I've gotta figure out what the difference is. 
>   How fast are the CPUs in your rados bench machine there?

One CPU socket in each node:
model name      : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logical CPUs: 12
MemTotal:       32856332 kB

> 
> Also, I should mention that at these speeds, we noticed that crc32c 
> calculations were actually having a pretty big effect.  

perf report

Events: 39K cycles
+     26.29%         ceph-osd  ceph-osd                    [.] 0x45e60b                                     
+      4.74%         ceph-osd  [kernel.kallsyms]           [k] copy_user_generic_string                    
+      3.37%         ceph-mon  ceph-mon                    [.] MHeartbeat::decode_payload()               
+      2.88%         ceph-osd  [kernel.kallsyms]           [k] futex_wake                                
+      2.61%          swapper  [kernel.kallsyms]           [k] intel_idle                               
+      2.34%         ceph-osd  [kernel.kallsyms]           [k] __memcpy                                
+      1.71%         ceph-osd  libc-2.11.3.so              [.] memcpy                                 
+      1.70%         ceph-osd  [kernel.kallsyms]           [k] __copy_user_nocache                   
+      1.66%         ceph-osd  [kernel.kallsyms]           [k] futex_requeue                        
+      1.33%         ceph-mon  ceph-mon                    [.] MOSDOpReply::~MOSDOpReply()         
+      1.18%         ceph-mon  libc-2.11.3.so              [.] memcpy                             
+      1.16%         ceph-mon  ceph-mon                    [.] MOSDPGInfo::decode_payload()      
+      0.97%         ceph-osd  [kernel.kallsyms]           [k] futex_wake_op                    
+      0.86%         ceph-mon  ceph-mon                    [.] MExportDirDiscoverAck::print(std::ostream&) const 
+      0.79%         ceph-osd  [kernel.kallsyms]           [k] _raw_spin_lock                                   
+      0.74%         ceph-mon  ceph-mon                    [.] MOSDPing::decode_payload()                      
+      0.52%         ceph-osd  libtcmalloc.so.0.3.0        [.] operator new(unsigned long)                    
+      0.51%         ceph-mon  ceph-mon                    [.] MDiscover::print(std::ostream&) const         
+      0.48%         ceph-osd  [xfs]                       [k] xfs_bmap_add_extent                          
+      0.43%         ceph-mon  [kernel.kallsyms]           [k] copy_user_generic_string                    
+      0.39%         ceph-osd  [kernel.kallsyms]           [k] iov_iter_fault_in_readable                 
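
(A profile like the one above can be collected with something along these
lines while the benchmark is running; the duration and sort keys are just an
example.)

    # sample all CPUs with call graphs for 30 seconds
    perf record -a -g -- sleep 30
    perf report --sort comm,dso,symbol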

Regards,
-Dieter


> Turning them off 
> gave us a 10% performance boost.  We're looking at faster 
> implementations now.
> 
> Mark
> 
> 
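
(If message CRCs need to be switched off for a test run, a config along these
lines should do it; the option name is not verified against this exact
version, so treat "ms nocrc" as an assumption. This also removes on-wire
integrity checks, so it is for benchmarking only.)

    cat >> /etc/ceph/ceph.conf <<'EOF'
    [global]
        # skip crc32c on messenger traffic (testing only)
        ms nocrc = true
    EOF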


