Re: Squeezing Performance of CEPH

Massimiliano Cuttini <max@xxxxxxxxxxxxx> · Fri, 23 Jun 2017 10:24:47 +0200



    Hi Mark,
    having only 2 nodes for testing allows me to lower the replication
      to 2x (until we go to production).

      The SSDs have the following specifications:

    
      sequential read: 540MB/sec
      sequential write: 520MB/sec

      
    Following your reasoning, my per-SSD sequential write should be:
    
      ~600 * 2 (copies) * 2 (journal write per copy) / 8 (ssds) =
        ~300MB/s
    
    If you think that the 2 copies should be written simultaneously on
          different cards/networks/nodes, my calculation is:
    
      ~600 * 2 (journal write per copy) / 8 (ssds) = ~150MB/s
    
    So yes, I think these numbers are
      terribly low (but maybe I'm missing something): about 29% of the
      theoretical sequential write speed of an SSD (520MB/s).
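    A minimal sketch of the per-SSD write arithmetic above, assuming (as
    in Mark's model) that every replica lands on a different OSD, each
    OSD journals the full data on the same SSD (2 writes per copy), and
    the load spreads evenly over all 8 SSDs. Under these assumptions the
    replica count stays in the numerator even when copies are written in
    parallel on different nodes, because each copy still costs bandwidth
    on some SSD. The function name is illustrative only.

```python
# Per-SSD write load implied by a client write rate on filestore.
# Assumptions: one replica per OSD, full data journaling on the same SSD
# (journal_factor=2), and an even spread of load over num_ssds devices.

def per_ssd_write_mb_s(client_mb_s: float, replicas: int,
                       journal_factor: int = 2, num_ssds: int = 8) -> float:
    """Return the average write throughput each SSD must absorb."""
    return client_mb_s * replicas * journal_factor / num_ssds

RATED_SEQ_WRITE_MB_S = 520  # from the SSD specifications above

for replicas in (2, 3):
    load = per_ssd_write_mb_s(600, replicas)
    share = load / RATED_SEQ_WRITE_MB_S
    print(f"{replicas}x replication: ~{load:.0f} MB/s per SSD "
          f"({share:.0%} of the rated sequential write)")
```

    With 2x replication this gives ~300MB/s per SSD; with 3x it gives
    Mark's ~450MB/s.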

      Sequential reads are quite low too.

      Perhaps only random reads are good.

    
    Any suggestions?

    
    On 22/06/2017 19:41, Mark Nelson wrote:

    
    Hello Massimiliano,
      

      Based on the configuration below, it appears you have 8 SSDs total
      (2 nodes with 4 SSDs each)?
      

      I'm going to assume you have 3x replication and that you are using
      filestore, so in reality you are writing 3 copies and doing full
      data journaling for each copy, for 6x writes per client write.
      Taking this into account, your per-SSD throughput should be
      somewhere around:
      

      Sequential write:
      

      ~600 * 3 (copies) * 2 (journal write per copy) / 8 (ssds) =
      ~450MB/s
      

      Sequential read
      

      ~3000 / 8 (ssds) = ~375MB/s
      

      Random read
      

      ~3337 / 8 (ssds) = ~417MB/s
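      The three estimates above can be reproduced with a short sketch.
      It assumes an even spread over 8 SSDs, that each read is served
      once (by the primary OSD, so no amplification), and that writes are
      amplified 3 (replicas) x 2 (journal + data). The S3700 limits are
      the datasheet maxima quoted below; using the sequential-read figure
      as the ceiling for 4MB random reads is an assumption.

```python
# Per-SSD load implied by the rados bench client bandwidths (MB/s).
# Assumptions: even spread over NUM_SSDS, reads served once, writes
# amplified by 3 replicas x 2 (filestore journal + data).
NUM_SSDS = 8
CLIENT_MB_S = {"seq_write": 600, "seq_read": 3000, "rand_read": 3337}
AMPLIFICATION = {"seq_write": 3 * 2, "seq_read": 1, "rand_read": 1}

# Intel DC S3700 maxima (MB/s); rand_read reuses the sequential-read figure.
S3700_LIMIT = {"seq_write": 460, "seq_read": 500, "rand_read": 500}

def per_ssd_estimates() -> dict:
    """Map each test to (MB/s per SSD, fraction of the S3700 limit)."""
    result = {}
    for name, client in CLIENT_MB_S.items():
        per_ssd = client * AMPLIFICATION[name] / NUM_SSDS
        result[name] = (per_ssd, per_ssd / S3700_LIMIT[name])
    return result

for name, (mb_s, frac) in per_ssd_estimates().items():
    print(f"{name:10s} ~{mb_s:.0f} MB/s per SSD ({frac:.0%} of an S3700)")
```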
      

      These numbers are pretty reasonable for SATA-based SSDs, though
      the read throughput is a little low.  You didn't include the SSD
      model, but take Intel's DC S3700, which is a fairly popular SSD
      for Ceph:
      

https://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3700-spec.html
      

      Sequential reads go up to ~500MB/s and sequential writes up to
      460MB/s.  Not too far off from what you are seeing.  You might
      try playing with readahead on the OSD devices to see if that
      improves things at all.  Still, unless I've missed something, these
      numbers aren't terrible.
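      One common way to inspect or change readahead on Linux is the
      per-device sysfs file /sys/block/<dev>/queue/read_ahead_kb (value
      in KiB). A minimal sketch, assuming the OSD data disks are named
      sdb to sde (hypothetical names; check your own layout). Writing the
      file requires root, and the sysfs_root parameter exists only so the
      helpers can be exercised against a scratch directory.

```python
# Read or set block-device readahead through sysfs (value in KiB).
# Device names below are hypothetical placeholders for the OSD disks.
from pathlib import Path

def readahead_path(dev: str, sysfs_root: str = "/sys") -> Path:
    return Path(sysfs_root) / "block" / dev / "queue" / "read_ahead_kb"

def get_readahead_kb(dev: str, sysfs_root: str = "/sys") -> int:
    return int(readahead_path(dev, sysfs_root).read_text().strip())

def set_readahead_kb(dev: str, kb: int, sysfs_root: str = "/sys") -> None:
    if kb < 0:
        raise ValueError("readahead must be non-negative")
    readahead_path(dev, sysfs_root).write_text(f"{kb}\n")  # needs root on /sys

if __name__ == "__main__":
    for dev in ("sdb", "sdc", "sdd", "sde"):  # hypothetical OSD devices
        try:
            print(dev, get_readahead_kb(dev), "KiB")
        except OSError:
            print(dev, "not readable or not present")
```

      From the shell, blockdev --getra / --setra does the same job, but
      counts 512-byte sectors rather than KiB.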
      

      Mark
      

      On 06/22/2017 12:19 PM, Massimiliano Cuttini wrote:
      

      Hi everybody,
        

        I want to squeeze every bit of performance out of CEPH (we are
        using Jewel 10.2.7).
        

        We are running a test environment with 2 nodes that have the
        same configuration:
        

          * CentOS 7.3
          * 24 CPUs (12 physical cores with hyper-threading)
          * 32GB of RAM
          * 2x 100Gbit/s Ethernet cards
          * 2x SSDs in RAID, dedicated to the OS
          * 4x SATA 6Gbit/s SSDs for the OSDs
        

        We are already expecting the following bottlenecks:

          * [ SATA speed x no. of disks ] = 6Gbit/s x 4 = 24Gbit/s
          * [ network speed x no. of bonded cards ] = 100Gbit/s x 2 = 200Gbit/s
        

        So the lower of the two is 24Gbit/s per node (not taking into
        account protocol overhead).

        24Gbit/s per node x 2 nodes = 48Gbit/s of maximum hypothetical
        gross theoretical speed.
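        The bottleneck arithmetic above, as a small sketch. With
        sata_efficiency=1.0 it reproduces the line-rate estimate (48Gbit/s,
        i.e. 6000MB/s for the cluster). SATA 6Gbit/s uses 8b/10b encoding,
        so the usable payload per port is closer to 600MB/s; passing 0.8
        models that. Network protocol overhead is still ignored, and the
        constant names are illustrative.

```python
# Per-node and cluster throughput ceilings from link rates alone.
SATA_GBIT = 6            # per OSD disk (SATA III line rate)
OSD_DISKS_PER_NODE = 4
NIC_GBIT = 100
NICS_PER_NODE = 2
NODES = 2

def node_ceiling_gbit(sata_efficiency: float = 1.0) -> float:
    disks = SATA_GBIT * OSD_DISKS_PER_NODE * sata_efficiency
    network = NIC_GBIT * NICS_PER_NODE
    return min(disks, network)   # the slower side bounds the node

def cluster_ceiling_mb_s(sata_efficiency: float = 1.0) -> float:
    gbit = node_ceiling_gbit(sata_efficiency) * NODES
    return gbit * 1000 / 8       # decimal megabytes per second

print(cluster_ceiling_mb_s())      # line rate only
print(cluster_ceiling_mb_s(0.8))   # after 8b/10b encoding
```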
        

        Here are the tests:
        

        ///////IPERF2/////// The tests are quite good, scoring about 88%
        of a single 100Gbit/s link.

        Note: iperf2 can use only one link of a bond (it's a well-known
        issue).
        

            [ ID] Interval       Transfer     Bandwidth
            [ 12]  0.0-10.0 sec  9.55 GBytes  8.21 Gbits/sec
            [  3]  0.0-10.0 sec  10.3 GBytes  8.81 Gbits/sec
            [  5]  0.0-10.0 sec  9.54 GBytes  8.19 Gbits/sec
            [  7]  0.0-10.0 sec  9.52 GBytes  8.18 Gbits/sec
            [  6]  0.0-10.0 sec  9.96 GBytes  8.56 Gbits/sec
            [  8]  0.0-10.0 sec  12.1 GBytes  10.4 Gbits/sec
            [  9]  0.0-10.0 sec  12.3 GBytes  10.6 Gbits/sec
            [ 10]  0.0-10.0 sec  10.2 GBytes  8.80 Gbits/sec
            [ 11]  0.0-10.0 sec  9.34 GBytes  8.02 Gbits/sec
            [  4]  0.0-10.0 sec  10.3 GBytes  8.82 Gbits/sec
            [SUM]  0.0-10.0 sec   103 GBytes  88.6 Gbits/sec
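        A quick check of the iperf2 summary: summing the ten per-stream
        rates (presumably started with iperf2's -P option) and dividing by
        the capacity of one 100Gbit/s link, on the assumption stated in the
        note above that only one bond member carries the traffic.

```python
# Sum per-stream iperf2 bandwidths and compare with one 100 Gbit/s link.
streams_gbit = [8.21, 8.81, 8.19, 8.18, 8.56, 10.4, 10.6, 8.80, 8.02, 8.82]
LINK_GBIT = 100  # assumption: a single bond member carries all streams

total = sum(streams_gbit)
print(f"{len(streams_gbit)} streams, {total:.1f} Gbit/s, "
      f"{total / LINK_GBIT:.1%} of one link")
```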
        

        ///////RADOS BENCH

        Taking into consideration the maximum hypothetical speed (mhs) of
        48Gbit/s (due to the disk bottleneck), the test results are not
        good enough:

          * Average write bandwidth is about 5Gbit/s (roughly 10% of the
            mhs)
          * Average sequential read bandwidth is almost 24Gbit/s (50% of
            the mhs)
          * Average random read bandwidth is almost 27Gbit/s (about 56% of
            the mhs).
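        A sketch of the unit conversion behind these percentages. rados
        bench's "MB" appears to be 2**20 bytes: the write report gives
        1538 writes x 4194304 bytes / 10.229369 s = 601.4 of those units
        per second, matching its "Bandwidth (MB/sec)" line. Converting with
        2**20 therefore gives slightly higher Gbit/s figures than
        multiplying the MB/s value by 8.

```python
# Convert rados bench bandwidth (MiB/s) to Gbit/s and to a share of the
# 48 Gbit/s ceiling estimated above.
MIB = 2 ** 20
CEILING_GBIT = 48

def mib_s_to_gbit_s(mib_s: float) -> float:
    return mib_s * MIB * 8 / 1e9

results_mib_s = {"write": 601.406, "seq read": 2994.61, "rand read": 3337.71}
for name, bw in results_mib_s.items():
    gbit = mib_s_to_gbit_s(bw)
    print(f"{name:9s} {gbit:5.1f} Gbit/s ({gbit / CEILING_GBIT:.0%} of ceiling)")
```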
        

        Here are the reports.
        

        Write:
        

            # rados bench -p scbench 10 write --no-cleanup
            Total time run:         10.229369
            Total writes made:      1538
            Write size:             4194304
            Object size:            4194304
            Bandwidth (MB/sec):     601.406
            Stddev Bandwidth:       357.012
            Max bandwidth (MB/sec): 1080
            Min bandwidth (MB/sec): 204
            Average IOPS:           150
            Stddev IOPS:            89
            Max IOPS:               270
            Min IOPS:               51
            Average Latency(s):     0.106218
            Stddev Latency(s):      0.198735
            Max latency(s):         1.87401
            Min latency(s):         0.0225438
        

        sequential read:
        

            # rados bench -p scbench 10 seq
            Total time run:       2.054359
            Total reads made:     1538
            Read size:            4194304
            Object size:          4194304
            Bandwidth (MB/sec):   2994.61
            Average IOPS          748
            Stddev IOPS:          67
            Max IOPS:             802
            Min IOPS:             707
            Average Latency(s):   0.0202177
            Max latency(s):       0.223319
            Min latency(s):       0.00589238
        

        random read:
        

            # rados bench -p scbench 10 rand
            Total time run:       10.036816
            Total reads made:     8375
            Read size:            4194304
            Object size:          4194304
            Bandwidth (MB/sec):   3337.71
            Average IOPS:         834
            Stddev IOPS:          78
            Max IOPS:             927
            Min IOPS:             741
            Average Latency(s):   0.0182707
            Max latency(s):       0.257397
            Min latency(s):       0.00469212
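        A small sketch for cross-checking these reports: it parses the
        "Key: value" summary lines and recomputes bandwidth as
        ops x op size / elapsed time / 2**20. The field names match the
        output above; other Ceph releases may format them differently, and
        lines without a colon (such as "Average IOPS" in the seq report)
        are simply skipped. Note that the seq run appears to end after ~2 s
        because it reads back exactly the 1538 objects left by the write
        run (--no-cleanup).

```python
# Parse a rados bench summary and recompute its bandwidth in MiB/s.
import re

def parse_summary(text: str) -> dict:
    fields = {}
    for line in text.splitlines():
        m = re.match(r"\s*([A-Za-z][^:]*):\s+([-\d.]+)\s*$", line)
        if m:
            fields[m.group(1)] = float(m.group(2))
    return fields

def recomputed_mib_s(fields: dict) -> float:
    ops = fields.get("Total writes made", fields.get("Total reads made"))
    size = fields.get("Write size", fields.get("Read size"))
    return ops * size / fields["Total time run"] / 2 ** 20

SEQ_READ = """
Total time run:       2.054359
Total reads made:     1538
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2994.61
Average IOPS          748
"""

f = parse_summary(SEQ_READ)
print(f"reported {f['Bandwidth (MB/sec)']}, recomputed {recomputed_mib_s(f):.2f}")
```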
        

        //------------------------------------
        

        It seems that there is a bottleneck somewhere that we are
        underestimating.

        Can you help me find it?
        

        _______________________________________________
        

        ceph-users mailing list
        

        ceph-users@xxxxxxxxxxxxxx
        

        http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
        
