Read-out much slower than write-in on my ceph cluster

FaHui Lin <fahui.lin@xxxxxxxxxx> · Wed, 28 Oct 2015 10:59:05 +0800



    Dear Ceph experts,

    
    I found something strange about the performance of my Ceph cluster:
    Read-out much slower than write-in.

    
    I have 3 machines running OSDs, each has 8 OSDs running on 8 raid0s
    (each made up of 2 HDDs) respectively. The OSD journal and data the
    is on the same device.  All machines in my clusters have 10Gb
    network.

    
    I used both Ceph RBD and CephFS, the client on another machine
    outside cluster or on one of the running OSD (to rule out possible
    network issue), an so on. All of these end up in a similar results:
    write-in can almost reach the network limit, say 1200 MB/s, while
    read-out is only 350~450 MB/s.

    
    Trying to figure out, I did an extra test using CephFS:

    
    Version and Config:

    [root@dl-disk1
        ~]# ceph --version

        ceph version 0.94.3
        (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

        [root@dl-disk1 ~]# cat /etc/ceph/ceph.conf

        [global]

        fsid = (hidden)

        mon_initial_members = dl-disk1, dl-disk2, dl-disk3

        mon_host = (hidden)

        auth_cluster_required = cephx

        auth_service_required = cephx

        auth_client_required = cephx

        filestore_xattr_use_omap = true

    
    OSD tree:

    # ceph osd tree

        ID WEIGHT    TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY

        -1 258.88000 root default

        -2  87.28000     host dl-disk1

         0  10.90999         osd.0          up  1.00000          1.00000

         1  10.90999         osd.1          up  1.00000          1.00000

         2  10.90999         osd.2          up  1.00000          1.00000

         3  10.90999         osd.3          up  1.00000          1.00000

         4  10.90999         osd.4          up  1.00000          1.00000

         5  10.90999         osd.5          up  1.00000          1.00000

         6  10.90999         osd.6          up  1.00000          1.00000

         7  10.90999         osd.7          up  1.00000          1.00000

        -3  87.28000     host dl-disk2

         8  10.90999         osd.8          up  1.00000          1.00000

         9  10.90999         osd.9          up  1.00000          1.00000

        10  10.90999         osd.10         up  1.00000          1.00000

        11  10.90999         osd.11         up  1.00000          1.00000

        12  10.90999         osd.12         up  1.00000          1.00000

        13  10.90999         osd.13         up  1.00000          1.00000

        14  10.90999         osd.14         up  1.00000          1.00000

        15  10.90999         osd.15         up  1.00000          1.00000

        -4  84.31999     host dl-disk3

        16  10.53999         osd.16         up  1.00000          1.00000

        17  10.53999         osd.17         up  1.00000          1.00000

        18  10.53999         osd.18         up  1.00000          1.00000

        19  10.53999         osd.19         up  1.00000          1.00000

        20  10.53999         osd.20         up  1.00000          1.00000

        21  10.53999         osd.21         up  1.00000          1.00000

        22  10.53999         osd.22         up  1.00000          1.00000

        23  10.53999         osd.23         up  1.00000          1.00000

    
    Pools and PG (each pool has 128 PGs):

    #
            ceph osd lspools

            0 rbd,2 fs_meta,3 fs_data0,4 fs_data1,

            # ceph pg dump pools

            dumped pools in format plain

            pg_stat objects mip     degr    misp    unf     bytes  
            log     disklog

            pool 0  0       0       0       0       0       0      
            0       0

            pool 2  20      0       0       0       0       356958 
            264     264

            pool 3  3264    0       0       0       0      
            16106127360     14657   14657

            pool 4  0       0       0       0       0       0      
            0       0

    
    To simplify the problem, I made a new crush rule that the CephFS
    data pool use OSDs on only one machine (dl-disk1 here), and size =
    1.

    #
            ceph osd crush rule dump osd_in_dl-disk1__ruleset

            {

                "rule_id": 1,

                "rule_name": "osd_in_dl-disk1__ruleset",

                "ruleset": 1,

                "type": 1,

                "min_size": 1,

                "max_size": 10,

                "steps": [

                    {

                        "op": "take",

                        "item": -2,

                        "item_name": "dl-disk1"

                    },

                    {

                        "op": "chooseleaf_firstn",

                        "num": 0,

                        "type": "osd"

                    },

                    {

                        "op": "emit"

                    }

                ]

            }

            # ceph osd pool get fs_data0 crush_ruleset

            crush_ruleset: 1

            # ceph osd pool get fs_data0 size

            size: 1

    
    Here starts the test.

    On an client machine, I used dd to write a 4GB-file to CephFS, and
    checked dstat on the OSD node dl-disk1:

    [root@client
          ~]# dd of=/mnt/cephfs/4Gfile if=/dev/zero bs=4096k count=1024

          1024+0 records in

          1024+0 records out

          4294967296 bytes (4.3 GB) copied, 3.69993 s, 1.2 GB/s

          
          [root@dl-disk1 ~]# dstat ...

          ---total-cpu-usage---- ------memory-usage----- -net/total-
--dsk/sdb-----dsk/sdc-----dsk/sdd-----dsk/sde-----dsk/sdf-----dsk/sdg-----dsk/sdh-----dsk/sdi--

          usr sys idl wai hiq siq| used  buff  cach  free| recv  send|
          read  writ: read  writ: read  writ: read  writ: read  writ:
          read  writ: read  writ: read  writ

           
            0   0 100   0   0   0|3461M 67.2M 15.1G 44.3G|  19k   20k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3461M 67.2M 15.1G 44.3G|  32k   32k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            8  18  74   0   0   0|3364M 67.2M 11.1G 48.4G| 391k  391k|  
          0  2712k:   0  1096k:   0   556k:   0  1084k:   0  1200k:   0 
          1196k:   0   688k:   0  1252k

            0   0 100   0   0   0|3364M 67.2M 11.1G 48.4G|  82k  127k|  
          0     0 :   0     0 :   0     0 :   0   928k:   0   540k:  
          0     0 :   0     0 :   0     0

            8  16  72   3   0   1|3375M 67.2M 11.8G 47.7G| 718M 2068k|  
          0   120M:   0   172M:   0    76M:   0   220M:   0   188M: 
          16k  289M:   0    53M:   0    36M

            6  13  77   4   0   1|3391M 67.2M 12.3G 47.1G| 553M 1517k|  
          0   160M:   0   176M:   0    88M:   0   208M:   0   225M:  
          0   213M:   0  8208k:   0    49M

            6  13  77   3   0   1|3408M 67.2M 12.9G 46.6G| 544M 1272k|  
          0   212M:   0  8212k:   0    36M:   0     0 :   0    37M:   0 
          3852k:   0   497M:   0   337M

            0   0  99   0   0   0|3407M 67.3M 12.9G 46.6G|  53k  114k|  
          0    36M:   0    37M:   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3407M 67.3M 12.9G 46.6G|  68k  110k|  
          0     0 :   0     0 :   0     0 :   0    36M:   0     0 :  
          0     0 :   0     0 :   0     0

            0   0  99   0   0   0|3407M 67.3M 12.9G 46.6G|  38k  328k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0    36M:   0     0

            0   1  99   0   0   0|3406M 67.3M 12.9G 46.6G|  11M  132k|  
          0     0 :   0     0 :   0  8224k:   0     0 :   0     0 :  
          0    32M:   0     0 :   0    36M

           14  24  52   8   0   2|3436M 67.3M 13.8G 45.6G|1026M 2897k|  
          0   100M:   0   409M:   0   164M:   0   313M:   0   253M:  
          0   321M:   0    84M:   0    76M

           14  24  34  27   0   1|3461M 67.3M 14.7G 44.7G| 990M 2565k|  
          0   354M:   0    72M:   0     0 :   0   164M:   0   313M:  
          0   188M:   0   308M:   0   333M

            4   9  70  16   0   0|3474M 67.3M 15.1G 44.3G| 269M  646k|  
          0   324M:   0     0 :   0     0 :   0    36M:   0     0 :  
          0     0 :   0   349M:   0   172M

            0   0  99   0   0   0|3474M 67.3M 15.1G 44.3G|  24k  315k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0    37M:   0     0

            0   0  99   0   0   0|3474M 67.4M 15.1G 44.3G|  38k  102k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0    36M:   0     0 :   0    36M

            0   0  99   0   0   0|3473M 67.4M 15.1G 44.3G|  22k   23k|  
          0     0 :   0     0 :   0    36M:   0     0 :   0    36M:  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3473M 67.4M 15.1G 44.3G|  39k   40k|  
          0   304k:   0    16k:   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3472M 67.4M 15.1G 44.3G|  28k   64k|  
          0    64M:   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3471M 67.4M 15.1G 44.3G|  31k   94k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3472M 67.4M 15.1G 44.3G|  38k   39k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

    
    The throughput is 1.2 GB/s, able to reach the network limit 10Gb.

    
    Then, on the client machine, I used dd to read that file back from
    CephFS, redirecting the file to /dev/zero (or /dev/null) to rule out
    local HDD's IO:

    [root@client
          ~]# dd if=/mnt/cephfs/4Gfile of=/dev/zero bs=4096k count=1024

          1024+0 records in

          1024+0 records out

          4294967296 bytes (4.3 GB) copied, 8.85246 s, 485 MB/s

        
    [root@dl-disk1
                ~]# dstat ...

            0   0 100   0   0   0|3462M 67.4M 15.1G 44.3G|  36k   36k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3462M 67.4M 15.1G 44.3G|  22k   22k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3463M 67.4M 15.1G 44.3G|  49k   49k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   1  99   0   0   0|3464M 67.4M 15.1G 44.3G| 282k  111M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            2   5  93   0   0   0|3466M 67.4M 15.1G 44.3G|1171k  535M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            2   5  93   0   0   0|3467M 67.4M 15.1G 44.3G|1124k  535M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            2   4  94   0   0   0|3467M 67.4M 15.1G 44.3G|1124k  535M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            2   4  94   0   0   0|3467M 67.4M 15.1G 44.3G|1109k  527M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            2   4  93   0   0   0|3471M 67.4M 15.1G 44.3G|1044k  504M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            2   4  94   0   0   0|3470M 67.4M 15.1G 44.3G|1031k  504M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            2   5  93   0   0   0|3470M 67.4M 15.1G 44.3G|1103k  527M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            2   4  93   0   0   0|3471M 67.5M 15.1G 44.3G|1084k  504M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3470M 67.5M 15.1G 44.3G|  25k   24k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

          ----total-cpu-usage---- ------memory-usage----- -net/total-
--dsk/sdb-----dsk/sdc-----dsk/sdd-----dsk/sde-----dsk/sdf-----dsk/sdg-----dsk/sdh-----dsk/sdi--

          usr sys idl wai hiq siq| used  buff  cach  free| recv  send|
          read  writ: read  writ: read  writ: read  writ: read  writ:
          read  writ: read  writ: read  writ

            0   0 100   0   0   0|3470M 67.5M 15.1G 44.3G|  43k   44k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3470M 67.5M 15.1G 44.3G|  22k   23k|  
          0    48k:   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3469M 67.5M 15.1G 44.3G|  35k   38k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3469M 67.5M 15.1G 44.3G|  23k   85k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3469M 67.5M 15.1G 44.3G|  44k   44k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3469M 67.5M 15.1G 44.3G|  24k   25k|  
          0    12k:   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3469M 67.5M 15.1G 44.3G|  45k   43k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3468M 67.5M 15.1G 44.3G|  17k   18k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

        
    The throughput here was only 400~500 MB/s here.

    I noticed that there was NO disk I/O during the read-out, that means
    all the objects of the file were already cached in memory on the OSD
    node.

    Thus, HDDs does NOT seem to cause the lower throughput.

    
    I also tried read-out using cat  (in case dd may not use read-ahead
    in file system. ), ended up getting similar result:

    
    [root@client
          ~]# time cat /mnt/cephfs/4Gfile > /dev/zero

          
          real    0m9.352s

          user    0m0.002s

          sys     0m4.147s

          
    [root@dl-disk1
                ~]# dstat ...

            0   0 100   0   0   0|3465M 67.5M 15.1G 44.3G|  23k   22k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3465M 67.5M 15.1G 44.3G|  17k   18k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3465M 67.5M 15.1G 44.3G|  37k   37k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            1   2  97   0   0   0|3466M 67.5M 15.1G 44.3G| 633k  280M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            2   4  94   0   0   0|3467M 67.5M 15.1G 44.3G|1057k  498M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            2   4  94   0   0   0|3470M 67.5M 15.1G 44.3G|1078k  498M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            2   4  94   0   0   0|3470M 67.5M 15.1G 44.3G| 996k  486M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            2   4  94   0   0   0|3469M 67.5M 15.1G 44.3G| 988k  489M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            2   4  94   0   0   0|3469M 67.5M 15.1G 44.3G|1012k  489M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            2   4  94   0   0   0|3470M 67.5M 15.1G 44.3G|1017k  497M|  
          0     0 :   0  8192B:   0    28k:   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            2   4  94   0   0   0|3469M 67.5M 15.1G 44.3G|1032k  498M|  
          0     0 :   0     0 :   0     0 :   0  8192B:   0   104k:  
          0     0 :   0     0 :   0     0

          ----total-cpu-usage---- ------memory-usage----- -net/total-
--dsk/sdb-----dsk/sdc-----dsk/sdd-----dsk/sde-----dsk/sdf-----dsk/sdg-----dsk/sdh-----dsk/sdi--

          usr sys idl wai hiq siq| used  buff  cach  free| recv  send|
          read  writ: read  writ: read  writ: read  writ: read  writ:
          read  writ: read  writ: read  writ

            2   4  94   0   0   0|3469M 67.5M 15.1G 44.3G|1025k  496M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0    40k:   0    80k:   0     0

            0   1  99   0   0   0|3469M 67.5M 15.1G 44.3G| 127k   52M|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0   120k

            0   0 100   0   0   0|3469M 67.5M 15.1G 44.3G|  21k   21k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3469M 67.5M 15.1G 44.3G|  66k   66k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

            0   0 100   0   0   0|3469M 67.5M 15.1G 44.3G|  35k   38k|  
          0     0 :   0     0 :   0     0 :   0     0 :   0     0 :  
          0     0 :   0     0 :   0     0

          
    The average throughput is 4GB / 9.35s = 438 MB/s. Still, unlikely to
    be HDD's issue.

    
    I'm sure that the network can reach 10Gb in both ways via iperf or
    other test, and there's no other user process occupying bandwidth.

    
    Could you please help me some to find out the main reason for this
    issue? Thank you.

    
    Best Regards,

    FaHui

    
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com