On the RBD performance issue, you may want to look at:
http://tracker.ceph.com/issues/9192

Eric

On Tue, Oct 27, 2015 at 8:59 PM, FaHui Lin <fahui.lin@xxxxxxxxxx> wrote:
> Dear Ceph experts,
>
> I found something strange about the performance of my Ceph cluster: read-out
> is much slower than write-in.
>
> I have 3 machines running OSDs; each has 8 OSDs, one per RAID0 volume (each
> made up of 2 HDDs). The OSD journal and data are on the same device. All
> machines in the cluster have 10Gb networking.
>
> I tested both Ceph RBD and CephFS, with the client on another machine outside
> the cluster or on one of the OSD nodes (to rule out possible network issues),
> and so on. All of these gave similar results: write-in can almost reach the
> network limit, say 1200 MB/s, while read-out is only 350~450 MB/s.
>
> To try to figure this out, I did an extra test using CephFS.
>
> Version and config:
> [root@dl-disk1 ~]# ceph --version
> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
> [root@dl-disk1 ~]# cat /etc/ceph/ceph.conf
> [global]
> fsid = (hidden)
> mon_initial_members = dl-disk1, dl-disk2, dl-disk3
> mon_host = (hidden)
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
>
> OSD tree:
> # ceph osd tree
> ID  WEIGHT    TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 258.88000 root default
> -2  87.28000     host dl-disk1
>  0  10.90999         osd.0           up  1.00000          1.00000
>  1  10.90999         osd.1           up  1.00000          1.00000
>  2  10.90999         osd.2           up  1.00000          1.00000
>  3  10.90999         osd.3           up  1.00000          1.00000
>  4  10.90999         osd.4           up  1.00000          1.00000
>  5  10.90999         osd.5           up  1.00000          1.00000
>  6  10.90999         osd.6           up  1.00000          1.00000
>  7  10.90999         osd.7           up  1.00000          1.00000
> -3  87.28000     host dl-disk2
>  8  10.90999         osd.8           up  1.00000          1.00000
>  9  10.90999         osd.9           up  1.00000          1.00000
> 10  10.90999         osd.10          up  1.00000          1.00000
> 11  10.90999         osd.11          up  1.00000          1.00000
> 12  10.90999         osd.12          up  1.00000          1.00000
> 13  10.90999         osd.13          up  1.00000          1.00000
> 14  10.90999         osd.14          up  1.00000          1.00000
> 15  10.90999         osd.15          up  1.00000          1.00000
> -4  84.31999     host dl-disk3
> 16  10.53999         osd.16          up  1.00000          1.00000
> 17  10.53999         osd.17          up  1.00000          1.00000
> 18  10.53999         osd.18          up  1.00000          1.00000
> 19  10.53999         osd.19          up  1.00000          1.00000
> 20  10.53999         osd.20          up  1.00000          1.00000
> 21  10.53999         osd.21          up  1.00000          1.00000
> 22  10.53999         osd.22          up  1.00000          1.00000
> 23  10.53999         osd.23          up  1.00000          1.00000
>
> Pools and PGs (each pool has 128 PGs):
> # ceph osd lspools
> 0 rbd,2 fs_meta,3 fs_data0,4 fs_data1,
> # ceph pg dump pools
> dumped pools in format plain
> pg_stat  objects  mip  degr  misp  unf  bytes        log    disklog
> pool 0         0    0     0     0    0            0      0        0
> pool 2        20    0     0     0    0       356958    264      264
> pool 3      3264    0     0     0    0  16106127360  14657    14657
> pool 4         0    0     0     0    0            0      0        0
>
> To simplify the problem, I made a new CRUSH rule so that the CephFS data pool
> uses OSDs on only one machine (dl-disk1 here), with size = 1:
> # ceph osd crush rule dump osd_in_dl-disk1__ruleset
> {
>     "rule_id": 1,
>     "rule_name": "osd_in_dl-disk1__ruleset",
>     "ruleset": 1,
>     "type": 1,
>     "min_size": 1,
>     "max_size": 10,
>     "steps": [
>         {
>             "op": "take",
>             "item": -2,
>             "item_name": "dl-disk1"
>         },
>         {
>             "op": "chooseleaf_firstn",
>             "num": 0,
>             "type": "osd"
>         },
>         {
>             "op": "emit"
>         }
>     ]
> }
> # ceph osd pool get fs_data0 crush_ruleset
> crush_ruleset: 1
> # ceph osd pool get fs_data0 size
> size: 1
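For anyone who wants to reproduce this kind of single-host test pool: a rule like
the one dumped above can normally be created with the stock CLI instead of editing
the CRUSH map by hand. The commands below are an untested sketch based on the
hammer-era tooling; the rule and pool names are taken from the quoted output, and
the ruleset id 1 is assumed to match what the dump shows.

# create a rule whose leaves are individual OSDs under the host bucket dl-disk1
ceph osd crush rule create-simple osd_in_dl-disk1__ruleset dl-disk1 osd
# point the CephFS data pool at that rule and keep only a single copy
ceph osd pool set fs_data0 crush_ruleset 1
ceph osd pool set fs_data0 size 1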
> Here starts the test.
>
> On a client machine, I used dd to write a 4GB file to CephFS, and watched
> dstat on the OSD node dl-disk1:
> [root@client ~]# dd of=/mnt/cephfs/4Gfile if=/dev/zero bs=4096k count=1024
> 1024+0 records in
> 1024+0 records out
> 4294967296 bytes (4.3 GB) copied, 3.69993 s, 1.2 GB/s
>
> [root@dl-disk1 ~]# dstat ...
> ----total-cpu-usage---- ------memory-usage----- -net/total- --dsk/sdb-----dsk/sdc-----dsk/sdd-----dsk/sde-----dsk/sdf-----dsk/sdg-----dsk/sdh-----dsk/sdi--
> usr sys idl wai hiq siq | used buff cach free | recv send | read writ : read writ : read writ : read writ : read writ : read writ : read writ : read writ
> 0 0 100 0 0 0 | 3461M 67.2M 15.1G 44.3G | 19k 20k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3461M 67.2M 15.1G 44.3G | 32k 32k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 8 18 74 0 0 0 | 3364M 67.2M 11.1G 48.4G | 391k 391k | 0 2712k : 0 1096k : 0 556k : 0 1084k : 0 1200k : 0 1196k : 0 688k : 0 1252k
> 0 0 100 0 0 0 | 3364M 67.2M 11.1G 48.4G | 82k 127k | 0 0 : 0 0 : 0 0 : 0 928k : 0 540k : 0 0 : 0 0 : 0 0
> 8 16 72 3 0 1 | 3375M 67.2M 11.8G 47.7G | 718M 2068k | 0 120M : 0 172M : 0 76M : 0 220M : 0 188M : 16k 289M : 0 53M : 0 36M
> 6 13 77 4 0 1 | 3391M 67.2M 12.3G 47.1G | 553M 1517k | 0 160M : 0 176M : 0 88M : 0 208M : 0 225M : 0 213M : 0 8208k : 0 49M
> 6 13 77 3 0 1 | 3408M 67.2M 12.9G 46.6G | 544M 1272k | 0 212M : 0 8212k : 0 36M : 0 0 : 0 37M : 0 3852k : 0 497M : 0 337M
> 0 0 99 0 0 0 | 3407M 67.3M 12.9G 46.6G | 53k 114k | 0 36M : 0 37M : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3407M 67.3M 12.9G 46.6G | 68k 110k | 0 0 : 0 0 : 0 0 : 0 36M : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 99 0 0 0 | 3407M 67.3M 12.9G 46.6G | 38k 328k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 36M : 0 0
> 0 1 99 0 0 0 | 3406M 67.3M 12.9G 46.6G | 11M 132k | 0 0 : 0 0 : 0 8224k : 0 0 : 0 0 : 0 32M : 0 0 : 0 36M
> 14 24 52 8 0 2 | 3436M 67.3M 13.8G 45.6G | 1026M 2897k | 0 100M : 0 409M : 0 164M : 0 313M : 0 253M : 0 321M : 0 84M : 0 76M
> 14 24 34 27 0 1 | 3461M 67.3M 14.7G 44.7G | 990M 2565k | 0 354M : 0 72M : 0 0 : 0 164M : 0 313M : 0 188M : 0 308M : 0 333M
> 4 9 70 16 0 0 | 3474M 67.3M 15.1G 44.3G | 269M 646k | 0 324M : 0 0 : 0 0 : 0 36M : 0 0 : 0 0 : 0 349M : 0 172M
> 0 0 99 0 0 0 | 3474M 67.3M 15.1G 44.3G | 24k 315k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 37M : 0 0
> 0 0 99 0 0 0 | 3474M 67.4M 15.1G 44.3G | 38k 102k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 36M : 0 0 : 0 36M
> 0 0 99 0 0 0 | 3473M 67.4M 15.1G 44.3G | 22k 23k | 0 0 : 0 0 : 0 36M : 0 0 : 0 36M : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3473M 67.4M 15.1G 44.3G | 39k 40k | 0 304k : 0 16k : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3472M 67.4M 15.1G 44.3G | 28k 64k | 0 64M : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3471M 67.4M 15.1G 44.3G | 31k 94k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3472M 67.4M 15.1G 44.3G | 38k 39k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
>
> The throughput is 1.2 GB/s, which reaches the 10Gb network limit.
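A side note on the write figure before the read test below: dd from /dev/zero with
buffered I/O can be flattered by the client page cache and by the fact that OSD
writes are acknowledged once journaled, so 1.2 GB/s is a best case rather than
sustained disk throughput. To compare streaming writes and reads without the CephFS
client in the path at all, a rados bench run against the same data pool might be
informative. This is an untested sketch: the pool name comes from the quoted output,
and the 60-second duration and 16 concurrent ops are arbitrary example values.

# streaming writes; keep the objects so the read phase has something to fetch
rados bench -p fs_data0 60 write -t 16 --no-cleanup
# sequential reads of the objects written above
rados bench -p fs_data0 60 seq -t 16
# remove the benchmark objects afterwards
rados -p fs_data0 cleanup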
> Then, on the client machine, I used dd to read that file back from CephFS,
> writing it to /dev/zero (or /dev/null) to rule out the local HDD's I/O:
> [root@client ~]# dd if=/mnt/cephfs/4Gfile of=/dev/zero bs=4096k count=1024
> 1024+0 records in
> 1024+0 records out
> 4294967296 bytes (4.3 GB) copied, 8.85246 s, 485 MB/s
>
> [root@dl-disk1 ~]# dstat ...
> 0 0 100 0 0 0 | 3462M 67.4M 15.1G 44.3G | 36k 36k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3462M 67.4M 15.1G 44.3G | 22k 22k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3463M 67.4M 15.1G 44.3G | 49k 49k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 1 99 0 0 0 | 3464M 67.4M 15.1G 44.3G | 282k 111M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 2 5 93 0 0 0 | 3466M 67.4M 15.1G 44.3G | 1171k 535M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 2 5 93 0 0 0 | 3467M 67.4M 15.1G 44.3G | 1124k 535M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 2 4 94 0 0 0 | 3467M 67.4M 15.1G 44.3G | 1124k 535M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 2 4 94 0 0 0 | 3467M 67.4M 15.1G 44.3G | 1109k 527M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 2 4 93 0 0 0 | 3471M 67.4M 15.1G 44.3G | 1044k 504M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 2 4 94 0 0 0 | 3470M 67.4M 15.1G 44.3G | 1031k 504M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 2 5 93 0 0 0 | 3470M 67.4M 15.1G 44.3G | 1103k 527M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 2 4 93 0 0 0 | 3471M 67.5M 15.1G 44.3G | 1084k 504M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3470M 67.5M 15.1G 44.3G | 25k 24k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> ----total-cpu-usage---- ------memory-usage----- -net/total- --dsk/sdb-----dsk/sdc-----dsk/sdd-----dsk/sde-----dsk/sdf-----dsk/sdg-----dsk/sdh-----dsk/sdi--
> usr sys idl wai hiq siq | used buff cach free | recv send | read writ : read writ : read writ : read writ : read writ : read writ : read writ : read writ
> 0 0 100 0 0 0 | 3470M 67.5M 15.1G 44.3G | 43k 44k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3470M 67.5M 15.1G 44.3G | 22k 23k | 0 48k : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3469M 67.5M 15.1G 44.3G | 35k 38k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3469M 67.5M 15.1G 44.3G | 23k 85k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3469M 67.5M 15.1G 44.3G | 44k 44k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3469M 67.5M 15.1G 44.3G | 24k 25k | 0 12k : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3469M 67.5M 15.1G 44.3G | 45k 43k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3468M 67.5M 15.1G 44.3G | 17k 18k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
>
> The throughput here was only 400~500 MB/s.
> I noticed that there was NO disk I/O during the read-out, which means all the
> objects of the file were already cached in memory on the OSD node.
> Thus, the HDDs do NOT seem to be the cause of the lower throughput.
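That pattern (OSD node nearly idle, no disk reads, a single client topping out around
500 MB/s) often points to client-side readahead and per-request latency rather than to
the OSDs: one sequential reader only keeps a limited window of requests in flight, so
round-trip time caps the throughput even when everything is served from the OSD page
cache. If that is what is happening here, the knobs below may be worth experimenting
with. Treat this as an untested sketch: the option names are the ones I recall from
the hammer-era clients, and the monitor address and sizes are only example values.

# CephFS kernel client: mount with a larger readahead window (rasize is in bytes)
mount -t ceph dl-disk1:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=67108864

# kernel RBD: raise the block device readahead (value in KB)
echo 16384 > /sys/block/rbd0/queue/read_ahead_kb

# librbd (e.g. QEMU/KVM): readahead options in the [client] section of ceph.conf
#   rbd readahead max bytes = 4194304
#   rbd readahead disable after bytes = 0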
> I also tried the read-out using cat (in case dd does not trigger read-ahead in
> the filesystem), and got a similar result:
>
> [root@client ~]# time cat /mnt/cephfs/4Gfile > /dev/zero
>
> real    0m9.352s
> user    0m0.002s
> sys     0m4.147s
>
> [root@dl-disk1 ~]# dstat ...
> 0 0 100 0 0 0 | 3465M 67.5M 15.1G 44.3G | 23k 22k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3465M 67.5M 15.1G 44.3G | 17k 18k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3465M 67.5M 15.1G 44.3G | 37k 37k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 1 2 97 0 0 0 | 3466M 67.5M 15.1G 44.3G | 633k 280M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 2 4 94 0 0 0 | 3467M 67.5M 15.1G 44.3G | 1057k 498M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 2 4 94 0 0 0 | 3470M 67.5M 15.1G 44.3G | 1078k 498M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 2 4 94 0 0 0 | 3470M 67.5M 15.1G 44.3G | 996k 486M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 2 4 94 0 0 0 | 3469M 67.5M 15.1G 44.3G | 988k 489M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 2 4 94 0 0 0 | 3469M 67.5M 15.1G 44.3G | 1012k 489M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 2 4 94 0 0 0 | 3470M 67.5M 15.1G 44.3G | 1017k 497M | 0 0 : 0 8192B : 0 28k : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 2 4 94 0 0 0 | 3469M 67.5M 15.1G 44.3G | 1032k 498M | 0 0 : 0 0 : 0 0 : 0 8192B : 0 104k : 0 0 : 0 0 : 0 0
> ----total-cpu-usage---- ------memory-usage----- -net/total- --dsk/sdb-----dsk/sdc-----dsk/sdd-----dsk/sde-----dsk/sdf-----dsk/sdg-----dsk/sdh-----dsk/sdi--
> usr sys idl wai hiq siq | used buff cach free | recv send | read writ : read writ : read writ : read writ : read writ : read writ : read writ : read writ
> 2 4 94 0 0 0 | 3469M 67.5M 15.1G 44.3G | 1025k 496M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 40k : 0 80k : 0 0
> 0 1 99 0 0 0 | 3469M 67.5M 15.1G 44.3G | 127k 52M | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 120k
> 0 0 100 0 0 0 | 3469M 67.5M 15.1G 44.3G | 21k 21k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3469M 67.5M 15.1G 44.3G | 66k 66k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 100 0 0 0 | 3469M 67.5M 15.1G 44.3G | 35k 38k | 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
>
> The average throughput is 4 GB / 9.35 s = 438 MB/s. Still, this is unlikely to
> be an HDD issue.
>
> I'm sure the network can reach 10Gb in both directions (verified with iperf and
> other tests), and there is no other user process occupying bandwidth.
>
> Could you please help me find out the main reason for this issue? Thank you.
>
> Best Regards,
> FaHui

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com