Could you try with this kernel and bump the readahead on the RBD device up to at least 32MB? (A sketch of the readahead change is included after the quoted thread below.)

http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/ra-bring-back/

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Eric Eastman
> Sent: 28 October 2015 16:05
> To: FaHui Lin <fahui.lin@xxxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>; ?奇? 先生 <edward.wu@xxxxxxxxxx>
> Subject: Re: Read-out much slower than write-in on my ceph cluster
>
> On the RBD performance issue, you may want to look at:
> http://tracker.ceph.com/issues/9192
>
> Eric
>
> On Tue, Oct 27, 2015 at 8:59 PM, FaHui Lin <fahui.lin@xxxxxxxxxx> wrote:
> > Dear Ceph experts,
> >
> > I found something strange about the performance of my Ceph cluster:
> > read-out is much slower than write-in.
> >
> > I have 3 machines running OSDs; each has 8 OSDs running on 8 RAID0 arrays
> > (each made up of 2 HDDs). The OSD journal and data are on the same device.
> > All machines in my cluster have 10Gb networking.
> >
> > I used both Ceph RBD and CephFS, with the client on another machine outside
> > the cluster or on one of the OSD nodes (to rule out possible network issues),
> > and so on. All of these end up with similar results: write-in can almost reach
> > the network limit, say 1200 MB/s, while read-out is only 350~450 MB/s.
> >
> > Trying to figure this out, I did an extra test using CephFS:
> >
> > Version and config:
> > [root@dl-disk1 ~]# ceph --version
> > ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
> > [root@dl-disk1 ~]# cat /etc/ceph/ceph.conf
> > [global]
> > fsid = (hidden)
> > mon_initial_members = dl-disk1, dl-disk2, dl-disk3
> > mon_host = (hidden)
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > filestore_xattr_use_omap = true
> >
> > OSD tree:
> > # ceph osd tree
> > ID WEIGHT    TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> > -1 258.88000 root default
> > -2  87.28000     host dl-disk1
> >  0  10.90999         osd.0           up  1.00000          1.00000
> >  1  10.90999         osd.1           up  1.00000          1.00000
> >  2  10.90999         osd.2           up  1.00000          1.00000
> >  3  10.90999         osd.3           up  1.00000          1.00000
> >  4  10.90999         osd.4           up  1.00000          1.00000
> >  5  10.90999         osd.5           up  1.00000          1.00000
> >  6  10.90999         osd.6           up  1.00000          1.00000
> >  7  10.90999         osd.7           up  1.00000          1.00000
> > -3  87.28000     host dl-disk2
> >  8  10.90999         osd.8           up  1.00000          1.00000
> >  9  10.90999         osd.9           up  1.00000          1.00000
> > 10  10.90999         osd.10          up  1.00000          1.00000
> > 11  10.90999         osd.11          up  1.00000          1.00000
> > 12  10.90999         osd.12          up  1.00000          1.00000
> > 13  10.90999         osd.13          up  1.00000          1.00000
> > 14  10.90999         osd.14          up  1.00000          1.00000
> > 15  10.90999         osd.15          up  1.00000          1.00000
> > -4  84.31999     host dl-disk3
> > 16  10.53999         osd.16          up  1.00000          1.00000
> > 17  10.53999         osd.17          up  1.00000          1.00000
> > 18  10.53999         osd.18          up  1.00000          1.00000
> > 19  10.53999         osd.19          up  1.00000          1.00000
> > 20  10.53999         osd.20          up  1.00000          1.00000
> > 21  10.53999         osd.21          up  1.00000          1.00000
> > 22  10.53999         osd.22          up  1.00000          1.00000
> > 23  10.53999         osd.23          up  1.00000          1.00000
> >
> > Pools and PGs (each pool has 128 PGs):
> > # ceph osd lspools
> > 0 rbd,2 fs_meta,3 fs_data0,4 fs_data1,
> > # ceph pg dump pools
> > dumped pools in format plain
> > pg_stat objects mip degr misp unf bytes       log   disklog
> > pool 0  0       0   0    0    0   0           0     0
> > pool 2  20      0   0    0    0   356958      264   264
> > pool 3  3264    0   0    0    0   16106127360 14657 14657
> > pool 4  0       0   0    0    0   0           0     0
> >
> > To simplify the problem, I made a new CRUSH rule so that the CephFS data
> > pool uses OSDs on only one machine (dl-disk1 here), with size = 1.
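For reference, a rule and pool setup like the one dumped in the quoted output below could be produced with commands roughly along these lines (only a sketch, using Hammer-era syntax; the rule, bucket, and pool names are taken from the quoted output):

# ceph osd crush rule create-simple osd_in_dl-disk1__ruleset dl-disk1 osd
# ceph osd pool set fs_data0 crush_ruleset 1
# ceph osd pool set fs_data0 size 1

The first command builds a rule that takes the dl-disk1 host bucket and chooses individual OSDs beneath it; the other two point the data pool at that ruleset and drop its replica count to 1.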
> > # ceph osd crush rule dump osd_in_dl-disk1__ruleset
> > {
> >     "rule_id": 1,
> >     "rule_name": "osd_in_dl-disk1__ruleset",
> >     "ruleset": 1,
> >     "type": 1,
> >     "min_size": 1,
> >     "max_size": 10,
> >     "steps": [
> >         {
> >             "op": "take",
> >             "item": -2,
> >             "item_name": "dl-disk1"
> >         },
> >         {
> >             "op": "chooseleaf_firstn",
> >             "num": 0,
> >             "type": "osd"
> >         },
> >         {
> >             "op": "emit"
> >         }
> >     ]
> > }
> > # ceph osd pool get fs_data0 crush_ruleset
> > crush_ruleset: 1
> > # ceph osd pool get fs_data0 size
> > size: 1
> >
> > Here starts the test.
> > On a client machine, I used dd to write a 4GB file to CephFS, and
> > checked dstat on the OSD node dl-disk1:
> > [root@client ~]# dd of=/mnt/cephfs/4Gfile if=/dev/zero bs=4096k count=1024
> > 1024+0 records in
> > 1024+0 records out
> > 4294967296 bytes (4.3 GB) copied, 3.69993 s, 1.2 GB/s
> >
> > [root@dl-disk1 ~]# dstat ...
> > ----total-cpu-usage---- ------memory-usage----- -net/total- --dsk/sdb-----dsk/sdc-----dsk/sdd-----dsk/sde-----dsk/sdf-----dsk/sdg-----dsk/sdh-----dsk/sdi--
> > usr sys idl wai hiq siq| used buff cach free| recv send| read writ: read writ: read writ: read writ: read writ: read writ: read writ: read writ
> > 0 0 100 0 0 0|3461M 67.2M 15.1G 44.3G| 19k 20k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3461M 67.2M 15.1G 44.3G| 32k 32k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 8 18 74 0 0 0|3364M 67.2M 11.1G 48.4G| 391k 391k| 0 2712k: 0 1096k: 0 556k: 0 1084k: 0 1200k: 0 1196k: 0 688k: 0 1252k
> > 0 0 100 0 0 0|3364M 67.2M 11.1G 48.4G| 82k 127k| 0 0 : 0 0 : 0 0 : 0 928k: 0 540k: 0 0 : 0 0 : 0 0
> > 8 16 72 3 0 1|3375M 67.2M 11.8G 47.7G| 718M 2068k| 0 120M: 0 172M: 0 76M: 0 220M: 0 188M: 16k 289M: 0 53M: 0 36M
> > 6 13 77 4 0 1|3391M 67.2M 12.3G 47.1G| 553M 1517k| 0 160M: 0 176M: 0 88M: 0 208M: 0 225M: 0 213M: 0 8208k: 0 49M
> > 6 13 77 3 0 1|3408M 67.2M 12.9G 46.6G| 544M 1272k| 0 212M: 0 8212k: 0 36M: 0 0 : 0 37M: 0 3852k: 0 497M: 0 337M
> > 0 0 99 0 0 0|3407M 67.3M 12.9G 46.6G| 53k 114k| 0 36M: 0 37M: 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3407M 67.3M 12.9G 46.6G| 68k 110k| 0 0 : 0 0 : 0 0 : 0 36M: 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 99 0 0 0|3407M 67.3M 12.9G 46.6G| 38k 328k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 36M: 0 0
> > 0 1 99 0 0 0|3406M 67.3M 12.9G 46.6G| 11M 132k| 0 0 : 0 0 : 0 8224k: 0 0 : 0 0 : 0 32M: 0 0 : 0 36M
> > 14 24 52 8 0 2|3436M 67.3M 13.8G 45.6G|1026M 2897k| 0 100M: 0 409M: 0 164M: 0 313M: 0 253M: 0 321M: 0 84M: 0 76M
> > 14 24 34 27 0 1|3461M 67.3M 14.7G 44.7G| 990M 2565k| 0 354M: 0 72M: 0 0 : 0 164M: 0 313M: 0 188M: 0 308M: 0 333M
> > 4 9 70 16 0 0|3474M 67.3M 15.1G 44.3G| 269M 646k| 0 324M: 0 0 : 0 0 : 0 36M: 0 0 : 0 0 : 0 349M: 0 172M
> > 0 0 99 0 0 0|3474M 67.3M 15.1G 44.3G| 24k 315k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 37M: 0 0
> > 0 0 99 0 0 0|3474M 67.4M 15.1G 44.3G| 38k 102k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 36M: 0 0 : 0 36M
> > 0 0 99 0 0 0|3473M 67.4M 15.1G 44.3G| 22k 23k| 0 0 : 0 0 : 0 36M: 0 0 : 0 36M: 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3473M 67.4M 15.1G 44.3G| 39k 40k| 0 304k: 0 16k: 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3472M 67.4M 15.1G 44.3G| 28k 64k| 0 64M: 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3471M 67.4M 15.1G 44.3G| 31k 94k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3472M 67.4M 15.1G 44.3G| 38k 39k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> >
> > The throughput is 1.2 GB/s, able to reach the 10Gb network limit.
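The exact dstat invocation is elided in the quoted output above ("dstat ..."); a hypothetical invocation along these lines would produce per-disk columns like the ones shown, assuming the OSD data disks are sdb through sdi:

# dstat -c -m -n -d -D sdb,sdc,sdd,sde,sdf,sdg,sdh,sdi

Here -c, -m, and -n print the CPU, memory, and network totals, while -d together with -D restricts the disk read/write columns to the listed devices.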
> >
> > Then, on the client machine, I used dd to read that file back from
> > CephFS, redirecting the file to /dev/zero (or /dev/null) to rule out the
> > local HDD's I/O:
> > [root@client ~]# dd if=/mnt/cephfs/4Gfile of=/dev/zero bs=4096k count=1024
> > 1024+0 records in
> > 1024+0 records out
> > 4294967296 bytes (4.3 GB) copied, 8.85246 s, 485 MB/s
> >
> > [root@dl-disk1 ~]# dstat ...
> > 0 0 100 0 0 0|3462M 67.4M 15.1G 44.3G| 36k 36k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3462M 67.4M 15.1G 44.3G| 22k 22k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3463M 67.4M 15.1G 44.3G| 49k 49k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 1 99 0 0 0|3464M 67.4M 15.1G 44.3G| 282k 111M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 2 5 93 0 0 0|3466M 67.4M 15.1G 44.3G|1171k 535M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 2 5 93 0 0 0|3467M 67.4M 15.1G 44.3G|1124k 535M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 2 4 94 0 0 0|3467M 67.4M 15.1G 44.3G|1124k 535M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 2 4 94 0 0 0|3467M 67.4M 15.1G 44.3G|1109k 527M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 2 4 93 0 0 0|3471M 67.4M 15.1G 44.3G|1044k 504M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 2 4 94 0 0 0|3470M 67.4M 15.1G 44.3G|1031k 504M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 2 5 93 0 0 0|3470M 67.4M 15.1G 44.3G|1103k 527M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 2 4 93 0 0 0|3471M 67.5M 15.1G 44.3G|1084k 504M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3470M 67.5M 15.1G 44.3G| 25k 24k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > ----total-cpu-usage---- ------memory-usage----- -net/total- --dsk/sdb-----dsk/sdc-----dsk/sdd-----dsk/sde-----dsk/sdf-----dsk/sdg-----dsk/sdh-----dsk/sdi--
> > usr sys idl wai hiq siq| used buff cach free| recv send| read writ: read writ: read writ: read writ: read writ: read writ: read writ: read writ
> > 0 0 100 0 0 0|3470M 67.5M 15.1G 44.3G| 43k 44k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3470M 67.5M 15.1G 44.3G| 22k 23k| 0 48k: 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3469M 67.5M 15.1G 44.3G| 35k 38k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3469M 67.5M 15.1G 44.3G| 23k 85k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3469M 67.5M 15.1G 44.3G| 44k 44k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3469M 67.5M 15.1G 44.3G| 24k 25k| 0 12k: 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3469M 67.5M 15.1G 44.3G| 45k 43k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3468M 67.5M 15.1G 44.3G| 17k 18k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> >
> >
> > The throughput here was only 400~500 MB/s.
> > I noticed that there was NO disk I/O during the read-out, which means
> > all the objects of the file were already cached in memory on the OSD node.
> > Thus, the HDDs do NOT seem to cause the lower throughput.
> >
> > I also tried reading out using cat (in case dd does not use read-ahead in
> > the file system), and ended up getting a similar result:
> >
> > [root@client ~]# time cat /mnt/cephfs/4Gfile > /dev/zero
> >
> > real 0m9.352s
> > user 0m0.002s
> > sys 0m4.147s
> >
> >
> > [root@dl-disk1 ~]# dstat ...
> > 0 0 100 0 0 0|3465M 67.5M 15.1G 44.3G| 23k 22k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3465M 67.5M 15.1G 44.3G| 17k 18k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3465M 67.5M 15.1G 44.3G| 37k 37k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 1 2 97 0 0 0|3466M 67.5M 15.1G 44.3G| 633k 280M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 2 4 94 0 0 0|3467M 67.5M 15.1G 44.3G|1057k 498M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 2 4 94 0 0 0|3470M 67.5M 15.1G 44.3G|1078k 498M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 2 4 94 0 0 0|3470M 67.5M 15.1G 44.3G| 996k 486M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 2 4 94 0 0 0|3469M 67.5M 15.1G 44.3G| 988k 489M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 2 4 94 0 0 0|3469M 67.5M 15.1G 44.3G|1012k 489M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 2 4 94 0 0 0|3470M 67.5M 15.1G 44.3G|1017k 497M| 0 0 : 0 8192B: 0 28k: 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 2 4 94 0 0 0|3469M 67.5M 15.1G 44.3G|1032k 498M| 0 0 : 0 0 : 0 0 : 0 8192B: 0 104k: 0 0 : 0 0 : 0 0
> > ----total-cpu-usage---- ------memory-usage----- -net/total- --dsk/sdb-----dsk/sdc-----dsk/sdd-----dsk/sde-----dsk/sdf-----dsk/sdg-----dsk/sdh-----dsk/sdi--
> > usr sys idl wai hiq siq| used buff cach free| recv send| read writ: read writ: read writ: read writ: read writ: read writ: read writ: read writ
> > 2 4 94 0 0 0|3469M 67.5M 15.1G 44.3G|1025k 496M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 40k: 0 80k: 0 0
> > 0 1 99 0 0 0|3469M 67.5M 15.1G 44.3G| 127k 52M| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 120k
> > 0 0 100 0 0 0|3469M 67.5M 15.1G 44.3G| 21k 21k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3469M 67.5M 15.1G 44.3G| 66k 66k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> > 0 0 100 0 0 0|3469M 67.5M 15.1G 44.3G| 35k 38k| 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> >
> >
> > The average throughput is 4GB / 9.35s = 438 MB/s. Still, this is unlikely
> > to be an HDD issue.
> >
> > I'm sure that the network can reach 10Gb in both directions (verified via
> > iperf and other tests), and there is no other user process occupying bandwidth.
> >
> > Could you please help me find out the main reason for this issue?
> > Thank you.
> >
> > Best Regards,
> > FaHui
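Coming back to the readahead suggestion at the top of this reply, here is a minimal sketch of bumping the readahead on an RBD device to 32MB (the device name rbd0 is only an example; substitute the device you actually map):

# echo 32768 > /sys/block/rbd0/queue/read_ahead_kb

or equivalently with blockdev, which takes the value in 512-byte sectors:

# blockdev --setra 65536 /dev/rbd0

This only changes the kernel block-device readahead and does not persist across reboots or remaps, so it would need to be reapplied after the device is mapped again.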
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com