Re: How to increase the size of requests written to a ceph image

Russell Glaue <rglaue@xxxxxxxx> · Wed, 18 Oct 2017 11:13:10 -0500

In my previous post, in one of my points I was wondering if the request size would increase if I enabled jumbo packets. currently it is disabled.

@jdillama: The qemu settings for both these two guest machines, with RAID/LVM and Ceph/rbd images, are the same. I am not thinking that changing the qemu settings of "min_io_size=<limited to 16bits>,opt_io_size=<RBD image object size>" will directly address the issue.

@mmokhtar: Ok. So you suggest the request size is the result of the problem and not the cause of the problem. meaning I should go after a different issue.
I have been trying to get write speeds up to what people on this mail list are discussing.
It seems that for our configuration, as it matches others, we should be getting about 70MB/s write speed.
But we are not getting that.
Single writes to disk are lucky to get 5MB/s to 6MB/s, but are typically 1MB/s to 2MB/s.
Monitoring the entire Ceph cluster (using http://cephdash.crapworks.de/), I have seen very rare momentary spikes up to 30MB/s.

My storage network is connected via a 10Gb switch
I have 4 storage servers with a LSI Logic MegaRAID SAS 2208 controller
Each storage server has 9 1TB SSD drives, each drive as 1 osd (no RAID)
Each drive is one LVM group, with two volumes - one volume for the osd, one volume for the journal
Each osd is formatted with xfs
The crush map is simple: default->rack->[host[1..4]->osd] with an evenly distributed weight
The redundancy is triple replication

While I have read comments that having the osd and journal on the same disk decreases write speed, I have also read that once past 8 OSDs per node this is the recommended configuration, however this is also the reason why SSD drives are used exclusively for OSDs in the storage nodes.
None-the-less, I was still expecting write speeds to be above 30MB/s, not below 6MB/s.
Even at 12x slower than the RAID, using my previously posted iostat data set, I should be seeing write speeds that average 10MB/s, not 2MB/s.

In regards to the rados benchmark tests you asked me to run, here is the output:

[centos7]# rados bench -p scbench -b 4096 30 write -t 1
Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_hamms.sys.cu.cait.org_85049
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1       1       201       200   0.78356   0.78125  0.00522307  0.00496574
    2       1       469       468  0.915303   1.04688  0.00437497  0.00426141
    3       1       741       740  0.964371    1.0625  0.00512853   0.0040434
    4       1       888       887  0.866739  0.574219  0.00307699  0.00450177
    5       1      1147      1146  0.895725   1.01172  0.00376454   0.0043559
    6       1      1325      1324  0.862293  0.695312  0.00459443    0.004525
    7       1      1494      1493   0.83339  0.660156  0.00461002  0.00458452
    8       1      1736      1735  0.847369  0.945312  0.00253971  0.00460458
    9       1      1998      1997  0.866922   1.02344  0.00236573  0.00450172
   10       1      2260      2259  0.882563   1.02344  0.00262179  0.00442152
   11       1      2526      2525  0.896775   1.03906  0.00336914  0.00435092
   12       1      2760      2759  0.898203  0.914062  0.00351827  0.00434491
   13       1      3016      3015  0.906025         1  0.00335703  0.00430691
   14       1      3257      3256  0.908545  0.941406  0.00332344  0.00429495
   15       1      3490      3489  0.908644  0.910156  0.00318815  0.00426387
   16       1      3728      3727  0.909952  0.929688   0.0032881  0.00428895
   17       1      3986      3985  0.915703   1.00781  0.00274809   0.0042614
   18       1      4250      4249  0.922116   1.03125  0.00287411  0.00423214
   19       1      4505      4504  0.926003  0.996094  0.00375435  0.00421442
2017-10-18 10:56:31.267173 min lat: 0.00181259 max lat: 0.270553 avg lat: 0.00420118
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20       1      4757      4756  0.928915  0.984375  0.00463972  0.00420118
   21       1      5009      5008   0.93155  0.984375  0.00360065  0.00418937
   22       1      5235      5234  0.929329  0.882812  0.00626214    0.004199
   23       1      5500      5499  0.933925   1.03516  0.00466584  0.00417836
   24       1      5708      5707  0.928861    0.8125  0.00285727  0.00420146
   25       0      5964      5964  0.931858   1.00391  0.00417383   0.0041881
   26       1      6216      6215  0.933722  0.980469   0.0041009  0.00417915
   27       1      6481      6480  0.937474   1.03516  0.00307484  0.00416118
   28       1      6745      6744  0.940819   1.03125  0.00266329  0.00414777
   29       1      7003      7002  0.943124   1.00781  0.00305905  0.00413758
   30       1      7271      7270  0.946578   1.04688  0.00391017  0.00412238
Total time run:         30.006060
Total writes made:      7272
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     0.946684
Stddev Bandwidth:       0.123762
Max bandwidth (MB/sec): 1.0625
Min bandwidth (MB/sec): 0.574219
Average IOPS:           242
Stddev IOPS:            31
Max IOPS:               272
Min IOPS:               147
Average Latency(s):     0.00412247
Stddev Latency(s):      0.00648437
Max latency(s):         0.270553
Min latency(s):         0.00175318
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time :29.069423

[centos7]# rados bench -p scbench -b 4096 30 write -t 32
Maintaining 32 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_hamms.sys.cu.cait.org_86076
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      32      3013      2981   11.6438   11.6445  0.00247906  0.00572026
    2      32      5349      5317   10.3834     9.125  0.00246662  0.00932016
    3      32      5707      5675    7.3883   1.39844  0.00389774   0.0156726
    4      32      5895      5863   5.72481  0.734375     1.13137   0.0167946
    5      32      6869      6837   5.34068   3.80469   0.0027652   0.0226577
    6      32      8901      8869   5.77306    7.9375   0.0053211   0.0216259
    7      32     10800     10768   6.00785   7.41797  0.00358187   0.0207418
    8      32     11825     11793   5.75728   4.00391  0.00217575   0.0215494
    9      32     12941     12909    5.6019   4.35938  0.00278512   0.0220567
   10      32     13317     13285   5.18849   1.46875   0.0034973   0.0240665
   11      32     16189     16157   5.73653   11.2188  0.00255841   0.0212708
   12      32     16749     16717   5.44077    2.1875  0.00330334   0.0215915
   13      32     16756     16724   5.02436 0.0273438  0.00338994    0.021849
   14      32     17908     17876   4.98686       4.5  0.00402598   0.0244568
   15      32     17936     17904   4.66171  0.109375  0.00375799   0.0245545
   16      32     18279     18247   4.45409   1.33984  0.00483873   0.0267929
   17      32     18372     18340   4.21346  0.363281  0.00505187   0.0275887
   18      32     19403     19371   4.20309   4.02734  0.00545154    0.029348
   19      31     19845     19814   4.07295   1.73047  0.00254726   0.0306775
2017-10-18 10:57:58.160536 min lat: 0.0015005 max lat: 2.27707 avg lat: 0.0307559
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20      31     20401     20370   3.97788   2.17188  0.00307238   0.0307559
   21      32     21338     21306   3.96254   3.65625  0.00464563   0.0312288
   22      32     23057     23025    4.0876   6.71484  0.00296295   0.0299267
   23      32     23057     23025   3.90988         0           -   0.0299267
   24      32     23803     23771   3.86837   1.45703  0.00301471   0.0312804
   25      32     24112     24080   3.76191   1.20703  0.00191063   0.0331462
   26      31     25303     25272   3.79629   4.65625  0.00794399   0.0329129
   27      32     28803     28771   4.16183    13.668   0.0109817   0.0297469
   28      32     29592     29560   4.12325   3.08203  0.00188185   0.0301911
   29      32     30595     30563   4.11616   3.91797  0.00379099   0.0296794
   30      32     31031     30999   4.03572   1.70312  0.00283347   0.0302411
Total time run:         30.822350
Total writes made:      31032
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     3.93282
Stddev Bandwidth:       3.66265
Max bandwidth (MB/sec): 13.668
Min bandwidth (MB/sec): 0
Average IOPS:           1006
Stddev IOPS:            937
Max IOPS:               3499
Min IOPS:               0
Average Latency(s):     0.0317779
Stddev Latency(s):      0.164076
Max latency(s):         2.27707
Min latency(s):         0.0013848
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time :20.166559

On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:

First a general comment: local RAID will be faster than Ceph for a single threaded (queue depth=1) io operation test. A single thread Ceph client will see at best same disk speed for reads and for writes 4-6 times slower than single disk. Not to mention the latency of local disks will much better. Where Ceph shines is when you have many concurrent ios, it scales whereas RAID will decrease speed per client as you add more.
Having said that, i would recommend running rados/rbd bench-write and measure 4k iops at 1 and 32 threads to get a better idea of how your cluster performs:
ceph osd pool create testpool 256 256 
rados bench -p testpool -b 4096 30 write -t 1
rados bench -p testpool -b 4096 30 write -t 32 
ceph osd pool delete testpool testpool --yes-i-really-really-mean-it
rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern rand --rbd_cache=false
rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern rand --rbd_cache=false
I think the request size difference you see is due to the io scheduler in the case of local disks having more ios to re-group so has a better chance in generating larger requests. Depending on your kernel, the io scheduler may be different for rbd (blq-mq) vs sdx (cfq) but again i would think the request size is a result not a cause.

Maged
On 2017-10-17 23:12, Russell Glaue wrote:

I am running ceph jewel on 5 nodes with SSD OSDs.
I have an LVM image on a local RAID of spinning disks.
I have an RBD image on in a pool of SSD disks.

Both disks are used to run an almost identical CentOS 7 system.
Both systems were installed with the same kickstart, though the disk partitioning is different.

I want to make writes on the the ceph image faster. For example, lots of writes to MySQL (via MySQL replication) on a ceph SSD image are about 10x slower than on a spindle RAID disk image. The MySQL server on ceph rbd image has a hard time keeping up in replication.

So I wanted to test writes on these two systems
I have a 10GB compressed (gzip) file on both servers.
I simply gunzip the file on both systems, while running iostat.

The primary difference I see in the results is the average size of the request to the disk.
CentOS7-lvm-raid-sata writes a lot faster to disk, and the size of the request is about 40x, but the number of writes per second is about the same
This makes me want to conclude that the smaller size of the request for CentOS7-ceph-rbd-ssd system is the cause of it being slow.

How can I make the size of the request larger for ceph rbd images, so I can increase the write throughput?
Would this be related to having jumbo packets enabled in my ceph storage network?

Here is a sample of the results:

[CentOS7-lvm-raid-sata]
$ gunzip large10gFile.gz &
$ iostat -x vg_root-lv_var -d 5 -m -N
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
...
vg_root-lv_var     0.00     0.00   30.60  452.20    13.60   222.15  1000.04     8.69   14.05    0.99   14.93   2.07 100.04
vg_root-lv_var     0.00     0.00   88.20  182.00    39.20    89.43   974.95     4.65    9.82    0.99   14.10   3.70 100.00
vg_root-lv_var     0.00     0.00   75.45  278.24    33.53   136.70   985.73     4.36   33.26    1.34   41.91   0.59  20.84
vg_root-lv_var     0.00     0.00  111.60  181.80    49.60    89.34   969.84     2.60    8.87    0.81   13.81   0.13   3.90
vg_root-lv_var     0.00     0.00   68.40  109.60    30.40    53.63   966.87     1.51    8.46    0.84   13.22   0.80  14.16
...

[CentOS7-ceph-rbd-ssd]
$ gunzip large10gFile.gz &
$ iostat -x vg_root-lv_data -d 5 -m -N
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
...
vg_root-lv_data     0.00     0.00   46.40  167.80     0.88     1.46    22.36     1.23    5.66    2.47    6.54   4.52  96.82
vg_root-lv_data     0.00     0.00   16.60   55.20     0.36     0.14    14.44     0.99   13.91    9.12   15.36  13.71  98.46
vg_root-lv_data     0.00     0.00   69.00  173.80     1.34     1.32    22.48     1.25    5.19    3.77    5.75   3.94  95.68
vg_root-lv_data     0.00     0.00   74.40  293.40     1.37     1.47    15.83     1.22    3.31    2.06    3.63   2.54  93.26
vg_root-lv_data     0.00     0.00   90.80  359.00     1.96     3.41    24.45     1.63    3.63    1.94    4.05   2.10  94.38
...

[iostat key]
w/s == The number (after merges) of write requests completed per second for the device.
wMB/s == The number of sectors (kilobytes, megabytes) written to the device per second.
avgrq-sz == The average size (in kilobytes) of the requests that were issued to the device.
avgqu-sz == The average queue length of the requests that were issued to the device.

_______________________________________________
 ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com