Re: How to increase the size of requests written to a ceph image

First, a general comment: local RAID will be faster than Ceph for a single-threaded (queue depth = 1) I/O test. A single-threaded Ceph client will at best see single-disk speed for reads, and writes 4-6 times slower than a single disk. Not to mention that the latency of local disks will be much better. Where Ceph shines is when you have many concurrent I/Os: it scales, whereas RAID speed per client decreases as you add more clients.

Having said that, I would recommend running rados bench and rbd bench-write, measuring 4k IOPS at 1 and 32 threads, to get a better idea of how your cluster performs:

ceph osd pool create testpool 256 256
rados bench -p testpool -b 4096 30 write -t 1
rados bench -p testpool -b 4096 30 write -t 32
ceph osd pool delete testpool testpool --yes-i-really-really-mean-it

rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern rand --rbd_cache=false
rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern rand --rbd_cache=false

I think the request-size difference you see is because, with local disks, the I/O scheduler has more I/Os to re-group and so has a better chance of generating larger requests. Depending on your kernel, the I/O scheduler may be different for rbd (blk-mq) vs sdX (cfq), but again I would think the request size is a result, not a cause.
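To see which scheduler is in play on each system, you can read it out of sysfs; a quick sketch (device names will differ on your hosts, and rbd devices on blk-mq kernels typically report "none" or "mq-deadline" rather than "cfq"):

```shell
# Print the active I/O scheduler for every visible block device.
# The scheduler name shown in brackets is the one currently active.
for f in /sys/block/*/queue/scheduler; do
    [ -e "$f" ] || continue   # skip if no block devices are visible
    dev=$(basename "$(dirname "$(dirname "$f")")")
    printf '%s: %s\n' "$dev" "$(cat "$f")"
done
```

Comparing the output for your rbd device against the sdX devices backing the local RAID would confirm whether the two systems are merging requests under different schedulers.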

Maged

On 2017-10-17 23:12, Russell Glaue wrote:

I am running ceph jewel on 5 nodes with SSD OSDs.
I have an LVM image on a local RAID of spinning disks.
I have an RBD image in a pool of SSD disks.
Both disks are used to run an almost identical CentOS 7 system.
Both systems were installed with the same kickstart, though the disk partitioning is different.
 
I want to make writes on the ceph image faster. For example, lots of writes to MySQL (via MySQL replication) on a ceph SSD image are about 10x slower than on a spindle RAID disk image. The MySQL server on the ceph rbd image has a hard time keeping up with replication.
 
So I wanted to test writes on these two systems.
I have a 10GB compressed (gzip) file on both servers.
I simply gunzip the file on both systems while running iostat.
 
The primary difference I see in the results is the average size of the requests issued to the disk.
CentOS7-lvm-raid-sata writes a lot faster to disk, and its request size is about 40x larger, while the number of writes per second is about the same.
This makes me want to conclude that the smaller request size on the CentOS7-ceph-rbd-ssd system is the cause of it being slow.
 
 
How can I make the request size larger for ceph rbd images, so I can increase the write throughput?
Would this be related to having jumbo frames enabled in my ceph storage network?
 
 
Here is a sample of the results:
 
[CentOS7-lvm-raid-sata]
$ gunzip large10gFile.gz &
$ iostat -x vg_root-lv_var -d 5 -m -N
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
...
vg_root-lv_var     0.00     0.00   30.60  452.20    13.60   222.15  1000.04     8.69   14.05    0.99   14.93   2.07 100.04
vg_root-lv_var     0.00     0.00   88.20  182.00    39.20    89.43   974.95     4.65    9.82    0.99   14.10   3.70 100.00
vg_root-lv_var     0.00     0.00   75.45  278.24    33.53   136.70   985.73     4.36   33.26    1.34   41.91   0.59  20.84
vg_root-lv_var     0.00     0.00  111.60  181.80    49.60    89.34   969.84     2.60    8.87    0.81   13.81   0.13   3.90
vg_root-lv_var     0.00     0.00   68.40  109.60    30.40    53.63   966.87     1.51    8.46    0.84   13.22   0.80  14.16
...
 
[CentOS7-ceph-rbd-ssd]
$ gunzip large10gFile.gz &
$ iostat -x vg_root-lv_data -d 5 -m -N
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
...
vg_root-lv_data     0.00     0.00   46.40  167.80     0.88     1.46    22.36     1.23    5.66    2.47    6.54   4.52  96.82
vg_root-lv_data     0.00     0.00   16.60   55.20     0.36     0.14    14.44     0.99   13.91    9.12   15.36  13.71  98.46
vg_root-lv_data     0.00     0.00   69.00  173.80     1.34     1.32    22.48     1.25    5.19    3.77    5.75   3.94  95.68
vg_root-lv_data     0.00     0.00   74.40  293.40     1.37     1.47    15.83     1.22    3.31    2.06    3.63   2.54  93.26
vg_root-lv_data     0.00     0.00   90.80  359.00     1.96     3.41    24.45     1.63    3.63    1.94    4.05   2.10  94.38
...
 
[iostat key]
w/s == The number (after merges) of write requests completed per second for the device.
wMB/s == The number of megabytes written to the device per second.
avgrq-sz == The average size (in 512-byte sectors) of the requests that were issued to the device.
avgqu-sz == The average queue length of the requests that were issued to the device.
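The ~40x difference can be checked directly from the sample rows above. Since avgrq-sz is reported in 512-byte sectors, dividing total throughput by total requests per second should reproduce it; a quick sketch using the first sample row from each system:

```python
SECTOR = 512  # iostat reports avgrq-sz in 512-byte sectors

def avg_request_bytes(rmb_s, wmb_s, r_s, w_s):
    """Average request size in bytes: total throughput / total requests per second."""
    return (rmb_s + wmb_s) * 1024 * 1024 / (r_s + w_s)

# First sample row of each iostat capture above
lvm_raid = avg_request_bytes(13.60, 222.15, 30.60, 452.20)  # avgrq-sz ~1000 sectors
ceph_rbd = avg_request_bytes(0.88, 1.46, 46.40, 167.80)     # avgrq-sz ~22 sectors

print(f"LVM/RAID: {lvm_raid:.0f} B/req (~{lvm_raid / SECTOR:.0f} sectors)")
print(f"Ceph RBD: {ceph_rbd:.0f} B/req (~{ceph_rbd / SECTOR:.0f} sectors)")
print(f"ratio: {lvm_raid / ceph_rbd:.0f}x")
```

This works out to roughly 500 KB per request on the RAID volume versus about 11 KB on the RBD volume, consistent with the avgrq-sz columns and the "about 40x" observation.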
 
 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 

 
