Re: Optimizing terrible RBD performance

Hi,

Your RBD bench and RADOS bench runs use a 4MB IO request size by default, while your FIO job is configured for a 4KB IO request size.

If you want to compare apples to apples (bandwidth), you need to change the FIO IO request size to 4194304 bytes. Also, you tested a sequential workload with RADOS bench but a random one with fio.

Make sure you align all parameters so that you obtain results you can actually compare.
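
For example, a fio job roughly matching your RADOS bench write workload could look like the sketch below (the pool name is taken from your rados bench command; the image name, client name and file name are placeholders you would need to adjust):

[global]
ioengine=rbd
clientname=admin
pool=testbench
rbdname=testimage

[seq-write-4m]
rw=write
bs=4M
iodepth=16
time_based=1
runtime=60

Saved as e.g. seq-write-4m.fio and run with "fio seq-write-4m.fio", it issues 4MB sequential writes with 16 in flight, mirroring the 16 concurrent writes that rados bench maintains.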

One other note: what block size did you specify with your dd command?

By default the block size is 512 bytes, so even smaller than the 4KB you used for FIO and miles away from the 4MB you used for RADOS bench. Be mindful that 5MB/s for your dd with bs=512 is about 10,000 IOPS.
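
For instance, a dd run along these lines (reusing the /dev/rbd0 device from your test; the count is arbitrary) writes 4MB blocks and bypasses the page cache, which is much closer to the rados bench request size:

dd if=/dev/zero of=/dev/rbd0 bs=4M count=1000 oflag=direct

It is still a single sequential stream at queue depth 1, so it will not match the aggregate rados bench numbers, but the request size is at least comparable.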

JC

> On Oct 4, 2019, at 08:28, Petr Bena <petr@bena.rocks> wrote:
> 
> Hello,
> 
> I tried to use FIO on an RBD device I just created and write performance is really terrible (around 1.5MB/s):
> 
> [root@ceph3 tmp]# fio test.fio
> rbd_iodepth32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=32
> fio-3.7
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=1628KiB/s][r=0,w=407 IOPS][eta 00m:00s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=115425: Fri Oct  4 17:25:24 2019
>   write: IOPS=384, BW=1538KiB/s (1574kB/s)(39.1MiB/26016msec)
>     slat (nsec): min=1452, max=591931, avg=14498.83, stdev=17295.97
>     clat (usec): min=1795, max=793172, avg=83218.39, stdev=83485.65
>      lat (usec): min=1810, max=793201, avg=83232.89, stdev=83485.19
>     clat percentiles (msec):
>      |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[   12],
>      | 30.00th=[   21], 40.00th=[   36], 50.00th=[   61], 60.00th=[   89],
>      | 70.00th=[  116], 80.00th=[  146], 90.00th=[  190], 95.00th=[  218],
>      | 99.00th=[  380], 99.50th=[  430], 99.90th=[  625], 99.95th=[  768],
>      | 99.99th=[  793]
>    bw (  KiB/s): min=  520, max= 4648, per=99.77%, avg=1533.40, stdev=754.35, samples=52
>    iops        : min=  130, max= 1162, avg=383.33, stdev=188.61, samples=52
>   lat (msec)   : 2=0.08%, 4=4.77%, 10=13.56%, 20=11.66%, 50=16.40%
>   lat (msec)   : 100=17.66%, 250=32.53%, 500=3.05%, 750=0.21%, 1000=0.08%
>   cpu          : usr=0.57%, sys=0.52%, ctx=3976, majf=0, minf=8489
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.7%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,10000,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=32
> 
> Run status group 0 (all jobs):
>   WRITE: bw=1538KiB/s (1574kB/s), 1538KiB/s-1538KiB/s (1574kB/s-1574kB/s), io=39.1MiB (40.0MB), run=26016-26016msec
> 
> Disk stats (read/write):
>     dm-6: ios=0/2, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=20/368, aggrmerge=0/195, aggrticks=105/6248, aggrin_queue=6353, aggrutil=9.07%
>   xvda: ios=20/368, merge=0/195, ticks=105/6248, in_queue=6353, util=9.07%
> 
> 
> Incomparably worse than the RADOS bench results.
> 
> On 04/10/2019 17:15, Alexandre DERUMIER wrote:
>> Hi,
>> 
>>>> dd if=/dev/zero of=/dev/rbd0 writes at 5MB/s -
>> You are testing with a single thread and iodepth=1, sequentially, here.
>> So only 1 disk is written at a time, and you also have network latency.
>> 
>> rados bench does 16 concurrent writes.
>> 
>> 
>> Try to test with fio, for example with a bigger iodepth, small block / big block, sequential / random.
>> 
>> 
>> 
>> ----- Original Message -----
>> From: "Petr Bena" <petr@bena.rocks>
>> To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
>> Sent: Friday, October 4, 2019 17:06:48
>> Subject: Optimizing terrible RBD performance
>> 
>> Hello,
>> 
>> If this is too long for you, there is a TL;DR section at the bottom.
>> 
>> I created a Ceph cluster made of 3 SuperMicro servers, each with 2 OSDs
>> (WD RED spinning drives), and I would like to optimize RBD performance,
>> which I believe is held back by some wrong Ceph configuration, because
>> from my observation all resources (CPU, RAM, network, disks) are
>> basically unused / idling even when I put load on the RBD.
>> 
>> Each drive should do about 50MB/s read / write, and when I run the RADOS
>> benchmark I see values that are somewhat acceptable. The interesting part
>> is that during the RADOS benchmark I can see all disks reading / writing
>> at their limits, heavy network utilization and even some CPU utilization.
>> On the other hand, when I put any load on the RBD device, performance is
>> terrible: reading is very slow (20MB/s), writing as well (5 - 20MB/s),
>> and dd if=/dev/zero of=/dev/rbd0 writes at 5MB/s. The weirdest part is
>> that resources are almost unused - no CPU usage, no network traffic,
>> minimal disk activity.
>> 
>> It looks to me as if Ceph wasn't even trying to perform as long as the
>> access is via RBD. Has anyone ever seen this kind of issue? Is there any
>> way to track down why it is so slow? Here are some outputs:
>> 
>> [root@ceph1 cephadm]# ceph --version
>> ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus
>> (stable)
>> [root@ceph1 cephadm]# ceph health
>> HEALTH_OK
>> 
>> I would expect the write speed to be at least the 50MB/s I get when
>> writing to the disks directly; rados bench does reach this speed
>> (sometimes even more):
>> 
>> [root@ceph1 cephadm]# rados bench -p testbench 10 write --no-cleanup
>> hints = 1
>> Maintaining 16 concurrent writes of 4194304 bytes to objects of size
>> 4194304 for up to 10 seconds or 0 objects
>> Object prefix: benchmark_data_ceph1.lan.insw.cz_60873
>>   sec  Cur ops  started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>>     0        0        0         0         0         0            -           0
>>     1       16       22         6   23.9966        24     0.966194    0.565671
>>     2       16       37        21   41.9945        60      1.86665    0.720606
>>     3       16       54        38   50.6597        68      1.07856    0.797677
>>     4       16       70        54   53.9928        64      1.58914     0.86644
>>     5       16       83        67   53.5924        52     0.208535    0.884525
>>     6       16       97        81   53.9923        56      2.22661    0.932738
>>     7       16      111        95   54.2781        56       1.0294    0.964574
>>     8       16      133       117   58.4921        88     0.883543     1.03648
>>     9       16      143       127   56.4369        40     0.352169     1.00382
>>    10       16      154       138   55.1916        44     0.227044     1.04071
>> 
>> Read speed is even higher as it's probably reading from multiple devices
>> at once:
>> 
>> [root@ceph1 cephadm]# rados bench -p testbench 100 seq
>> hints = 1
>>   sec  Cur ops  started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>>     0        0        0         0         0         0            -           0
>>     1       16       96        80   319.934       320     0.811192    0.174081
>>     2       13      161       148   295.952       272     0.606672    0.181417
>> 
>> 
>> Running rbd bench shows writes at 50MB/s (which is OK) and reads at
>> 20MB/s (not so OK), but the REAL performance is much worse - when I
>> actually access the block device and try to write or read anything, it's
>> sometimes extremely low, as in only 5MB/s or 20MB/s.
>> 
>> Why is that? What can I do to debug / trace / optimize this issue? I
>> don't know if there is any point in upgrading the hardware if, according
>> to monitoring, the current HW is basically not being utilized at all.
>> 
>> 
>> TL;DR:
>> 
>> I created a Ceph cluster from 6 OSDs (dedicated 1G network, 6 4TB
>> spinning drives). The rados benchmark shows acceptable performance, but
>> RBD performance is absolutely terrible (very slow read and very slow
>> write). When I put any kind of load on the cluster, almost all resources
>> are unused / idling, which makes me think this is a software
>> configuration issue.
>> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



