Re: Optimizing terrible RBD performance

The 4M throughput numbers you see now (150 MB/s read, 60 MB/s write) are probably limited by your 1G network and can probably go higher if you upgrade it (10G, or active bonds).

In real life, the applications and workloads determine the block size, io depth, whether the io is sequential or random, and whether it uses cache buffering or requests to bypass the cache. Only a few applications (such as backup) let you specify such settings yourself.

So what you could do is understand your workload: is it backups, which use large sequential blocks, or virtualization or databases, which require high iops with small block sizes? Then use a tool like fio to see what your hardware can provide under different configurations. It is also a good idea to run a load-collection tool like atop/sar/collectl during such tests, so you know where your bottlenecks are and what to change (adding osds or nodes, different disk types, network configuration). For instance, see the sketch below.
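A minimal sketch of such a sweep (the pool name "testbench" and image name "testimg" are placeholders - substitute your own):

# database-like: small random writes at a high queue depth
fio --name=randwrite-4k --ioengine=rbd --clientname=admin \
    --pool=testbench --rbdname=testimg \
    --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based

# backup-like: large sequential writes
fio --name=seqwrite-4m --ioengine=rbd --clientname=admin \
    --pool=testbench --rbdname=testimg \
    --rw=write --bs=4M --iodepth=16 --runtime=60 --time_based

with something like "sar -u -d -n DEV 1" running in another shell to capture CPU, disk and network load.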

For example, if your workload is backups and needs to go above 150 MB/s, the first step is to upgrade your 1G network; if you still need higher throughput after that, you can add more osds, then nodes, etc.

If your workload requires something like 50k random iops, you will not be able to achieve that with hdds.

/Maged

On 04/10/2019 21:00, Petr Bena wrote:
Thank you guys,

I changed the FIO parameters and it looks far better now - reading at about 150MB/s, writing at over 60MB/s.

Now the question is what I could change in my setup to make it this fast in real use - the RBD is used as an LVM PV for a VG shared between Xen hypervisors. This is the PV:

  --- Physical volume ---
  PV Name               /dev/rbd0
  VG Name               VG_XenStorage-275588a7-4895-9073-aa81-61a3d98dfba7
  PV Size               4.00 TiB / not usable 0
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              1048573
  Free PE               740048
  Allocated PE          308525
  PV UUID               IieC3P-2dw4-Zotx-ZG8v-TKV0-WBBP-5YQF4P

Physical extent size is 4MB, but I am not sure that means anything here, and I don't know whether the LVM subsystem on Linux can be tuned for how large the blocks it reads / writes are. Is there anything I can do to improve the performance, other than replacing the disks with SSDs? Does it mean that IOPS is my bottleneck now?
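For reference, I can at least inspect the image's object size and the device readahead like this (the pool/image spec is a placeholder for my real one):

rbd info testbench/testimg        # "order 22" means 4 MiB objects
blockdev --getra /dev/rbd0        # current readahead, in 512-byte sectors
blockdev --setra 4096 /dev/rbd0   # e.g. raise readahead for sequential reads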

On 04/10/2019 18:53, Maged Mokhtar wrote:
The tests are measuring different things, and the fio test result of 1.5 MB/s is not bad.

The rados write bench uses a 4M block size by default, runs 16 threads, and is random in nature; you can change the block size and thread count, for example:
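To use 4k blocks and 32 concurrent ops instead of the defaults:

rados bench -p testbench 10 write -b 4096 -t 32 --no-cleanup

where -b sets the write block size in bytes and -t the number of concurrent operations.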

The dd command uses a 512-byte block size and a single thread by default, and is sequential in nature. You can change the block size via bs to 4M and it will give much higher results; it will also use buffered io unless you make it unbuffered (oflag=direct). For example:
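Something like this (note it writes over the device, so only run it against a disposable image):

dd if=/dev/zero of=/dev/rbd0 bs=4M count=1024 oflag=direct

which is much closer to the rados bench defaults, apart from dd still being single threaded.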

With fio you have full control over block size, threads, rand/seq, buffered, direct, sync, etc. The fio test you are running uses a queue depth of 32 with 4k random writes. To compare with rados, change the block size to 4M and make it sequential, for example:
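A sketch of such a job file (the pool/image names are placeholders - keep whatever your test.fio already points at):

[global]
ioengine=rbd
clientname=admin
pool=testbench
rbdname=testimg
rw=write
bs=4M

[rbd_iodepth16]
iodepth=16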

The 1.58 MB/s is not bad for this test. At 4k this is about 400 iops; if you are doing standard 3x replicas, your cluster is doing 1200 iops, and that is just for client data - it also has other overhead like metadata db lookups/updates, so it is actually doing more. But even 1200 random iops across 6 spinning disks is 200 random iops per disk, which is acceptable.

/Maged


On 04/10/2019 17:28, Petr Bena wrote:
Hello,

I tried to use FIO on an RBD device I just created, and writing is really terrible (around 1.5MB/s):

[root@ceph3 tmp]# fio test.fio
rbd_iodepth32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=32
fio-3.7
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=1628KiB/s][r=0,w=407 IOPS][eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=115425: Fri Oct 4 17:25:24 2019
  write: IOPS=384, BW=1538KiB/s (1574kB/s)(39.1MiB/26016msec)
    slat (nsec): min=1452, max=591931, avg=14498.83, stdev=17295.97
    clat (usec): min=1795, max=793172, avg=83218.39, stdev=83485.65
     lat (usec): min=1810, max=793201, avg=83232.89, stdev=83485.19
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[   12],
     | 30.00th=[   21], 40.00th=[   36], 50.00th=[   61], 60.00th=[   89],
     | 70.00th=[  116], 80.00th=[  146], 90.00th=[  190], 95.00th=[  218],
     | 99.00th=[  380], 99.50th=[  430], 99.90th=[  625], 99.95th=[  768],
     | 99.99th=[  793]
   bw (  KiB/s): min=  520, max= 4648, per=99.77%, avg=1533.40, stdev=754.35, samples=52
   iops        : min=  130, max= 1162, avg=383.33, stdev=188.61, samples=52
  lat (msec)   : 2=0.08%, 4=4.77%, 10=13.56%, 20=11.66%, 50=16.40%
  lat (msec)   : 100=17.66%, 250=32.53%, 500=3.05%, 750=0.21%, 1000=0.08%
  cpu          : usr=0.57%, sys=0.52%, ctx=3976, majf=0, minf=8489
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.7%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=1538KiB/s (1574kB/s), 1538KiB/s-1538KiB/s (1574kB/s-1574kB/s), io=39.1MiB (40.0MB), run=26016-26016msec

Disk stats (read/write):
    dm-6: ios=0/2, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=20/368, aggrmerge=0/195, aggrticks=105/6248, aggrin_queue=6353, aggrutil=9.07%
  xvda: ios=20/368, merge=0/195, ticks=105/6248, in_queue=6353, util=9.07%


Incomparably worse than the RADOS bench results.

On 04/10/2019 17:15, Alexandre DERUMIER wrote:
Hi,

dd if=/dev/zero of=/dev/rbd0 writes at 5MB/s -
you are testing with a single thread/iodepth=1, sequentially, here.
So only 1 disk is used at a time, and you have network latency on top.

rados bench is doing 16 concurrent writes.


Try to test with fio, for example with a bigger iodepth, small block/big block, seq/rand. Something like the sketch below:
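A quick way to sweep those axes (a sketch; note it writes to /dev/rbd0 and is therefore destructive to any data on it):

for bs in 4k 4M; do
  for rw in randwrite write; do
    fio --name=rbd0-$rw-$bs --filename=/dev/rbd0 --ioengine=libaio \
        --direct=1 --rw=$rw --bs=$bs --iodepth=32 \
        --runtime=30 --time_based
  done
done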



----- Original Message -----
From: "Petr Bena" <petr@bena.rocks>
To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Friday, 4 October 2019 17:06:48
Subject: Optimizing terrible RBD performance

Hello,

If this is too long for you, there is a TL;DR section at the bottom.

I created a CEPH cluster made of 3 SuperMicro servers, each with 2 OSDs
(WD RED spinning drives), and I would like to optimize the performance
of RBD, which I believe is held back by some wrong CEPH configuration,
because from my observation all resources (CPU, RAM, network, disks) are
basically unused / idling even when I put load on the RBD.

Each drive should manage about 50MB/s read / write, and when I run the
RADOS benchmark I see values that are somewhat acceptable. The
interesting part is that while the RADOS benchmark runs, I can see all
disks read / write at their limits, heavy network utilization and even
some CPU utilization - on the other hand, when I put any load on the RBD
device, performance is terrible: reading is very slow (20MB/s), writing
as well (5 - 20MB/s), running dd if=/dev/zero of=/dev/rbd0 writes at
5MB/s - and the weirdest part - resources are almost unused - no CPU
usage, no network traffic, minimal disk activity.

It looks to me as if CEPH wasn't even trying to perform as long as the
access is via RBD. Has anyone ever seen this kind of issue? Is there any
way to track down why it is so slow? Here are some outputs:

[root@ceph1 cephadm]# ceph --version
ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
[root@ceph1 cephadm]# ceph health
HEALTH_OK

I would expect write speed to be at least the 50MB/s I get when writing
to the disks directly; rados bench reaches this speed (sometimes even
more):

[root@ceph1 cephadm]# rados bench -p testbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size
4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_ceph1.lan.insw.cz_60873
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16        22         6   23.9966        24     0.966194    0.565671
    2      16        37        21   41.9945        60      1.86665    0.720606
    3      16        54        38   50.6597        68      1.07856    0.797677
    4      16        70        54   53.9928        64      1.58914     0.86644
    5      16        83        67   53.5924        52     0.208535    0.884525
    6      16        97        81   53.9923        56      2.22661    0.932738
    7      16       111        95   54.2781        56       1.0294    0.964574
    8      16       133       117   58.4921        88     0.883543     1.03648
    9      16       143       127   56.4369        40     0.352169     1.00382
   10      16       154       138   55.1916        44     0.227044     1.04071

Read speed is even higher as it's probably reading from multiple devices
at once:

[root@ceph1 cephadm]# rados bench -p testbench 100 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16        96        80   319.934       320     0.811192    0.174081
    2      13       161       148   295.952       272     0.606672    0.181417


Running rbd bench shows writes at 50MB/s (which is OK) and reads at
20MB/s (not so OK), but the REAL performance is much worse - when I
actually access the block device and try to write or read anything, it
is sometimes extremely low, as in 5MB/s or 20MB/s only.

Why is that? What can I do to debug / trace / optimize this issue? I
don't know if there is any point in upgrading the hardware if, according
to monitoring, the current HW is basically not being utilized at all.


TL;DR;

I created a ceph cluster from 6 OSDs (dedicated 1G net, 6 4TB spinning
drives). The rados performance benchmark shows acceptable performance,
but RBD performance is absolutely terrible (very slow read and very slow
write). When I put any kind of load on the cluster, almost all resources
are unused / idling, so this makes me think it is a software
configuration issue.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com