Why is the performance of benchmarks with small blocks extremely low?

Hello,

[reduced to ceph-users]

On Sat, 27 Sep 2014 19:17:22 +0400 Timur Nurlygayanov wrote:

> Hello all,
> 
> I installed OpenStack with Glance + Ceph OSD with replication factor 2
> and now I can see the write operations are extremely slow.
> For example, I can see only 0.04 MB/s write speed when I run rados bench
> with 512b blocks:
> 
> rados bench -p test 60 write --no-cleanup -t 1 -b 512
> 
There are two things wrong with this test:

1. You're using rados bench, when in fact you should be testing from
within VMs. For starters, a VM can make use of the rbd cache you enabled;
rados bench won't (see the fio sketch further down).

2. Given the parameters of this test you're testing network latency more
than anything else. If you monitor the Ceph nodes (atop is a good tool for
that), you will probably see that neither CPU nor disk resources are
being exhausted. With a single thread, rados bench puts that tiny 512-byte
block on the wire, the primary OSD for the PG has to write it to the
journal (on your slow, non-SSD disks) and send it to the secondary OSD,
which has to ACK the write to its journal back to the primary, which in
turn ACKs it to the client (rados bench), and only then can rados bench
send the next packet (some quick arithmetic below illustrates this).
You get the drift.
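
To put numbers on your output below: an average latency of about 0.012s
per write means roughly 1/0.012 ≈ 83 operations per second, and
83 x 512 bytes ≈ 0.04 MB/s, which is exactly what you measured. The run is
bound by per-operation round-trip latency, not by any throughput limit of
your disks or network.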

Using your parameters I can get 0.17MB/s on a pre-production cluster
that uses 4xQDR Infiniband (IPoIB) connections; on my shitty test cluster
with 1Gb/s links I get similar results to yours, unsurprisingly.

Ceph excels only with lots of parallelism, so an individual thread might
be slow (and in your case HAS to be slow, which has nothing to do with
Ceph per se), but many parallel ones will utilize the resources available.
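
For comparison, you could rerun your benchmark with more writers in
flight, e.g. (the thread count here is just an illustration, adjust to
taste):

rados bench -p test 60 write --no-cleanup -t 16 -b 512

With 16 operations in flight the per-write latencies overlap, so the
aggregate MB/s should scale up until CPU, disk or network becomes the
bottleneck.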

Having adequately sized data blocks (4MB, the default rados object size)
will help with bandwidth, and the rbd cache inside a properly configured VM
should make that happen.
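
If you do want numbers from inside a guest, a fio job against an attached
RBD volume is a reasonable sketch (the device name /dev/vdb and the job
parameters below are assumptions, adjust them for your setup; the
host-side rbd cache only comes into play if the VM's disk is configured
with cache=writeback):

fio --name=rbd-test --filename=/dev/vdb --ioengine=libaio --direct=1 \
    --rw=write --bs=4M --iodepth=16 --runtime=60 --time_based

Running the same job with --bs=512 --iodepth=1 should reproduce the
latency-bound behaviour you saw with rados bench.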

Of course in most real life scenarios you will run out of IOPS long before
you run out of bandwidth.
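
Rough numbers for your hardware (assuming the usual 100-150 random write
IOPS of a 7200 RPM SATA disk): with two OSDs, replica 2 and journals
co-located on the data disks, every client write lands on both disks twice
(journal plus data), so the cluster as a whole can only sustain on the
order of 100 small synchronous writes per second, while a 1Gb/s link could
in theory move about 110 MB/s.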


>  Maintaining 1 concurrent writes of 512 bytes for up to 60 seconds or 0 objects
>  Object prefix: benchmark_data_node-17.domain.tld_15862
>    sec Cur ops   started  finished   avg MB/s   cur MB/s   last lat    avg lat
>      0       0         0         0          0          0          -          0
>      1       1        83        82  0.0400341  0.0400391   0.008465  0.0120985
>      2       1       169       168  0.0410111  0.0419922   0.080433  0.0118995
>      3       1       240       239  0.0388959   0.034668   0.008052  0.0125385
>      4       1       356       355  0.0433309  0.0566406    0.00837  0.0112662
>      5       1       472       471  0.0459919  0.0566406   0.008343  0.0106034
>      6       1       550       549  0.0446735  0.0380859   0.036639  0.0108791
>      7       1       581       580  0.0404538  0.0151367   0.008614  0.0120654
> 
> 
> My test environment configuration:
> Hardware servers with 1Gb/s network interfaces, 64GB RAM and 16 CPU cores
> per node, HDDs WDC WD5003ABYX-01WERA0.
For anything production, consider faster network connections and SSD
journals.
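
For the journal part that usually means putting each OSD's journal on an
SSD partition, roughly like this in ceph.conf (a sketch only; the
partition labels are assumptions, use whatever devices you actually set
aside):

[osd]
# one SSD partition per OSD, labelled ceph-journal-<id> here
osd journal = /dev/disk/by-partlabel/ceph-journal-$id
# journal size in MB, only relevant if the journal is a file rather
# than a raw partition
osd journal size = 10240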

> OpenStack with 1 controller, 1 compute and 2 ceph nodes (ceph on separate
> nodes).
> CentOS 6.5, kernel 2.6.32-431.el6.x86_64.
>
You will probably want a 3.14 or 3.16 kernel for various reasons.

Regards,

Christian
 
> I tested several config options for optimizations, like in
> /etc/ceph/ceph.conf:
> 
> [default]
> ...
> osd_pool_default_pg_num = 1024
> osd_pool_default_pgp_num = 1024
> osd_pool_default_flag_hashpspool = true
> ...
> [osd]
> osd recovery max active = 1
> osd max backfills = 1
> filestore max sync interval = 30
> filestore min sync interval = 29
> filestore flusher = false
> filestore queue max ops = 10000
> filestore op threads = 16
> osd op threads = 16
> ...
> [client]
> rbd_cache = true
> rbd_cache_writethrough_until_flush = true
> 
> and in /etc/cinder/cinder.conf:
> 
> [DEFAULT]
> volume_tmp_dir=/tmp
> 
> but as a result performance increased by only ~30%, which does not look
> like a huge success.
> 
> Non-default mount options and TCP optimizations increase the speed by
> about 1%:
> 
[root@node-17 ~]# mount | grep ceph
> /dev/sda4 on /var/lib/ceph/osd/ceph-0 type xfs
> (rw,noexec,nodev,noatime,nodiratime,user_xattr,data=writeback,barrier=0)
> 
[root@node-17 ~]# cat /etc/sysctl.conf
> net.core.rmem_max = 16777216
> net.core.wmem_max = 16777216
> net.ipv4.tcp_rmem = 4096 87380 16777216
> net.ipv4.tcp_wmem = 4096 65536 16777216
> net.ipv4.tcp_window_scaling = 1
> net.ipv4.tcp_timestamps = 1
> net.ipv4.tcp_sack = 1
> 
> 
> Do we have other ways to significantly improve Ceph storage performance?
> Any feedback and comments are welcome!
> 
> Thank you!
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

