Re: fio librbd result is poor

Christian Balzer <chibi@xxxxxxx> · Mon, 19 Dec 2016 15:01:57 +0900

Hello,

On Mon, 19 Dec 2016 13:29:07 +0800 (CST) 马忠明 wrote:

> Hi guys,
> 
> So recently I was testing our Ceph cluster, which is mainly used for block storage (rbd).
> 
> We have 30 SSD drives in total (5 storage nodes, 6 SSD drives per node). However, the fio result is very poor.
>
All relevant details are missing: exact SSD models, CPU/RAM config,
network config, Ceph, OS/kernel and fio versions, and the pool config you
tested this with, as in replication.
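
As a rough sketch of how those details could be gathered in one go: the
helper names below (diagnostic_commands, run_all) are hypothetical, the
pool name is taken from your fio command, and the exact command list is
only an assumption about what is useful on a typical Linux OSD node.

```python
import subprocess


def diagnostic_commands(pool):
    """Return argv lists for commands that report the missing details.

    The selection is an assumption, not an official checklist.
    """
    return [
        ["ceph", "--version"],                             # Ceph release
        ["fio", "--version"],                              # fio release
        ["uname", "-r"],                                   # kernel version
        ["lscpu"],                                         # CPU model and core count
        ["free", "-g"],                                    # RAM
        ["lsblk", "-d", "-o", "NAME,MODEL,SIZE"],          # exact drive models
        ["ceph", "osd", "tree"],                           # CRUSH layout and weights
        ["ceph", "osd", "pool", "get", pool, "size"],      # replication factor
        ["ceph", "osd", "pool", "get", pool, "min_size"],
    ]


def run_all(commands):
    """Run each command and map its command line to the captured output."""
    report = {}
    for argv in commands:
        key = " ".join(argv)
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=30
            )
            report[key] = (result.stdout or result.stderr).strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Tool not installed on this node, or it hung.
            report[key] = "unavailable: %s" % exc
    return report
```

Calling run_all(diagnostic_commands("ssd_volumes")) on an OSD node and
pasting the resulting report into a reply would cover most of the list.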

> We tested the workload on the ssd pool with the following parameters:
> 
> "fio --size=50G \
>        --ioengine=rbd \
>        --direct=1 \
>        --numjobs=1 \
>        --rw=randwrite(randread) \
>        --name=com_ssd_4k_randwrite(randread) \
>        --bs=4k \
>        --iodepth=32 \
>        --pool=ssd_volumes \
>        --runtime=60 \
>        --ramp_time=30 \
>        --rbdname=4k_test_image"
> 
> and here is the result:
> 
> random write: 4631; random read: 21127
> 
> I also tested a pool (size=1, min_size=1, pg_num=256) consisting of only a single SSD drive with the same workload pattern, and that result is more acceptable (random write: 8303; random read: 27859).
> 
I'm only going to comment on the write part.

On my staging cluster (* see below) I ran your fio against the cache tier
(so only SSDs involved) with this result:

  write: io=4206.3MB, bw=71784KB/s, iops=17945, runt= 60003msec
    slat (usec): min=0, max=531, avg= 3.26, stdev=11.33
    clat (usec): min=5, max=41996, avg=1770.23, stdev=2260.61
     lat (usec): min=9, max=41997, avg=1773.36, stdev=2260.60

So more than 2 times better than your non-replicated test (17945 vs. your
8303).
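
The numbers above can be cross-checked with a little arithmetic. The
sketch below assumes the queue stays full for the whole run (numjobs=1,
iodepth=32), so that IOPS is roughly iodepth divided by average latency
(Little's law). The function names are made up for illustration, and the
latencies derived for your runs are estimates, not measurements.

```python
IODEPTH = 32      # from the fio command line
BLOCK_KB = 4      # bs=4k


def iops_from_latency(iodepth, avg_lat_usec):
    """IOPS sustained by `iodepth` outstanding I/Os at a given average latency."""
    return iodepth / (avg_lat_usec / 1e6)


def latency_from_iops(iodepth, iops):
    """Average per-I/O latency in microseconds implied by an IOPS figure."""
    return iodepth / iops * 1e6


staging_iops = 17945        # reported by fio on the staging cluster
single_ssd_iops = 8303      # poster's size=1 pool
replicated_iops = 4631      # poster's replicated ssd pool

# Ratio behind "more than 2 times better".
speedup = staging_iops / single_ssd_iops

# Bandwidth implied by the IOPS figure; fio reported 71784KB/s.
bandwidth_kb_s = staging_iops * BLOCK_KB

# IOPS predicted from the reported average total latency (1773.36 usec).
predicted_iops = iops_from_latency(IODEPTH, 1773.36)

# Average latencies implied by the poster's two write results.
single_ssd_lat = latency_from_iops(IODEPTH, single_ssd_iops)
replicated_lat = latency_from_iops(IODEPTH, replicated_iops)

print("speedup vs single SSD: %.2fx" % speedup)
print("implied bandwidth: %d KB/s" % bandwidth_kb_s)
print("IOPS predicted from latency: %.0f" % predicted_iops)
print("implied latency, size=1 pool: %.0f usec" % single_ssd_lat)
print("implied latency, replicated pool: %.0f usec" % replicated_lat)
```

Under that assumption the prediction lands within about 1% of the
measured 17945 IOPS, and your two write results correspond to average
latencies of roughly 3.9ms and 6.9ms per I/O.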

4k randwrites stress the CPUs (run atop or similar on your OSD nodes
during a test run), so this might be your limit here, along with
less-than-optimal SSDs or a high-latency network.
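
If atop is not at hand, a crude stand-in is to sample /proc/stat on an
OSD node before and after the fio run and compute how busy the CPUs were
in between. This is only a sketch: the function names are hypothetical,
it assumes a Linux /proc/stat with the usual field order, and counting
iowait as idle time is a choice made here, not the only possible one.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def busy_fraction(before, after):
    """Fraction of CPU time spent non-idle between two 'cpu' line samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    # user, nice, system, idle, iowait, irq, softirq, steal
    # (guest time is already included in user, so it is left out of the total)
    total = sum(deltas[:8])
    idle = deltas[3] + deltas[4]   # idle + iowait
    if total <= 0:
        return 0.0
    return 1.0 - idle / total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which aggregates all CPUs."""
    with open(path) as handle:
        return handle.readline()
```

Take one read_cpu_line() sample just before starting fio and one right
after it finishes, then pass both to busy_fraction(). A value close to
1.0 on the OSD nodes would point at the CPUs as the bottleneck.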

Christian

* Staging cluster:
---
4 nodes running latest Hammer under Debian Jessie (with sysvinit, kernel
4.6) and manually created OSDs. 
Infiniband (IPoIB) QDR (40Gb/s, about 30Gb/s effective) between all nodes.

2 HDD OSD nodes with 32GB RAM, fast enough CPU (E5-2620 v3), 2x 200GB DC S3610 for
OS and journals (2 per SSD), 4x 1TB 2.5" SATAs for OSDs.
For my amusement and edification the OSDs of one node are formatted with
XFS, the other one EXT4 (as all my production clusters).

The 2 SSD OSD nodes have 1x 200GB DC S3610 (OS and 4 journal partitions)
and 2x 400GB DC S3610s (2 180GB partitions, so 8 SSD OSDs total), same
specs as the HDD nodes otherwise.
Also one node with XFS, the other EXT4.

Pools are size=2, min_size=1, obviously. 
---

> We have optimized the Linux kernel (read_ahead, disk_scheduler, numa, swappiness) and ceph.conf (client_message, filestore_queue, journal_queue, rbd_cache), and checked the RAID cache settings.
> 
> The only deficiency in the architecture is the unbalanced weight between the three racks: one rack has only one storage node.
> 
> So can anybody tell us whether these numbers are reasonable? If not, any suggestions to improve them would be appreciated.

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/