Re: fio librbd result is poor

Hi Christian,
Thanks for your reply.

At 2016-12-19 14:01:57, "Christian Balzer" <chibi@xxxxxxx> wrote:
>
>Hello,
>
>On Mon, 19 Dec 2016 13:29:07 +0800 (CST) 马忠明 wrote:
>
>> Hi guys,
>>
>> So recently I was testing our ceph cluster, which is mainly used for
>> block storage (rbd).
>>
>> We have 30 SSD drives in total (5 storage nodes, 6 SSD drives per
>> node). However, the fio results are very poor.
>>
>All relevant details are missing.
>SSD exact models, CPU/RAM config, network config, Ceph, OS/kernel, fio
>versions, the config you tested this with, as in replication.
SSD: Intel® SSD DC S3510 Series, 1.2TB, 2.5"
CPU: 2× Intel E5-2630 v4
MEM: 128GB
Network: 2×10G, bonded (LACP, mode 4)
Ceph: Hammer 0.94.6
OS/kernel: Ubuntu 14.04.5 LTS / 3.13.0-96-generic
fio: 2.12

>
>> We tested the workload on the ssd pool with the following parameters:
>>
>> "fio --size=50G \
>>      --ioengine=rbd \
>>      --direct=1 \
>>      --numjobs=1 \
>>      --rw=randwrite(randread) \
>>      --name=com_ssd_4k_randwrite(randread) \
>>      --bs=4k \
>>      --iodepth=32 \
>>      --pool=ssd_volumes \
>>      --runtime=60 \
>>      --ramp_time=30 \
>>      --rbdname=4k_test_image"
>>
>> and here is the result:
>>
>> random write: 4631; random read: 21127
>>
>> I also tested a pool (size=1, min_size=1, pg_num=256) consisting of
>> only one single ssd drive with the same workload pattern, which is
>> more acceptable (random write: 8303; random read: 27859).
>>
>I'm only going to comment on the write part.
>
>On my staging cluster (* see below) I ran your fio against the cache tier
>(so only SSDs involved) with this result:
>
>  write: io=4206.3MB, bw=71784KB/s, iops=17945, runt= 60003msec
>    slat (usec): min=0, max=531, avg= 3.26, stdev=11.33
>    clat (usec): min=5, max=41996, avg=1770.23, stdev=2260.61
>     lat (usec): min=9, max=41997, avg=1773.36, stdev=2260.60
>
>So more than 2 times better than your non-replicated test.
>
>4k randwrites stress the CPUs (run atop or such on your OSD nodes
>when doing a test run), so this might be your limit here.
>Along with less than optimal SSDs or a high latency network.
>
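As an aside, the command line we used (with its ambiguous `randwrite(randread)` shorthand) can be written as one fio job file per workload; a sketch assuming the same pool and image names as above:

```ini
; 4k random write against ssd_volumes; for the read case change
; rw= to randread (values copied from the command line above)
[global]
ioengine=rbd
pool=ssd_volumes
rbdname=4k_test_image
direct=1
bs=4k
iodepth=32
numjobs=1
size=50G
runtime=60
ramp_time=30

[com_ssd_4k_randwrite]
rw=randwrite
```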
Yes, CPU usage might be the bottleneck of the whole system. BTW, our ceph cluster is integrated with Mirantis OpenStack; the results above were run from one compute node. I also ran a stress test from all 10 compute nodes at once. The results were almost the same, with CPU usage on every storage node at roughly 50-60% and CPU usage for each SSD OSD at roughly 250-300%.
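For a rough sense of how hard the SSDs themselves are being pushed, here is a back-of-envelope write-amplification calculation. It assumes (not measured) that with size=3 and Hammer's filestore, every client write becomes three replica writes, each written twice (journal + data) on journals colocated with the OSDs:

```python
# Back-of-envelope: device-level write IOPS implied by the fio result.
client_iops = 4631       # observed 4k randwrite result from fio
replicas = 3             # ssd_volumes pool size
journal_factor = 2       # filestore journal double-write per replica
num_ssds = 30            # 5 nodes x 6 SSDs

device_write_iops = client_iops * replicas * journal_factor
per_ssd = device_write_iops / num_ssds
print(device_write_iops, round(per_ssd, 1))  # 27786 total, 926.2 per SSD
```

Under those assumptions each SSD only sees on the order of 900 small writes per second, far below what a DC-class SSD can sustain, which is consistent with the bottleneck being CPU or latency rather than the drives.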

Pool parameters for ssd_volumes: size=3, min_size=1, pg_num=2048, pgp_num=2048.
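For comparison, the common rule of thumb from the Ceph docs (roughly 100 PGs per OSD) suggests a somewhat smaller pg_num for a 30-OSD, size=3 pool; a quick sanity check:

```python
# Rule of thumb: pg_num ~= (num_osds * 100) / pool_size,
# rounded up to the next power of two.
num_osds = 30
pool_size = 3

target = num_osds * 100 / pool_size            # 1000.0
pg_num = 1 << (int(target) - 1).bit_length()   # next power of two
print(pg_num)  # 1024

# With pg_num=2048 and size=3, each OSD carries about
# 2048 * 3 / 30 PG replicas, roughly double the guideline.
print(round(2048 * pool_size / num_osds))  # 205
```

A higher-than-guideline PG count is not necessarily wrong, but it does add some per-OSD CPU and memory overhead, which may matter if CPU is already the suspected bottleneck.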


>Christian
>
>
>* Staging cluster:
>---
>4 nodes running latest Hammer under Debian Jessie (with sysvinit, kernel
>4.6) and manually created OSDs.
>Infiniband (IPoIB) QDR (40Gb/s, about 30Gb/s effective) between all nodes.
>
>2 HDD OSD nodes with 32GB RAM, fast enough CPU (E5-2620 v3), 2x 200GB DC
>S3610 for OS and journals (2 per SSD), 4x 1TB 2.5" SATAs for OSDs.
>For my amusement and edification the OSDs of one node are formatted with
>XFS, the other one EXT4 (as all my production clusters).
>
>The 2 SSD OSD nodes have 1x 200GB DC S3610 (OS and 4 journal partitions)
>and 2x 400GB DC S3610s (2 180GB partitions, so 8 SSD OSDs total), same
>specs as the HDD nodes otherwise.
>Also one node with XFS, the other EXT4.
>
>Pools are size=2, min_size=1, obviously.
>---
>
>> We have optimized the linux kernel (read_ahead, disk_scheduler, numa,
>> swappiness) and ceph.conf (client_message, filestore_queue,
>> journal_queue, rbd_cache), and checked the RAID cache setting.
>>
>> The only deficiency in the architecture is the unbalanced weight
>> between the three racks, as one rack has only one storage node.
>>
>> So can anybody tell us whether these numbers are reasonable? If not,
>> any suggestions to improve them would be appreciated.
>
>--
>Christian Balzer        Network/Systems Engineer
>chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>http://www.gol.com/
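For concreteness, the tuning areas we touched correspond to stock ceph.conf option names along these lines. This is a sketch only: the option names are the standard Hammer ones, but the values here are illustrative, not the ones we actually deployed (those weren't posted):

```ini
[client]
rbd cache = true

[osd]
; client message throttling
osd client message cap = 1000
osd client message size cap = 524288000
; filestore and journal queue depths
filestore queue max ops = 500
filestore queue max bytes = 104857600
journal queue max ops = 3000
journal queue max bytes = 104857600
```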


 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
