Re: fio librbd result is poor

Christian Balzer <chibi@xxxxxxx> · Mon, 19 Dec 2016 16:50:41 +0900

Hello,

On Mon, 19 Dec 2016 15:05:05 +0800 (CST) mazhongming wrote:

> Hi Christian,
> Thanks for your reply.
> 
> 
> At 2016-12-19 14:01:57, "Christian Balzer" <chibi@xxxxxxx> wrote:
> >
> >Hello,
> >
> >On Mon, 19 Dec 2016 13:29:07 +0800 (CST) 马忠明 wrote:
> >
> >> Hi guys,
> >> 
> >> So recently I have been testing our Ceph cluster, which is mainly used for block storage (RBD).
> >> 
> >> We have 30 SSD drives in total (5 storage nodes, 6 SSD drives per node). However, the fio results are very poor.
> >>
> >All relevant details are missing.
> >SSD exact models, CPU/RAM config, network config, Ceph, OS/kernel, fio
> >versions, the config you tested this with, as in replication.
> SSD:Intel® SSD DC S3510 Series 1.2TB 2.5"
Slower than mine, but not massively so, and you have many more of them.
But your distribution (CRUSH map based on 3 racks, right?) limits that
advantage in numbers.
I'd expect them to be around 50-60% busy with the fio RBD engine.

The endurance of 0.3 DWPD (0.1 really, after in-line journals and other
overhead like FS journals) would worry me.
Are you monitoring their wear-out levels?
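
To put the endurance figures above into daily volumes, here is a rough
sketch of the arithmetic. The capacity, DWPD rating, drive count and pool
size are the ones mentioned in this thread; the write amplification factor
of 3 is only an assumption that matches the 0.3 -> 0.1 DWPD estimate, and
the pool-wide figure assumes writes are spread evenly over all drives
(which an unbalanced CRUSH map will not quite deliver).

```python
# Rough endurance budget for one SSD and for the whole pool.
# All inputs come from this thread except WRITE_AMPLIFICATION, which is an
# assumed factor (0.3 DWPD rated -> roughly 0.1 DWPD effective).

CAPACITY_GB = 1200          # DC S3510 1.2TB
RATED_DWPD = 0.3            # drive writes per day, as quoted above
WRITE_AMPLIFICATION = 3.0   # assumption: in-line journal plus FS journal overhead
NUM_DRIVES = 30             # 5 nodes x 6 SSDs
REPLICATION = 3             # pool size=3


def daily_budget_gb(capacity_gb, dwpd, write_amplification):
    """Object data (GB/day) one drive can absorb without exceeding its rating."""
    return capacity_gb * dwpd / write_amplification


per_drive = daily_budget_gb(CAPACITY_GB, RATED_DWPD, WRITE_AMPLIFICATION)

# Every client write lands on REPLICATION drives, so the pool-wide client
# budget is the sum over all drives divided by the replica count.
pool_wide = per_drive * NUM_DRIVES / REPLICATION

print(f"per drive: {per_drive:.0f} GB/day")
print(f"pool-wide: {pool_wide:.0f} GB/day of client writes")
```

Under these assumptions each drive can take about 120 GB/day and the whole
pool about 1200 GB/day of client writes before the rated endurance is
exceeded. The actual wear is best read from the drives themselves, for
example from the SMART attributes shown by "smartctl -A".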

> CPU:2×Intel E5-2630v4
Slightly slower than the ones in my test cluster, but not significantly so.

> MEM:128GB
> Network config:2*10G bond4  LACP network connection 
> Ceph:Hammer 0.94.6
I'd upgrade to the latest Hammer, just in case anybody ever plays with
cache-tiering on there, which is deadly broken in that version.

> OS/kernel:  Ubuntu 14.04.5 LTS/3.13.0-96-generic
That kernel is a bit dated and vastly different than mine, but it
shouldn't be any factor in the result.

> Fio:2.12
> 
Not missing a .1. in there?

Fio 2.1.11 in my case, but I really dislike the RBD engine and the various
bugs/inconsistencies people keep finding with it.

Testing from within a (librbd backed) VM should be more realistic anyway.

And this turns out to be one of these fio RBD engine corner cases, as I
ran your fio command line (--size=50G) against an image that was just 20GB
in size.

When running from a VM with libaio, or with a reduced test size of 5GB,
the IOPS came down to about 8500, still faster than yours, but only about
2x instead of 4x.

> 
> >
> >> We tested the workload on ssd pool with following parameter :
> >> 
> >> "fio --size=50G \
> >>        --ioengine=rbd \
> >>        --direct=1 \
> >>        --numjobs=1 \
> >>        --rw=randwrite(randread) \
> >>        --name=com_ssd_4k_randwrite(randread) \
> >>        --bs=4k \
> >>        --iodepth=32 \
> >>        --pool=ssd_volumes \
> >>        --runtime=60 \
> >>        --ramp_time=30 \
> >>        --rbdname=4k_test_image"
> >> 
> >> and here is the result:
> >> 
> >> random write: 4631 IOPS; random read: 21127 IOPS
> >> 
> >> I also tested a pool (size=1, min_size=1, pg_num=256) consisting of only a single SSD drive with the same workload pattern, and the result was more acceptable (random write: 8303 IOPS; random read: 27859 IOPS).
> >> 
> >I'm only going to comment on the write part.
> >
> >On my staging cluster (* see below) I ran your fio against the cache tier
> >(so only SSDs involved) with this result:
> >
> >  write: io=4206.3MB, bw=71784KB/s, iops=17945, runt= 60003msec
> >    slat (usec): min=0, max=531, avg= 3.26, stdev=11.33
> >    clat (usec): min=5, max=41996, avg=1770.23, stdev=2260.61
> >     lat (usec): min=9, max=41997, avg=1773.36, stdev=2260.60
> >
> >So more than 2 times better than your non-replicated test.
> >
> >4k randwrites stress the CPUs (run atop or such on your OSD nodes
> >when doing a test run), so this might be your limit here.
> >Along with less than optimal SSDs or a high latency network.
> 
> >
> Yes, CPU usage might be the bottleneck of the whole system. BTW, our Ceph cluster is integrated with Mirantis OpenStack, and the above result was run from one compute node. I also ran a stress test from all 10 compute nodes. The result is almost the same: CPU usage on every storage node is about 50-60%, and CPU usage of every SSD OSD process is about 250-300%.
> 

Yes, the OSD 300% CPU usage looks familiar.
The hammer code seems to peter out there, even if there's still a core or
2 available.

The Ceph latency is something that's obviously being addressed by the
developers.
Check the archives and google (Nick Fisk) for how to tune up your CPU
settings to get every last IOPS from your HW.

Another thing to always remember here is that you're testing network
latency as well when running fio with direct=1 against Ceph. The local RBD
cache is bypassed, so you're constrained by how long the network
round-trips take (which is of course significantly longer than over a
local SATA cable) plus the Ceph latency (code and CPU speeds).
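
As a rough illustration of why latency caps the result: with a fixed
queue depth, Little's law ties IOPS and mean latency together
(outstanding I/Os = IOPS * mean latency). The sketch below applies that to
the 4k random-write figures quoted in this thread. The labels are only
descriptive names, and it assumes the queue of 32 was kept full for the
whole run.

```python
# Little's law for a closed-loop benchmark:
#   outstanding I/Os = IOPS * mean latency  =>  latency = iodepth / IOPS

IODEPTH = 32  # --iodepth=32 with --numjobs=1 in the fio command in this thread

# 4k random-write IOPS figures quoted in this thread.
WRITE_IOPS = {
    "30-SSD pool, size=3": 4631,
    "single-SSD pool, size=1": 8303,
    "staging cluster, VM/libaio or 5GB test": 8500,
    "staging cluster, rbd engine": 17945,
}


def mean_latency_us(iodepth, iops):
    """Mean per-I/O latency in microseconds implied by Little's law."""
    return iodepth / iops * 1_000_000


for label, iops in WRITE_IOPS.items():
    print(f"{label}: ~{mean_latency_us(IODEPTH, iops):.0f} us per write")
```

The 17945 IOPS run works out to roughly 1.78 ms per write, in line with
the average lat of 1773 usec that fio reported, while 4631 IOPS
corresponds to about 6.9 ms per write. At a fixed iodepth the only ways to
get more IOPS are lower per-write latency or more parallelism (more jobs
or clients).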

Christian

> 
> Pool parameters for ssd_volumes (size=3, min_size=1, pg_num=2048, pgp_num=2048)
> 
> 
> 
> 
> >Christian
> >
> >
> >* Staging cluster:
> >---
> >4 nodes running latest Hammer under Debian Jessie (with sysvinit, kernel
> >4.6) and manually created OSDs. 
> >Infiniband (IPoIB) QDR (40Gb/s, about 30Gb/s effective) between all nodes.
> >
> >2 HDD OSD nodes with 32GB RAM, fast enough CPU (E5-2620 v3), 2x 200GB DC S3610 for
> >OS and journals (2 per SSD), 4x 1GB 2.5" SATAs for OSDs.
> >For my amusement and edification the OSDs of one node are formatted with
> >XFS, the other one EXT4 (as all my production clusters).
> >
> >The 2 SSD OSD nodes have 1x 200GB DC S3610 (OS and 4 journal partitions)
> >and 2x 400GB DC S3610s (2 180GB partitions, so 8 SSD OSDs total), same
> >specs as the HDD nodes otherwise.
> >Also one node with XFS, the other EXT4.
> >
> >Pools are size=2, min_size=1, obviously. 
> >---
> >
> >> 
> >> We have tuned the Linux kernel (read_ahead, disk_scheduler, numa, swappiness) and ceph.conf (client_message, filestore_queue, journal_queue, rbd_cache), and checked the RAID cache settings.
> >> 
> >> The only deficiency of the architecture is the unbalanced weight between the three racks: one rack has only one storage node.
> >> 
> >> So can anybody tell us whether these numbers are reasonable? If not, any suggestions to improve them would be appreciated.
> >
> >
> >-- 
> >Christian Balzer        Network/Systems Engineer                
> >chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> >http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com