Hello,

ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The journals are on (separate) DC 3700s; the actual OSDs are RAID6 volumes behind an Areca 1882 with 4GB of cache.

Running this fio:

fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128

results in:

30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
3200 IOPS from a VM using userspace RBD
2900 IOPS from a host with a kernelspace-mounted RBD

When running fio from the VM via RBD, the utilization of the journals is about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after some obvious merging). The OSD processes are quite busy, showing well over 200% CPU in atop, but the system is not CPU or otherwise resource starved at that moment.

Running multiple instances of this test from several VMs on different hosts changes nothing: the aggregate IOPS for the whole cluster still lands around 3200.

Now clearly RBD has to deal with latency here, but the network is IPoIB with the associated low latency, and the journal SSDs are the (consistently) fastest ones around.

I guess what I am wondering is whether this is normal and to be expected, and if not, where all that potential performance got lost.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
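P.S. A back-of-the-envelope check on what latency those numbers imply, assuming a steady state with the queue kept full (Little's law: IOPS = iodepth / per-request latency):

```shell
#!/bin/sh
# Average per-request service time implied by the observed results.
# iodepth is from the fio command line above; iops is the observed
# userspace-RBD result from the VM.
iodepth=128
iops=3200
# latency in microseconds = iodepth / IOPS
echo $(( iodepth * 1000000 / iops ))   # prints 40000, i.e. 40 ms per 4k write
```

So each request spends roughly 40 ms in flight end-to-end, far above what the SSDs or IPoIB alone would account for.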