Re: Low RBD Performance

On Tue, Feb 4, 2014 at 9:29 AM, Gruher, Joseph R
<joseph.r.gruher@xxxxxxxxx> wrote:
>
>
>>-----Original Message-----
>>From: ceph-users-bounces@xxxxxxxxxxxxxx [mailto:ceph-users-
>>bounces@xxxxxxxxxxxxxx] On Behalf Of Mark Nelson
>>Sent: Monday, February 03, 2014 6:48 PM
>>To: ceph-users@xxxxxxxxxxxxxx
>>Subject: Re:  Low RBD Performance
>>
>>On 02/03/2014 07:29 PM, Gruher, Joseph R wrote:
>>> Hi folks-
>>>
>>> I'm having trouble demonstrating reasonable performance of RBDs.  I'm
>>> running Ceph 0.72.2 on Ubuntu 13.04 with the 3.12 kernel.  I have four
>>> dual-Xeon servers, each with 24GB RAM, and an Intel 320 SSD for
>>> journals and four WD 10K RPM SAS drives for OSDs, all connected with
>>> an LSI 1078.  This is just a lab experiment using scrounged hardware,
>>> so nothing is sized specifically for a Ceph cluster; it's just what I
>>> have lying around, but I should have more than enough CPU and memory
>>> resources.  Everything is connected over a single 10GbE link.
>>>
>>> When testing with RBDs from four clients (also running Ubuntu 13.04
>>> with
>>> 3.12 kernel) I am having trouble breaking 300 IOPS on a 4KB random
>>> read or write workload (cephx set to none, replication set to one).
>>> IO is generated using FIO from four clients, each hosting a single 1TB
>>> RBD, and I've experimented with queue depths and increasing the number
>>> of RBDs without any benefit.  300 IOPS for a pool of 16 10K RPM HDDs
>>> seems quite low, not to mention that the SSD journals should give a
>>> good boost on write workloads.  When I run a 4KB object write workload
>>> in Cosbench I can approach 3500 obj/sec, which seems more reasonable.
>>>
>>> Sample FIO configuration:
>>>
>>> [global]
>>> ioengine=libaio
>>> direct=1
>>> ramp_time=300
>>> runtime=300
>>>
>>> [4k-rw]
>>> description=4k-rw
>>> filename=/dev/rbd1
>>> rw=randwrite
>>> bs=4k
>>> stonewall
>>>
>>> I use --iodepth=X on the FIO command line to set the queue depth when
>>> testing.
>>>
>>> I notice in the FIO output that, despite the iodepth setting, it
>>> reports an IO depth of only 1, which would certainly help explain the
>>> poor performance.  I'm at a loss as to why; I wonder if it could be
>>> something specific to RBD behavior, such as needing a different IO
>>> engine to establish queue depth.
>>>
>>> IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>
>>> Any thoughts appreciated!
>>
>>Interesting results with the io depth at 1.  I haven't seen that behaviour
>>when using libaio, direct=1, and higher io depths.  Is this kernel RBD or
>>QEMU/KVM?  If it's QEMU/KVM, is it the libvirt driver?
>>
>>Certainly 300 IOPS is low for that kind of setup compared to what we've seen
>>for RBD on other systems (especially with 1x replication).  Given that you are
>>seeing more reasonable performance with RGW, I guess I'd look at a couple
>>things:
>>
>>- Figure out why fio is reporting queue depth = 1
>
> Yup, I agree, I will work on this and report back.  First thought is to try specifying the queue depth in the FIO workload file instead of on the command line.
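> A minimal sketch of what I have in mind, with the depth value purely as an
> example:
>
> [global]
> ioengine=libaio
> direct=1
> iodepth=32
> ramp_time=300
> runtime=300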
>
>>- Does increasing the num jobs help (ie get concurrency another way)?
>
> I will give this a shot.
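> For illustration, that would look something like the job section below (the
> numbers are just placeholders), with group_reporting to aggregate the
> per-job results:
>
> [4k-rw]
> description=4k-rw
> filename=/dev/rbd1
> rw=randwrite
> bs=4k
> iodepth=16
> numjobs=4
> group_reporting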
>
>>- Do you have enough PGs in the RBD pool?
>
> I should; for 16 OSDs and no replication I use 2048 PGs/PGPs (100 * 16 / 1 = 1600, rounded up to the next power of 2).
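> For illustration, setting that on an existing pool would look roughly like
> this (assuming the pool is named "rbd"):
>
> # 100 PGs per OSD * 16 OSDs / 1 replica = 1600 -> next power of two = 2048
> ceph osd pool set rbd pg_num 2048
> ceph osd pool set rbd pgp_num 2048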
>
>>- Are you using the virtio driver if QEMU/KVM?
>
> No virtualization; the clients are bare metal using the kernel RBD driver.

I believe that direct IO via the kernel client will go all the way to
the OSDs and to disk before returning.  I imagine that something in the
stack is preventing the dispatch from actually happening asynchronously
in that case, and the reason you're getting 300 IOPS is that your total
RTT is about 3 ms with that code...
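Back-of-the-envelope: if each IO has to complete before the next one is
issued, IOPS is roughly 1 / RTT = 1 / 0.003 s, or about 333, which is
right around what you're measuring.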

Ilya, is that assumption of mine correct? One thing that occurs to me
is that for direct IO it's fair to use the ack instead of on-disk
response from the OSDs, although that would only help us for people
using btrfs.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



