On Fri, Dec 11, 2015 at 9:15 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Fri, 11 Dec 2015, Zhi Zhang wrote:
>> Hi Guys,
>>
>> We have a small 4-node cluster. Here is the hardware configuration:
>>
>> 11 x 300GB SSDs, 24 cores, and 32GB of memory per node.
>> All the nodes are connected over a single 1Gb/s network.
>>
>> So we have one Monitor and 44 OSDs for testing kernel RBD IOPS using
>> fio. Here are the major fio options:
>>
>> -direct=1
>> -rw=randwrite
>> -ioengine=psync
>> -size=1000M
>> -bs=4k
>> -numjobs=1
>>
>> The max IOPS we can achieve for a single writer (numjobs=1) is close
>> to 1000, which means each IO from RBD takes 1.x ms.
>>
>> From the OSD logs, we can also observe that most osd_ops take 1.x ms,
>> including op processing, journal writing, replication, etc., before
>> the commit is sent back to the client.
>>
>> The network RTT is around 0.04 ms.
>> Most osd_ops on the primary OSD take around 0.5~0.7 ms, of which the
>> journal write takes 0.3 ms.
>> Most osd_repops, including the journal write on the peer OSD, take
>> around 0.5 ms.
>>
>> We even tried modifying the journal to write to the page cache only,
>> but didn't see a very significant improvement. Does this mean it is
>> the best result we can get for a single write on a single RBD?
>
> What version is this? There have been a few recent changes that will
> reduce the wall clock time spent preparing/processing a request. There
> is still a fair bit of work to do here, though--the theoretical lower
> bound is the SSD write time + 2x RTT (client <-> primary osd <->
> replica osd <-> replica ssd).

The Ceph version is 0.94.1 with a few backports. I have already seen
some of the related changes. I will try a newer version and keep you
guys updated. Thanks.

> sage
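P.S. For anyone who wants to reproduce the test, the options above
correspond to a fio job file along these lines. This is only a sketch:
the job name and the kernel RBD device path /dev/rbd0 are my
assumptions, so adjust them for your own setup.

  ; rbd-randwrite.fio -- a sketch of the workload described above
  ; (job name and /dev/rbd0 are assumptions; map your image with
  ;  "rbd map" and point filename at the resulting device)
  [rbd-randwrite]
  ; target the mapped kernel RBD block device directly
  filename=/dev/rbd0
  ; bypass the client page cache
  direct=1
  rw=randwrite
  ; synchronous pwrite(2), i.e. an effective queue depth of 1
  ioengine=psync
  size=1000M
  bs=4k
  ; a single writer
  numjobs=1

Run it with "fio rbd-randwrite.fio".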
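P.P.S. Plugging the numbers measured above into Sage's lower bound, as
a rough sanity check (the 0.3 ms journal write and 0.04 ms RTT are the
averages quoted earlier, so treat the result as a ballpark figure):

  lower bound per IO = SSD write time + 2 x RTT
                     ~= 0.3 ms + 2 x 0.04 ms
                     = 0.38 ms

  => a ceiling of roughly 1 / 0.38 ms ~= 2600 IOPS at queue depth 1

So against the ~1 ms per IO we currently observe, there looks to be on
the order of 0.6 ms of per-op software overhead left to win back.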