Re: RBD fio Performance concerns

Hello Mark,

First of all, thank you again for another accurate answer :-).

> I would have expected write aggregation and cylinder affinity to
> have eliminated some seeks and improved rotational latency resulting
> in better than theoretical random write throughput.  Against those
> expectations 763/850 IOPS is not so impressive.  But, it looks to
> me like you were running fio in a 1G file with 100 parallel requests.
> The default RBD stripe width is 4M.  This means that those 100
> parallel requests were being spread across 256 (1G/4M) objects.
> People in the know tell me that writes to a single object are
> serialized, which means that many of those (potentially) parallel
> writes were to the same object, and hence serialized.  This would
> increase the average request time for the colliding operations,
> and reduce the aggregate throughput correspondingly.  Use a
> bigger file (or a narrower stripe) and this will get better.

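To make the stripe arithmetic in that paragraph concrete, here is a small back-of-envelope script (the occupancy formula is my own illustration of the collision effect, not something from the thread):

```python
# Arithmetic behind the object-collision argument quoted above.
# Assumptions: default 4M RBD objects, the 1G fio file and 100
# parallel requests from the original test, uniformly random offsets.

KiB, MiB, GiB = 1024, 1024**2, 1024**3

stripe = 4 * MiB        # default RBD object size
file_size = 1 * GiB     # fio file in the original test
inflight = 100          # parallel requests

objects = file_size // stripe   # 1G / 4M = 256 distinct objects

# Expected number of distinct objects hit by `inflight` uniformly
# random 4K writes (classic occupancy formula); the shortfall versus
# `inflight` is requests that collide on an object and get serialized.
expected_distinct = objects * (1 - (1 - 1 / objects) ** inflight)

print(objects, round(expected_distinct))  # prints: 256 83
```

So with a 1G file, roughly 17 of the 100 in-flight writes would be queued behind another write to the same object at any moment, which is consistent with the advice to use a bigger file or a narrower stripe.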

I followed your advice and used a bigger file (10G) and an iodepth of
128, and I've been able to reach ~27k IOPS for random reads, but I
couldn't get more than 870 IOPS for random writes... which is more or
less expected. But the thing I still don't understand is: why are the
sequential reads/writes lower than the random ones? Or do I just need
to look at the bandwidth for those values?
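For concreteness, the test described corresponds to an fio job along these lines (a reconstruction, not the actual job file; the directory path to the mounted RBD image is an assumption):

```ini
; Hypothetical fio job approximating the 4K random-write test above.
; directory= points at the mounted RBD image -- an assumed path.
[rbd-4k-randwrite]
ioengine=libaio
direct=1
rw=randwrite
bs=4k
size=10g
iodepth=128
directory=/mnt/rbd
```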

Thank you.

Regards.
--
Best regards.
Sébastien HAN.


On Fri, Nov 16, 2012 at 11:59 PM, Mark Kampe <mark.kampe@xxxxxxxxxxx> wrote:
> On 11/15/2012 12:23 PM, Sébastien Han wrote:
>
>> First of all, I would like to thank you for this well explained,
>> structured and clear answer. I guess I got better IOPS thanks to the 10K
>> disks.
>
>
> 10K RPM would bring your per-drive throughput (for 4K random writes)
> up to 142 IOPS and your aggregate cluster throughput up to 1700.
> This would predict a corresponding RADOSbench throughput somewhere
> above 425 (how much better depending on write aggregation and cylinder
> affinity).  Your RADOSbench 708 now seems even more reasonable.
>
>> To be really honest I wasn't so concerned about the RADOS benchmarks
>> but more about the RBD fio benchmarks and the amount of IOPS that comes
>> out of it, which I found a bit too low.
>
>
> Sticking with 4K random writes, it looks to me like you were running
> fio with libaio (which means direct, no buffer cache).  Because it
> is direct, every I/O operation is really happening and the best
> sustained throughput you should expect from this cluster is
> the aggregate raw fio 4K write throughput (1700 IOPS) divided
> by two copies = 850 random 4K writes per second.  If I read the
> output correctly you got 763 or about 90% of back-of-envelope.
>
> BUT, there are some footnotes (there always are with performance)
>
> If you had been doing buffered I/O you would have seen a lot more
> (up front) benefit from page caching ... but you wouldn't have been
> measuring real (and hence sustainable) I/O throughput ... which is
> ultimately limited by the heads on those twelve disk drives, where
> all of those writes ultimately wind up.  It is easy to be fast
> if you aren't really doing the writes :-)
>
> I would have expected write aggregation and cylinder affinity to
> have eliminated some seeks and improved rotational latency resulting
> in better than theoretical random write throughput.  Against those
> expectations 763/850 IOPS is not so impressive.  But, it looks to
> me like you were running fio in a 1G file with 100 parallel requests.
> The default RBD stripe width is 4M.  This means that those 100
> parallel requests were being spread across 256 (1G/4M) objects.
> People in the know tell me that writes to a single object are
> serialized, which means that many of those (potentially) parallel
> writes were to the same object, and hence serialized.  This would
> increase the average request time for the colliding operations,
> and reduce the aggregate throughput correspondingly.  Use a
> bigger file (or a narrower stripe) and this will get better.
>
> Thus, getting 763 random 4K write IOPs out of those 12 drives
> still sounds about right to me.
>
>
>> On 15 nov. 2012, at 19:43, Mark Kampe <mark.kampe@xxxxxxxxxxx> wrote:
>>
>>> Dear Sebastien,
>>>
>>> Ross Turk forwarded me your e-mail.  You sent a great deal
>>> of information, but it was not immediately obvious to me
>>> what your specific concern was.
>>>
>>> You have 4 servers, 3 OSDs per, 2 copy, and you measured a
>>> radosbench (4K object creation) throughput of 2.9MB/s
>>> (or 708 IOPS).  I infer that you were disappointed by
>>> this number, but it looks right to me.
>>>
>>> Assuming typical 7200 RPM drives, I would guess that each
>>> of them would deliver a sustained direct 4K random write
>>> performance in the general neighborhood of:
>>>     4ms seek (short seeks with write-settle-downs)
>>>     4ms latency (1/2 rotation)
>>>     0ms write (4K/144MB/s ~ 30us)
>>>     -----
>>>     8ms or about 125 IOPS
>>>
>>> Your twelve drives should therefore have a sustainable
>>> aggregate direct 4K random write throughput of 1500 IOPS.
>>>
>>> Each 4K object create involves four writes (two copies,
>>> each getting one data write and one data update).  Thus
>>> I would expect a (crude) 4K create rate of 375 IOPS (1500/4).
>>>
>>> You are getting almost twice the expected raw IOPS ...
>>> and we should expect that a large number of parallel
>>> operations would realize some write/seek aggregation
>>> benefits ... so these numbers look right to me.
>>>
>>> Is this the number you were concerned about, or have I
>>> misunderstood?
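For anyone replaying the back-of-envelope numbers in the quoted replies, the drive-level model sketches out as follows (my reading of Mark's arithmetic, not an authoritative model; the 4 ms seek and rounded half-rotation latency are his figures, and rounding accounts for the small differences from the 142/1700/850/425 quoted above):

```python
# Sketch of the per-spindle IOPS model used in the thread.
# Assumptions (from the thread): 4 ms average seek, half-rotation
# latency rounded to whole milliseconds, negligible 4K transfer time,
# 12 drives, 2-copy replication, 4 writes per object create.

def drive_iops(rpm, seek_ms=4):
    """Sustained direct 4K random-write IOPS for one spindle."""
    latency_ms = round(0.5 * 60_000 / rpm)  # half a rotation, in ms
    return 1000 / (seek_ms + latency_ms)

DRIVES, REPLICAS, WRITES_PER_CREATE = 12, 2, 4

for rpm in (7200, 10_000):
    raw = drive_iops(rpm) * DRIVES          # aggregate raw 4K writes
    print(f"{rpm} RPM: {drive_iops(rpm):.0f} IOPS/drive, "
          f"{raw:.0f} raw, "
          f"{raw / REPLICAS:.0f} 4K randwrites/s, "
          f"{raw / WRITES_PER_CREATE:.0f} creates/s")
```

At 7200 RPM this gives 125 IOPS/drive, 1500 aggregate, and 375 creates/s; at 10K RPM it gives ~142 IOPS/drive, ~1714 aggregate, and ~857 sustainable 4K random writes, matching the ~850 ceiling against which the measured 763 was judged.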
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

