>> why the sequential read/writes are lower than the random ones? Or maybe do I
>> just need to care about the bandwidth for those values?

If I remember correctly, you use fio with a 4MB block size for sequential runs. So it's normal that you get fewer IOPS, but more bandwidth.

----- Original Message -----
From: "Sébastien Han" <han.sebastien@xxxxxxxxx>
To: "Mark Kampe" <mark.kampe@xxxxxxxxxxx>
Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
Sent: Monday, 19 November 2012 15:56:35
Subject: Re: RBD fio Performance concerns

Hello Mark,

First of all, thank you again for another accurate answer :-).

> I would have expected write aggregation and cylinder affinity to
> have eliminated some seeks and improved rotational latency resulting
> in better than theoretical random write throughput. Against those
> expectations 763/850 IOPS is not so impressive. But, it looks to
> me like you were running fio in a 1G file with 100 parallel requests.
> The default RBD stripe width is 4M. This means that those 100
> parallel requests were being spread across 256 (1G/4M) objects.
> People in the know tell me that writes to a single object are
> serialized, which means that many of those (potentially) parallel
> writes were to the same object, and hence serialized. This would
> increase the average request time for the colliding operations,
> and reduce the aggregate throughput correspondingly. Use a
> bigger file (or a narrower stripe) and this will get better.

I followed your advice and used a bigger file (10G) and an iodepth of 128, and I've been able to reach ~27k IOPS for random reads, but I couldn't get more than 870 IOPS for random writes... That's more or less expected. But the thing I still don't understand is: why are the sequential read/writes lower than the random ones? Or do I just need to look at the bandwidth for those values?

Thank you.

Regards.
--
Best regards.
Sébastien HAN.
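To make the comparison concrete, a fio job file for this kind of test might look like the sketch below. It illustrates why a 4MB sequential job reports few IOPS but high bandwidth while a 4KB random job does the opposite (IOPS = bandwidth / block size). The device path and runtime are assumptions, not the exact values used in the benchmark:

```ini
; Hypothetical fio jobs contrasting sequential (big-block) and random
; (small-block) runs. A 4MB block needs ~1000x fewer ops than a 4KB
; block to move the same number of bytes, so sequential runs show
; fewer IOPS but more MB/s.
[global]
ioengine=libaio     ; async direct I/O, as used in the thread
direct=1            ; bypass the page cache
filename=/dev/rbd0  ; assumed RBD device path
runtime=60
iodepth=128

[seq-read]
rw=read
bs=4m               ; big blocks: high MB/s, low IOPS

[rand-read]
stonewall           ; start after the sequential job finishes
rw=randread
bs=4k               ; small blocks: high IOPS, low MB/s
```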
On Fri, Nov 16, 2012 at 11:59 PM, Mark Kampe <mark.kampe@xxxxxxxxxxx> wrote:
> On 11/15/2012 12:23 PM, Sébastien Han wrote:
>
>> First of all, I would like to thank you for this well explained,
>> structured and clear answer. I guess I got better IOPS thanks to the
>> 10K disks.
>
> 10K RPM would bring your per-drive throughput (for 4K random writes)
> up to 142 IOPS and your aggregate cluster throughput up to 1700.
> This would predict a corresponding RADOSbench throughput somewhere
> above 425 (how much better depending on write aggregation and cylinder
> affinity). Your RADOSbench 708 now seems even more reasonable.
>
>> To be really honest I wasn't so concerned about the RADOS benchmarks
>> but more about the RBD fio benchmarks and the amount of IOPS that comes
>> out of them, which I found a bit too low.
>
> Sticking with 4K random writes, it looks to me like you were running
> fio with libaio (which means direct, no buffer cache). Because it
> is direct, every I/O operation is really happening, and the best
> sustained throughput you should expect from this cluster is
> the aggregate raw fio 4K write throughput (1700 IOPS) divided
> by two copies = 850 random 4K writes per second. If I read the
> output correctly you got 763, or about 90% of back-of-envelope.
>
> BUT, there are some footnotes (there always are with performance).
>
> If you had been doing buffered I/O you would have seen a lot more
> (up front) benefit from page caching ... but you wouldn't have been
> measuring real (and hence sustainable) I/O throughput ... which is
> ultimately limited by the heads on those twelve disk drives, where
> all of those writes ultimately wind up. It is easy to be fast
> if you aren't really doing the writes :-)
>
> I would have expected write aggregation and cylinder affinity to
> have eliminated some seeks and improved rotational latency, resulting
> in better than theoretical random write throughput. Against those
> expectations, 763/850 IOPS is not so impressive.
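Mark's back-of-envelope numbers for the 10K RPM case can be reproduced with a few lines of arithmetic. This is only a sketch of the reasoning; the 4 ms short-seek figure is his assumption, not a measured value:

```python
def per_drive_iops(rpm, seek_ms):
    """Service time for one direct 4K random write: a short seek plus
    half a rotation; the 4K transfer itself (~30 us) is negligible."""
    half_rotation_ms = 0.5 * 60_000 / rpm   # ms per half revolution
    return 1000 / (seek_ms + half_rotation_ms)

drive = per_drive_iops(rpm=10_000, seek_ms=4.0)  # ~142 IOPS per drive
cluster_raw = 12 * drive                         # ~1700 IOPS across 12 OSD disks
client_limit = cluster_raw / 2                   # 2 copies -> ~850 client writes/s

print(round(drive), round(cluster_raw), round(client_limit))
print(f"{763 / client_limit:.0%}")  # the measured 763 IOPS vs. the ceiling
```

The division by two at the end is the key step: with 2x replication, every client write costs two disk writes, so the client-visible ceiling is half the raw aggregate.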
> But, it looks to
> me like you were running fio in a 1G file with 100 parallel requests.
> The default RBD stripe width is 4M. This means that those 100
> parallel requests were being spread across 256 (1G/4M) objects.
> People in the know tell me that writes to a single object are
> serialized, which means that many of those (potentially) parallel
> writes were to the same object, and hence serialized. This would
> increase the average request time for the colliding operations,
> and reduce the aggregate throughput correspondingly. Use a
> bigger file (or a narrower stripe) and this will get better.
>
> Thus, getting 763 random 4K write IOPS out of those 12 drives
> still sounds about right to me.
>
>> On 15 Nov 2012, at 19:43, Mark Kampe <mark.kampe@xxxxxxxxxxx> wrote:
>>
>>> Dear Sébastien,
>>>
>>> Ross Turk forwarded me your e-mail. You sent a great deal
>>> of information, but it was not immediately obvious to me
>>> what your specific concern was.
>>>
>>> You have 4 servers, 3 OSDs per server, 2 copies, and you measured a
>>> radosbench (4K object creation) throughput of 2.9MB/s
>>> (or 708 IOPS). I infer that you were disappointed by
>>> this number, but it looks right to me.
>>>
>>> Assuming typical 7200 RPM drives, I would guess that each
>>> of them would deliver a sustained direct 4K random write
>>> performance in the general neighborhood of:
>>>     4ms seek (short seeks with write-settle-downs)
>>>     4ms latency (1/2 rotation)
>>>     0ms write (4K / 144MB/s ~ 30us)
>>>     -----
>>>     8ms, or about 125 IOPS
>>>
>>> Your twelve drives should therefore have a sustainable
>>> aggregate direct 4K random write throughput of 1500 IOPS.
>>>
>>> Each 4K object create involves four writes (two copies,
>>> each getting one data write and one data update). Thus
>>> I would expect a (crude) 4K create rate of 375 IOPS (1500/4).
>>>
>>> You are getting almost twice the expected raw IOPS ...
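The two calculations quoted above, the object count behind a striped RBD file and the 7200 RPM create-rate estimate, can be sketched as follows. The seek and rotation figures are Mark's assumptions:

```python
# How many RADOS objects back a fio file, given RBD's default 4M stripe.
MB = 1024 * 1024

def objects_in_file(file_bytes, stripe_bytes=4 * MB):
    return file_bytes // stripe_bytes

print(objects_in_file(1 * 1024 * MB))    # 1G file  -> 256 objects
print(objects_in_file(10 * 1024 * MB))   # 10G file -> 2560 objects
# With 100 parallel requests spread over only 256 objects, two requests
# often land on the same object and get serialized; a 10G file spreads
# the same queue over 10x more objects, so collisions are rarer.

# 7200 RPM back-of-envelope: 4ms seek + 4ms half-rotation, write ~0ms.
service_ms = 4.0 + 4.0
per_drive = 1000 / service_ms            # ~125 IOPS per drive
raw = 12 * per_drive                     # ~1500 IOPS across 12 drives
creates = raw / 4                        # 4 writes per create -> ~375/s
print(per_drive, raw, creates)
```

Note that the measured 708 creates/s is nearly twice this crude 375 estimate, which is consistent with Mark's point about write aggregation and seek affinity helping the random-write path.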
>>> and we should expect that a large number of parallel
>>> operations would realize some write/seek aggregation
>>> benefits ... so these numbers look right to me.
>>>
>>> Is this the number you were concerned about, or have I
>>> misunderstood?
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html