>> why the sequential read/writes are lower than the random ones? Or maybe do I
>> just need to care about the bandwidth for those values?

If I remember correctly, you use fio with a 4MB block size for sequential runs. So it's normal that you get fewer IOPS, but more bandwidth.

----- Original Message -----
From: "Sébastien Han" <han.sebastien@xxxxxxxxx>
To: "Mark Kampe" <mark.kampe@xxxxxxxxxxx>
Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
Sent: Monday, 19 November 2012 15:56:35
Subject: Re: RBD fio Performance concerns

Hello Mark,

First of all, thank you again for another accurate answer :-).

> I would have expected write aggregation and cylinder affinity to
> have eliminated some seeks and improved rotational latency resulting
> in better than theoretical random write throughput. Against those
> expectations 763/850 IOPS is not so impressive. But, it looks to
> me like you were running fio in a 1G file with 100 parallel requests.
> The default RBD stripe width is 4M. This means that those 100
> parallel requests were being spread across 256 (1G/4M) objects.
> People in the know tell me that writes to a single object are
> serialized, which means that many of those (potentially) parallel
> writes were to the same object, and hence serialized. This would
> increase the average request time for the colliding operations,
> and reduce the aggregate throughput correspondingly. Use a
> bigger file (or a narrower stripe) and this will get better.

I followed your advice and used a bigger file (10G) and an iodepth of 128, and I've been able to reach ~27k IOPS for random reads, but I couldn't get more than 870 IOPS for random writes... That's more or less expected. But the thing I still don't understand is: why are the sequential read/writes lower than the random ones? Or do I just need to look at the bandwidth for those values?

Thank you.

Regards.
--
Best regards.
Sébastien HAN.
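To make the comparison concrete, a fio job file for this kind of test might look like the sketch below. It illustrates why a 4MB sequential job reports few IOPS but high bandwidth while a 4KB random job does the opposite (IOPS = bandwidth / block size). The device path and runtime are assumptions, not the exact values used in the benchmark:

```ini
; Hypothetical fio jobs contrasting sequential (big-block) and random
; (small-block) runs. A 4MB block needs ~1000x fewer ops than a 4KB
; block to move the same number of bytes, so sequential runs show
; fewer IOPS but more MB/s.
[global]
ioengine=libaio     ; async direct I/O, as used in the thread
direct=1            ; bypass the page cache
filename=/dev/rbd0  ; assumed RBD device path
runtime=60
iodepth=128

[seq-read]
rw=read
bs=4m               ; big blocks: high MB/s, low IOPS

[rand-read]
stonewall           ; start after the sequential job finishes
rw=randread
bs=4k               ; small blocks: high IOPS, low MB/s
```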
On Fri, Nov 16, 2012 at 11:59 PM, Mark Kampe <mark.kampe@xxxxxxxxxxx> wrote:
> On 11/15/2012 12:23 PM, Sébastien Han wrote:
>
>> First of all, I would like to thank you for this well explained,
>> structured and clear answer. I guess I got better IOPS thanks to the
>> 10K disks.
>
> 10K RPM would bring your per-drive throughput (for 4K random writes)
> up to 142 IOPS and your aggregate cluster throughput up to 1700.
> This would predict a corresponding RADOSbench throughput somewhere
> above 425 (how much better depending on write aggregation and cylinder
> affinity). Your RADOSbench 708 now seems even more reasonable.
>
>> To be really honest I wasn't so concerned about the RADOS benchmarks
>> but more about the RBD fio benchmarks and the amount of IOPS that comes
>> out of them, which I found a bit too low.
>
> Sticking with 4K random writes, it looks to me like you were running
> fio with libaio (which means direct, no buffer cache). Because it
> is direct, every I/O operation is really happening, and the best
> sustained throughput you should expect from this cluster is
> the aggregate raw fio 4K write throughput (1700 IOPS) divided
> by two copies = 850 random 4K writes per second. If I read the
> output correctly you got 763, or about 90% of back-of-envelope.
>
> BUT, there are some footnotes (there always are with performance).
>
> If you had been doing buffered I/O you would have seen a lot more
> (up front) benefit from page caching ... but you wouldn't have been
> measuring real (and hence sustainable) I/O throughput ... which is
> ultimately limited by the heads on those twelve disk drives, where
> all of those writes ultimately wind up. It is easy to be fast
> if you aren't really doing the writes :-)
>
> I would have expected write aggregation and cylinder affinity to
> have eliminated some seeks and improved rotational latency, resulting
> in better than theoretical random write throughput. Against those
> expectations, 763/850 IOPS is not so impressive.
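Mark's back-of-envelope numbers for the 10K RPM case can be reproduced with a few lines of arithmetic. This is only a sketch of the reasoning; the 4 ms short-seek figure is his assumption, not a measured value:

```python
def per_drive_iops(rpm, seek_ms):
    """Service time for one direct 4K random write: a short seek plus
    half a rotation; the 4K transfer itself (~30 us) is negligible."""
    half_rotation_ms = 0.5 * 60_000 / rpm   # ms per half revolution
    return 1000 / (seek_ms + half_rotation_ms)

drive = per_drive_iops(rpm=10_000, seek_ms=4.0)  # ~142 IOPS per drive
cluster_raw = 12 * drive                         # ~1700 IOPS across 12 OSD disks
client_limit = cluster_raw / 2                   # 2 copies -> ~850 client writes/s

print(round(drive), round(cluster_raw), round(client_limit))
print(f"{763 / client_limit:.0%}")  # the measured 763 IOPS vs. the ceiling
```

The division by two at the end is the key step: with 2x replication, every client write costs two disk writes, so the client-visible ceiling is half the raw aggregate.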
> But, it looks to
> me like you were running fio in a 1G file with 100 parallel requests.
> The default RBD stripe width is 4M. This means that those 100
> parallel requests were being spread across 256 (1G/4M) objects.
> People in the know tell me that writes to a single object are
> serialized, which means that many of those (potentially) parallel
> writes were to the same object, and hence serialized. This would
> increase the average request time for the colliding operations,
> and reduce the aggregate throughput correspondingly. Use a
> bigger file (or a narrower stripe) and this will get better.
>
> Thus, getting 763 random 4K write IOPS out of those 12 drives
> still sounds about right to me.
>
>> On 15 Nov 2012, at 19:43, Mark Kampe <mark.kampe@xxxxxxxxxxx> wrote:
>>
>>> Dear Sébastien,
>>>
>>> Ross Turk forwarded me your e-mail. You sent a great deal
>>> of information, but it was not immediately obvious to me
>>> what your specific concern was.
>>>
>>> You have 4 servers, 3 OSDs per server, 2 copies, and you measured a
>>> radosbench (4K object creation) throughput of 2.9MB/s
>>> (or 708 IOPS). I infer that you were disappointed by
>>> this number, but it looks right to me.
>>>
>>> Assuming typical 7200 RPM drives, I would guess that each
>>> of them would deliver a sustained direct 4K random write
>>> performance in the general neighborhood of:
>>>     4ms seek (short seeks with write-settle-downs)
>>>     4ms latency (1/2 rotation)
>>>     0ms write (4K / 144MB/s ~ 30us)
>>>     -----
>>>     8ms, or about 125 IOPS
>>>
>>> Your twelve drives should therefore have a sustainable
>>> aggregate direct 4K random write throughput of 1500 IOPS.
>>>
>>> Each 4K object create involves four writes (two copies,
>>> each getting one data write and one data update). Thus
>>> I would expect a (crude) 4K create rate of 375 IOPS (1500/4).
>>>
>>> You are getting almost twice the expected raw IOPS ...
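The two calculations quoted above, the object count behind a striped RBD file and the 7200 RPM create-rate estimate, can be sketched as follows. The seek and rotation figures are Mark's assumptions:

```python
# How many RADOS objects back a fio file, given RBD's default 4M stripe.
MB = 1024 * 1024

def objects_in_file(file_bytes, stripe_bytes=4 * MB):
    return file_bytes // stripe_bytes

print(objects_in_file(1 * 1024 * MB))    # 1G file  -> 256 objects
print(objects_in_file(10 * 1024 * MB))   # 10G file -> 2560 objects
# With 100 parallel requests spread over only 256 objects, two requests
# often land on the same object and get serialized; a 10G file spreads
# the same queue over 10x more objects, so collisions are rarer.

# 7200 RPM back-of-envelope: 4ms seek + 4ms half-rotation, write ~0ms.
service_ms = 4.0 + 4.0
per_drive = 1000 / service_ms            # ~125 IOPS per drive
raw = 12 * per_drive                     # ~1500 IOPS across 12 drives
creates = raw / 4                        # 4 writes per create -> ~375/s
print(per_drive, raw, creates)
```

Note that the measured 708 creates/s is nearly twice this crude 375 estimate, which is consistent with Mark's point about write aggregation and seek affinity helping the random-write path.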
>>> and we should expect that a large number of parallel
>>> operations would realize some write/seek aggregation
>>> benefits ... so these numbers look right to me.
>>>
>>> Is this the number you were concerned about, or have I
>>> misunderstood?
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html