> If I remember, you use fio with 4MB block size for sequential.
> So it's normal that you have less ios, but more bandwith.

That's correct for some of the benchmarks. However, even with a 4K block
size for the sequential jobs, I still get fewer IOPS than for the random
ones. See my latest fio run below:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 02m:59s]
seq-read: (groupid=0, jobs=1): err= 0: pid=15096
  read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
    slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
    clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
     lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
    bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
  cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=200473/0/0, short=0/0/0
     lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
rand-read: (groupid=1, jobs=1): err= 0: pid=16846
  read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
    slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
    clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
     lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
    bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62
  cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=1632349/0/0, short=0/0/0
     lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
seq-write: (groupid=2, jobs=1): err= 0: pid=18653
  write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
    slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
    clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
     lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
    bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65, stdev=353.97
  cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=0/11171/0, short=0/0/0
     lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
     lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
rand-write: (groupid=3, jobs=1): err= 0: pid=20446
  write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
    slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
    clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
     lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
    bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21, stdev=2000.45
  cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=0/52147/0, short=0/0/0
     lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
     lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%

Run status group 0 (all jobs):
   READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s, mint=60053msec, maxt=60053msec

Run status group 1 (all jobs):
   READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s, maxb=111425KB/s, mint=60005msec, maxt=60005msec

Run status group 2 (all jobs):
  WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s, mint=60725msec, maxt=60725msec

Run status group 3 (all jobs):
  WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s, mint=60822msec, maxt=60822msec

Disk stats (read/write):
  rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132, in_queue=33434120, util=99.79%
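For reference, rbd-bench.fio is essentially a four-job, four-group layout like the
sketch below. The job names, 4K block size, libaio engine, iodepth=256 and the ~60 s
runtime come straight from the output above, and the stonewalls are inferred from the
four reporting groups; the target file, size=10g and direct=1 are plausible fill-ins
rather than a verbatim copy of my file:

# rbd-bench.fio -- sketch reconstructed from the run above
# (target path, size and direct flag are assumptions, not copied verbatim)
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=256
runtime=60
size=10g
# assumed target: a 10G test file sitting on the mapped RBD (/dev/rbd1)
filename=/mnt/rbd/fio-test

[seq-read]
rw=read

[rand-read]
stonewall
rw=randread

[seq-write]
stonewall
rw=write

[rand-write]
stonewall
rw=randwrite

Each stonewall waits for the previous job to finish and starts a new reporting
group, which is why the four jobs show up as groups 0-3 in the output.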
Cheers!
--
Bien cordialement.
Sébastien HAN.


On Mon, Nov 19, 2012 at 4:28 PM, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>>>why the
>>>sequential read/writes are lower than the randoms onces? Or maybe do I
>>>just need to care about the bandwidth for those values?
>
> If I remember, you use fio with 4MB block size for sequential.
> So it's normal that you have less ios, but more bandwith.
>
>
>
> ----- Mail original -----
>
> De: "Sébastien Han" <han.sebastien@xxxxxxxxx>
> À: "Mark Kampe" <mark.kampe@xxxxxxxxxxx>
> Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
> Envoyé: Lundi 19 Novembre 2012 15:56:35
> Objet: Re: RBD fio Performance concerns
>
> Hello Mark,
>
> First of all, thank you again for another accurate answer :-).
>
>> I would have expected write aggregation and cylinder affinity to
>> have eliminated some seeks and improved rotational latency resulting
>> in better than theoretical random write throughput. Against those
>> expectations 763/850 IOPS is not so impressive. But, it looks to
>> me like you were running fio in a 1G file with 100 parallel requests.
>> The default RBD stripe width is 4M. This means that those 100
>> parallel requests were being spread across 256 (1G/4M) objects.
>> People in the know tell me that writes to a single object are
>> serialized, which means that many of those (potentially) parallel
>> writes were to the same object, and hence serialized. This would
>> increase the average request time for the colliding operations,
>> and reduce the aggregate throughput correspondingly. Use a
>> bigger file (or a narrower stripe) and this will get better.
>
>
> I followed your advice and used a bigger file (10G) and an iodepth of
> 128 and I've been able to reach ~27k iops for rand reads but I
> couldn't reach more than 870 iops in randwrites... It's kind of
> expected. But the thing a still don't understand is: why the
> sequential read/writes are lower than the randoms onces? Or maybe do I
> just need to care about the bandwidth for those values?
>
> Thank you.
>
> Regards.
> --
> Bien cordialement.
> Sébastien HAN.
>
>
> On Fri, Nov 16, 2012 at 11:59 PM, Mark Kampe <mark.kampe@xxxxxxxxxxx> wrote:
>> On 11/15/2012 12:23 PM, Sébastien Han wrote:
>>
>>> First of all, I would like to thank you for this well explained,
>>> structured and clear answer. I guess I got better IOPS thanks to the 10K
>>> disks.
>>
>>
>> 10K RPM would bring your per-drive throughput (for 4K random writes)
>> up to 142 IOPS and your aggregate cluster throughput up to 1700.
>> This would predict a corresponding RADOSbench throughput somewhere
>> above 425 (how much better depending on write aggregation and cylinder
>> affinity). Your RADOSbench 708 now seems even more reasonable.
>>
>>> To be really honest I wasn't so concerned about the RADOS benchmarks
>>> but more about the RBD fio benchmarks and the amont of IOPS that comes
>>> out of it, which I found à bit to low.
>>
>>
>> Sticking with 4K random writes, it looks to me like you were running
>> fio with libaio (which means direct, no buffer cache). Because it
>> is direct, every I/O operation is really happening and the best
>> sustained throughput you should expect from this cluster is
>> the aggregate raw fio 4K write throughput (1700 IOPS) divided
>> by two copies = 850 random 4K writes per second. If I read the
>> output correctly you got 763 or about 90% of back-of-envelope.
>>
>> BUT, there are some footnotes (there always are with performance)
>>
>> If you had been doing buffered I/O you would have seen a lot more
>> (up front) benefit from page caching ... but you wouldn't have been
>> measuring real (and hence sustainable) I/O throughput ... which is
>> ultimately limited by the heads on those twelve disk drives, where
>> all of those writes ultimately wind up. It is easy to be fast
>> if you aren't really doing the writes :-)
>>
>> I would have expected write aggregation and cylinder affinity to
>> have eliminated some seeks and improved rotational latency resulting
>> in better than theoretical random write throughput. Against those
>> expectations 763/850 IOPS is not so impressive. But, it looks to
>> me like you were running fio in a 1G file with 100 parallel requests.
>> The default RBD stripe width is 4M. This means that those 100
>> parallel requests were being spread across 256 (1G/4M) objects.
>> People in the know tell me that writes to a single object are
>> serialized, which means that many of those (potentially) parallel
>> writes were to the same object, and hence serialized. This would
>> increase the average request time for the colliding operations,
>> and reduce the aggregate throughput correspondingly. Use a
>> bigger file (or a narrower stripe) and this will get better.
>>
>> Thus, getting 763 random 4K write IOPs out of those 12 drives
>> still sounds about right to me.
>>
>>
>>> On 15 nov. 2012, at 19:43, Mark Kampe <mark.kampe@xxxxxxxxxxx> wrote:
>>>
>>>> Dear Sebastien,
>>>>
>>>> Ross Turn forwarded me your e-mail. You sent a great deal
>>>> of information, but it was not immediately obvious to me
>>>> what your specific concern was.
>>>>
>>>> You have 4 servers, 3 OSDs per, 2 copy, and you measured a
>>>> radosbench (4K object creation) throughput of 2.9MB/s
>>>> (or 708 IOPS). I infer that you were disappointed by
>>>> this number, but it looks right to me.
>>>>
>>>> Assuming typical 7200 RPM drives, I would guess that each
>>>> of them would deliver a sustained direct 4K random write
>>>> performance in the general neighborhood of:
>>>>     4ms seek (short seeks with write-settle-downs)
>>>>     4ms latency (1/2 rotation)
>>>>     0ms write (4K/144MB/s ~ 30us)
>>>>     -----
>>>>     8ms or about 125 IOPS
>>>>
>>>> Your twelve drives should therefore have a sustainable
>>>> aggregate direct 4K random write throughput of 1500 IOPS.
>>>>
>>>> Each 4K object create involves four writes (two copies,
>>>> each getting one data write and one data update). Thus
>>>> I would expect a (crude) 4K create rate of 375 IOPS (1500/4).
>>>>
>>>> You are getting almost twice the expected raw IOPS ...
>>>> and we should expect that a large number of parallel
>>>> operations would realize some write/seek aggregation
>>>> benefits ... so these numbers look right to me.
>>>>
>>>> Is this the number you were concerned about, or have I
>>>> misunderstood?
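For what it's worth, Mark's back-of-envelope from the quoted messages can be written
out as a small Python sketch. The 4 ms seek, the half-rotation latency, the 12-drive
total and the 2x/4x write multipliers are his assumptions from above, not measurements:

# Back-of-envelope from the quoted thread -- assumptions, not measurements.

def drive_4k_write_iops(seek_ms, rpm):
    """Rough random-write IOPS for one drive: short seek plus half a rotation."""
    half_rotation_ms = 60000.0 / rpm / 2.0
    return 1000.0 / (seek_ms + half_rotation_ms)

drives = 12                                              # 4 servers x 3 OSDs
per_drive = drive_4k_write_iops(seek_ms=4.0, rpm=10000)  # ~143 IOPS (Mark quotes 142;
                                                         # ~122 at 7200 RPM, rounded to 125)
aggregate = drives * per_drive                           # ~1700 raw 4K write IOPS

# radosbench 4K object create: 2 copies x (1 data write + 1 update) = 4 disk writes
expected_creates = aggregate / 4                         # ~425-430/s, vs ~708 measured

# direct (libaio) 4K random writes over RBD: 2 copies = 2 real writes each
expected_rbd_randwrite = aggregate / 2                   # ~850 IOPS, vs 763-870 measured

print(round(per_drive), round(aggregate), round(expected_creates), round(expected_rbd_randwrite))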
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html