> If I remember, you use fio with 4MB block size for sequential.
> So it's normal that you have less ios, but more bandwith.

That's correct for some of the benchmarks. However, even with a 4K block
size for the sequential jobs, I still get fewer IOPS than for the random
ones. See my latest fio run below:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 02m:59s]
seq-read: (groupid=0, jobs=1): err= 0: pid=15096
  read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
    slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
    clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
     lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
    bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
  cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=200473/0/0, short=0/0/0
     lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
rand-read: (groupid=1, jobs=1): err= 0: pid=16846
  read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
    slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
    clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
     lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
    bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62
  cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=1632349/0/0, short=0/0/0
     lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
seq-write: (groupid=2, jobs=1): err= 0: pid=18653
  write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
    slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
    clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
     lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
    bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65, stdev=353.97
  cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=0/11171/0, short=0/0/0
     lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
     lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
rand-write: (groupid=3, jobs=1): err= 0: pid=20446
  write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
    slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
    clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
     lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
    bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21, stdev=2000.45
  cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=0/52147/0, short=0/0/0
     lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
     lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%

Run status group 0 (all jobs):
   READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s, mint=60053msec, maxt=60053msec

Run status group 1 (all jobs):
   READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s, maxb=111425KB/s, mint=60005msec, maxt=60005msec

Run status group 2 (all jobs):
  WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s, mint=60725msec, maxt=60725msec

Run status group 3 (all jobs):
  WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s, mint=60822msec, maxt=60822msec

Disk stats (read/write):
  rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132, in_queue=33434120, util=99.79%
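For reference, rbd-bench.fio is essentially a four-job, four-group layout like the
sketch below. The job names, 4K block size, libaio engine, iodepth=256 and the ~60 s
runtime come straight from the output above, and the stonewalls are inferred from the
four reporting groups; the target file, size=10g and direct=1 are plausible fill-ins
rather than a verbatim copy of my file:

# rbd-bench.fio -- sketch reconstructed from the run above
# (target path, size and direct flag are assumptions, not copied verbatim)
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=256
runtime=60
size=10g
# assumed target: a 10G test file sitting on the mapped RBD (/dev/rbd1)
filename=/mnt/rbd/fio-test

[seq-read]
rw=read

[rand-read]
stonewall
rw=randread

[seq-write]
stonewall
rw=write

[rand-write]
stonewall
rw=randwrite

Each stonewall waits for the previous job to finish and starts a new reporting
group, which is why the four jobs show up as groups 0-3 in the output.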
Cheers!
--
Bien cordialement.
Sébastien HAN.


On Mon, Nov 19, 2012 at 4:28 PM, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>>>why the
>>>sequential read/writes are lower than the randoms onces? Or maybe do I
>>>just need to care about the bandwidth for those values?
>
> If I remember, you use fio with 4MB block size for sequential.
> So it's normal that you have less ios, but more bandwith.
>
>
>
> ----- Mail original -----
>
> De: "Sébastien Han" <han.sebastien@xxxxxxxxx>
> À: "Mark Kampe" <mark.kampe@xxxxxxxxxxx>
> Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
> Envoyé: Lundi 19 Novembre 2012 15:56:35
> Objet: Re: RBD fio Performance concerns
>
> Hello Mark,
>
> First of all, thank you again for another accurate answer :-).
>
>> I would have expected write aggregation and cylinder affinity to
>> have eliminated some seeks and improved rotational latency resulting
>> in better than theoretical random write throughput. Against those
>> expectations 763/850 IOPS is not so impressive. But, it looks to
>> me like you were running fio in a 1G file with 100 parallel requests.
>> The default RBD stripe width is 4M. This means that those 100
>> parallel requests were being spread across 256 (1G/4M) objects.
>> People in the know tell me that writes to a single object are
>> serialized, which means that many of those (potentially) parallel
>> writes were to the same object, and hence serialized. This would
>> increase the average request time for the colliding operations,
>> and reduce the aggregate throughput correspondingly. Use a
>> bigger file (or a narrower stripe) and this will get better.
>
>
> I followed your advice and used a bigger file (10G) and an iodepth of
> 128 and I've been able to reach ~27k iops for rand reads but I
> couldn't reach more than 870 iops in randwrites... It's kind of
> expected. But the thing a still don't understand is: why the
> sequential read/writes are lower than the randoms onces? Or maybe do I
> just need to care about the bandwidth for those values?
>
> Thank you.
>
> Regards.
> --
> Bien cordialement.
> Sébastien HAN.
>
>
> On Fri, Nov 16, 2012 at 11:59 PM, Mark Kampe <mark.kampe@xxxxxxxxxxx> wrote:
>> On 11/15/2012 12:23 PM, Sébastien Han wrote:
>>
>>> First of all, I would like to thank you for this well explained,
>>> structured and clear answer. I guess I got better IOPS thanks to the 10K
>>> disks.
>>
>>
>> 10K RPM would bring your per-drive throughput (for 4K random writes)
>> up to 142 IOPS and your aggregate cluster throughput up to 1700.
>> This would predict a corresponding RADOSbench throughput somewhere
>> above 425 (how much better depending on write aggregation and cylinder
>> affinity). Your RADOSbench 708 now seems even more reasonable.
>>
>>> To be really honest I wasn't so concerned about the RADOS benchmarks
>>> but more about the RBD fio benchmarks and the amont of IOPS that comes
>>> out of it, which I found à bit to low.
>>
>>
>> Sticking with 4K random writes, it looks to me like you were running
>> fio with libaio (which means direct, no buffer cache). Because it
>> is direct, every I/O operation is really happening and the best
>> sustained throughput you should expect from this cluster is
>> the aggregate raw fio 4K write throughput (1700 IOPS) divided
>> by two copies = 850 random 4K writes per second. If I read the
>> output correctly you got 763 or about 90% of back-of-envelope.
>>
>> BUT, there are some footnotes (there always are with performance)
>>
>> If you had been doing buffered I/O you would have seen a lot more
>> (up front) benefit from page caching ... but you wouldn't have been
>> measuring real (and hence sustainable) I/O throughput ... which is
>> ultimately limited by the heads on those twelve disk drives, where
>> all of those writes ultimately wind up. It is easy to be fast
>> if you aren't really doing the writes :-)
>>
>> I would have expected write aggregation and cylinder affinity to
>> have eliminated some seeks and improved rotational latency resulting
>> in better than theoretical random write throughput. Against those
>> expectations 763/850 IOPS is not so impressive. But, it looks to
>> me like you were running fio in a 1G file with 100 parallel requests.
>> The default RBD stripe width is 4M. This means that those 100
>> parallel requests were being spread across 256 (1G/4M) objects.
>> People in the know tell me that writes to a single object are
>> serialized, which means that many of those (potentially) parallel
>> writes were to the same object, and hence serialized. This would
>> increase the average request time for the colliding operations,
>> and reduce the aggregate throughput correspondingly. Use a
>> bigger file (or a narrower stripe) and this will get better.
>>
>> Thus, getting 763 random 4K write IOPs out of those 12 drives
>> still sounds about right to me.
>>
>>
>>> On 15 nov. 2012, at 19:43, Mark Kampe <mark.kampe@xxxxxxxxxxx> wrote:
>>>
>>>> Dear Sebastien,
>>>>
>>>> Ross Turn forwarded me your e-mail. You sent a great deal
>>>> of information, but it was not immediately obvious to me
>>>> what your specific concern was.
>>>>
>>>> You have 4 servers, 3 OSDs per, 2 copy, and you measured a
>>>> radosbench (4K object creation) throughput of 2.9MB/s
>>>> (or 708 IOPS). I infer that you were disappointed by
>>>> this number, but it looks right to me.
>>>>
>>>> Assuming typical 7200 RPM drives, I would guess that each
>>>> of them would deliver a sustained direct 4K random write
>>>> performance in the general neighborhood of:
>>>>     4ms seek (short seeks with write-settle-downs)
>>>>     4ms latency (1/2 rotation)
>>>>     0ms write (4K/144MB/s ~ 30us)
>>>>     -----
>>>>     8ms or about 125 IOPS
>>>>
>>>> Your twelve drives should therefore have a sustainable
>>>> aggregate direct 4K random write throughput of 1500 IOPS.
>>>>
>>>> Each 4K object create involves four writes (two copies,
>>>> each getting one data write and one data update). Thus
>>>> I would expect a (crude) 4K create rate of 375 IOPS (1500/4).
>>>>
>>>> You are getting almost twice the expected raw IOPS ...
>>>> and we should expect that a large number of parallel
>>>> operations would realize some write/seek aggregation
>>>> benefits ... so these numbers look right to me.
>>>>
>>>> Is this the number you were concerned about, or have I
>>>> misunderstood?
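For what it's worth, Mark's back-of-envelope from the quoted messages can be written
out as a small Python sketch. The 4 ms seek, the half-rotation latency, the 12-drive
total and the 2x/4x write multipliers are his assumptions from above, not measurements:

# Back-of-envelope from the quoted thread -- assumptions, not measurements.

def drive_4k_write_iops(seek_ms, rpm):
    """Rough random-write IOPS for one drive: short seek plus half a rotation."""
    half_rotation_ms = 60000.0 / rpm / 2.0
    return 1000.0 / (seek_ms + half_rotation_ms)

drives = 12                                              # 4 servers x 3 OSDs
per_drive = drive_4k_write_iops(seek_ms=4.0, rpm=10000)  # ~143 IOPS (Mark quotes 142;
                                                         # ~122 at 7200 RPM, rounded to 125)
aggregate = drives * per_drive                           # ~1700 raw 4K write IOPS

# radosbench 4K object create: 2 copies x (1 data write + 1 update) = 4 disk writes
expected_creates = aggregate / 4                         # ~425-430/s, vs ~708 measured

# direct (libaio) 4K random writes over RBD: 2 copies = 2 real writes each
expected_rbd_randwrite = aggregate / 2                   # ~850 IOPS, vs 763-870 measured

print(round(per_drive), round(aggregate), round(expected_creates), round(expected_rbd_randwrite))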
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html