Re: RBD fio Performance concerns

Same for me:
rand 4k: 23,000 IOPS
seq 4k: 13,000 IOPS

Even in writeback mode, where sequential 4k requests should normally be merged into bigger ones.
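
A quick way to check whether that merging actually happens (just a sketch; I'm assuming the image is mapped through the kernel client as /dev/rbd1, like in Sébastien's fio output further down, whose disk stats already show merge=0/0 for rbd1) is to watch the merge counters and average request size while the seq 4k job runs:

   # wrqm/s = write requests merged per second,
   # avgrq-sz = average request size in 512-byte sectors (4k = 8 sectors)
   iostat -x rbd1 1

If the 4k writes were really being coalesced, avgrq-sz should climb well above 8.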

Stefan

On 21.11.2012 17:34, Mark Nelson wrote:
Responding to my own message. :)

Talked to Sage a bit offline about this.  I think there are two opposing
forces:

On one hand, random IO may be spreading reads/writes out across more
OSDs, whereas sequential IO presumably hits a single OSD more
regularly.

On the other hand, you'd expect that sequential writes would be getting
coalesced either at the RBD layer or on the OSD, and that the
drive/controller/filesystem underneath the OSD would be doing some kind
of readahead or prefetching.

On the third hand, maybe coalescing/prefetching is in fact happening but
we are IOPS-limited by some per-OSD bottleneck.

It could be interesting to do the test with a single OSD and see what
happens.

Mark

On 11/21/2012 09:52 AM, Mark Nelson wrote:
Hi Guys,

I'm late to this thread but thought I'd chime in.  Crazy that you are
getting higher performance with random reads/writes vs sequential!  It
would be interesting to see what kind of throughput smalliobench reports
(should be packaged in bobtail) and also see if this behavior happens
with cephfs.  It's still too early in the morning for me right now to
come up with a reasonable explanation for what's going on.  It might be
worth running blktrace and seekwatcher to see what the io patterns on
the underlying disk look like in each case.  Maybe something unexpected
is going on.
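
For the record, something along these lines on the OSD node would do it (a rough sketch; the data-disk device name /dev/sdb and the 60 s window are placeholders to adjust):

   # capture 60 s of block-layer events on the OSD data disk while fio runs on the client
   blktrace -d /dev/sdb -o seq-4k-trace -w 60
   # render the trace as a seek/throughput picture
   seekwatcher -t seq-4k-trace -o seq-4k.png
   # or dump the raw events
   blkparse -i seq-4k-trace | less

One capture for the sequential run and one for the random run, compared side by side, should show whether the requests hitting the disk are being coalesced.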

Mark

On 11/19/2012 02:57 PM, Sébastien Han wrote:
Which iodepth did you use for those benchmarks?


I really don't understand why I can't get more rand read iops with 4K
block ...

Me neither; I hope to get some clarification from the Inktank guys. It
doesn't make any sense to me...
--
Best regards.
Sébastien HAN.


On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER
<aderumier@xxxxxxxxx> wrote:
@Alexandre: is it the same for you? or do you always get more IOPS
with seq?

rand read 4K: 6000 IOPS
seq read 4K: 3500 IOPS
seq read 4M: 31 IOPS (1 gigabit client bandwidth limit)

rand write 4K: 6000 IOPS (tmpfs journal)
seq write 4K: 1600 IOPS
seq write 4M: 31 IOPS (1 gigabit client bandwidth limit)
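
(For reference, 31 IOPS at a 4M block size works out to roughly 31 x 4 MB ≈ 124 MB/s, which is right at the ~125 MB/s ceiling of a 1 Gbit/s link, so the 4M numbers are purely wire-speed limited.)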


I really don't understand why I can't get more rand read iops with 4K
block ...

I tried with a high-end CPU for the client; it doesn't change anything.
The test cluster uses old 8-core E5420s @ 2.50 GHz (but CPU usage on the
cluster is around 15% during the read bench).


----- Original Message -----

From: "Sébastien Han" <han.sebastien@xxxxxxxxx>
To: "Mark Kampe" <mark.kampe@xxxxxxxxxxx>
Cc: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>, "ceph-devel"
<ceph-devel@xxxxxxxxxxxxxxx>
Sent: Monday, November 19, 2012 19:03:40
Subject: Re: RBD fio Performance concerns

@Sage, thanks for the info :)
@Mark:

If you want to do sequential I/O, you should do it buffered
(so that the writes can be aggregated) or with a 4M block size
(very efficient and avoiding object serialization).

The original benchmark was performed with a 4M block size, and as
you can see I still get more IOPS with rand than with seq... I just tried
4M without direct I/O, still the same. I can post the fio results if
needed.

We do direct writes for benchmarking, not because it is a reasonable
way to do I/O, but because it bypasses the buffer cache and enables
us to directly measure cluster I/O throughput (which is what we are
trying to optimize). Applications should usually do buffered I/O,
to get the (very significant) benefits of caching and write
aggregation.

I know why I use direct I/O. These are synthetic benchmarks, far away
from a real-life scenario and from how common applications work. I'm just
trying to see the maximum I/O throughput that I can get from my RBD. All
my applications use buffered I/O.

@Alexandre: is it the same for you? or do you always get more IOPS
with seq?

Thanks to all of you.


On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe <mark.kampe@xxxxxxxxxxx>
wrote:
Recall:
1. RBD volumes are striped (4M wide) across RADOS objects
2. distinct writes to a single RADOS object are serialized

Your sequential 4K writes are direct, depth=256, so there are
(at all times) 256 writes queued to the same object. All of
your writes are waiting in a very long line, which adds
horrendous latency.

If you want to do sequential I/O, you should do it buffered
(so that the writes can be aggregated) or with a 4M block size
(very efficient and avoiding object serialization).
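
To make that concrete, here is roughly how the variants would look from the command line (a sketch only; I'm borrowing the parameters from Sébastien's job file below -- libaio, iodepth=256, 60 s runs -- and assuming the test device is the mapped image /dev/rbd1, as his disk stats suggest; careful, this writes to the raw device):

   # the painful case: direct, sequential, 4k, 256 requests queued on the same 4M object
   fio --name=seq-write-4k --filename=/dev/rbd1 --rw=write --bs=4k \
       --direct=1 --ioengine=libaio --iodepth=256 --runtime=60
   # buffered instead, so the page cache can aggregate the 4k writes
   fio --name=seq-write-buffered --filename=/dev/rbd1 --rw=write --bs=4k \
       --direct=0 --ioengine=libaio --iodepth=256 --runtime=60
   # or keep direct I/O but write whole 4M stripes, so each request lands on its own object
   fio --name=seq-write-4m --filename=/dev/rbd1 --rw=write --bs=4M \
       --direct=1 --ioengine=libaio --iodepth=256 --runtime=60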

We do direct writes for benchmarking, not because it is a reasonable
way to do I/O, but because it bypasses the buffer cache and enables
us to directly measure cluster I/O throughput (which is what we are
trying to optimize). Applications should usually do buffered I/O,
to get the (very significant) benefits of caching and write
aggregation.


That's correct for some of the benchmarks. However, even with 4K for
seq, I still get fewer IOPS. See my latest fio run below:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 02m:59s]
seq-read: (groupid=0, jobs=1): err= 0: pid=15096
read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued r/w/d: total=200473/0/0, short=0/0/0

lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
rand-read: (groupid=1, jobs=1): err= 0: pid=16846
read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62
cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued r/w/d: total=1632349/0/0, short=0/0/0

lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
seq-write: (groupid=2, jobs=1): err= 0: pid=18653
write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65, stdev=353.97
cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued r/w/d: total=0/11171/0, short=0/0/0

lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
rand-write: (groupid=3, jobs=1): err= 0: pid=20446
write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21, stdev=2000.45
cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued r/w/d: total=0/52147/0, short=0/0/0

lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%

Run status group 0 (all jobs):
READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s, mint=60053msec, maxt=60053msec

Run status group 1 (all jobs):
READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s, maxb=111425KB/s, mint=60005msec, maxt=60005msec

Run status group 2 (all jobs):
WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s, mint=60725msec, maxt=60725msec

Run status group 3 (all jobs):
WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s, mint=60822msec, maxt=60822msec

Disk stats (read/write):
rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132, in_queue=33434120, util=99.79%