Re: RBD fio Performance concerns

Mark Nelson <mark.nelson@xxxxxxxxxxx> · Thu, 22 Nov 2012 08:34:29 -0600

On 11/22/2012 04:49 AM, Sébastien Han wrote:
@Alexandre: cool!

@ Stefan: Full SSD cluster and 10G switches? Couple of weeks ago I saw
that you use journal aio, did you notice performance improvement with it?

@Mark Kampe

 > If I read the above correctly, your random operations are 4K and your
 > sequential operations are 4M.

As you recommend. (see below what you previously said):

 > If you want to do sequential I/O, you should do it buffered
 > (so that the writes can be aggregated) or with a 4M block size
 > (very efficient and avoiding object serialization).

 > The block-size difference makes the random and sequential
 > results incomparable.

Ok let's do it again then (short output that fits on a screen), with a
single OSD and 4K blocks:

seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=42
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=42
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=42
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=42

seq-read: (groupid=0, jobs=1): err= 0: pid=7542
   read : io=912716KB, bw=15210KB/s, iops=3802 , runt= 60009msec
rand-read: (groupid=1, jobs=1): err= 0: pid=7546
   read : io=980504KB, bw=16339KB/s, iops=4084 , runt= 60009msec
seq-write: (groupid=2, jobs=1): err= 0: pid=7547
   write: io=54216KB, bw=922718 B/s, iops=225 , runt= 60167msec
rand-write: (groupid=3, jobs=1): err= 0: pid=7557
   write: io=66116KB, bw=1098.5KB/s, iops=274 , runt= 60192msec

Sequentials and random operations are getting closer to each others, but
random operations remain higher.

 > the more data you send me, the longer it takes me to find
 > time to review your results.  If you send me a message that
 > fits on a single screen, I will try to answer it immediately.

I just don't want to miss any information that you may find useful.

@ Mark Nelson

See below the new blkparse trace with 4K block for all operations:

Inline image 1

Thanks for doing this!  Unfortunately it's only showing writes so we 
don't know what read behavior looks like in these graphs.  That might be 
important.  Also, do you know approximately how the tests line up with 
timestamps on the seekwatcher results?  Seekwatcher only seems to be 
going for 228 seconds, but the 4 tests should be lasting 240+ seconds?

If I just break this down into 60s chunks with seq-read, rand-read, 
seq-write, rand-write in that order, my wildly speculative guess is that 
there's a chunk of time missing at the beginning where the read tests 
are happening that extend out to about second ~80-85 in the graph. 
After that it's the write tests from ~85 out to ~205.  After that, I 
guess that there are no more incoming writes, but existing writes are 
being flushed and that with no incoming writes we see a bump in 
performance (possibly due to a reduction in lock contention? What 
version of ceph is this again?)

A couple of thoughts:

- You may want to pause between the tests for a while (or even reformat 
between every test!)

- That spike in performance at the end is interesting.  I'd really like 
to know if that happened during the rand-write test or after the test 
completed (once the data hit the jouranl the writes would have been 
acknowledged letting the test end while the data was still being flushed 
out to disk).

- If my interpretation is right, it looks like the typical seq-write 
throughput is slightly higher according to blktrace but with regular 
dips, while the random write performance is typically lower but with no 
dips (and maybe a big increase at the end).  In both cases we have a 
very high number of seeks! Do you have WB cache on your controller? 
These are 10K RPM drives?

- Read behavior would be really useful!

- You can pretty clearly see the different AGs in XFS doing their thing. 
 I wonder if 1 AG would be better here.  On the other hand, there's a 
pretty long thread that discusses IOPs heavy workload on XFS here:

http://xfs.9218.n7.nabble.com/XFS-Abysmal-write-performance-because-of-excessive-seeking-allocation-groups-to-blame-td15501.html

- It would be very interesting to try this test on EXT4 or BTRFS and see 
if the results are the same.  I forget, did someone already do this?

Mark

Thanks again everyone, for your help.

Cheers!

On Thu, Nov 22, 2012 at 11:19 AM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx <mailto:s.priebe@xxxxxxxxxxxx>> wrote:
 >
 >
 > Same to me:
 > rand 4k: 23.000 iops
 > seq 4k: 13.000 iops
 >
 > Even in writeback mode where normally seq 4k should be merged into
bigger requests.
 >
 > Stefan
 >
 > Am 21.11.2012 17:34, schrieb Mark Nelson:
 >
 >> Responding to my own message. :)
 >>
 >> Talked to Sage a bit offline about this.  I think there are two opposing
 >> forces:
 >>
 >> On one hand, random IO may be spreading reads/writes out across more
 >> OSDs than sequential IO that presumably would be hitting a single OSD
 >> more regularly.
 >>
 >> On the other hand, you'd expect that sequential writes would be getting
 >> coalesced either at the RBD layer or on the OSD, and that the
 >> drive/controller/filesystem underneath the OSD would be doing some kind
 >> of readahead or prefetching.
 >>
 >> On the third hand, maybe coalescing/prefetching is in fact happening but
 >> we are IOP limited by some per-osd limitation.
 >>
 >> It could be interesting to do the test with a single OSD and see what
 >> happens.
 >>
 >> Mark
 >>
 >> On 11/21/2012 09:52 AM, Mark Nelson wrote:
 >>>
 >>> Hi Guys,
 >>>
 >>> I'm late to this thread but thought I'd chime in.  Crazy that you are
 >>> getting higher performance with random reads/writes vs sequential!  It
 >>> would be interesting to see what kind of throughput smalliobench
reports
 >>> (should be packaged in bobtail) and also see if this behavior happens
 >>> with cephfs.  It's still too early in the morning for me right now to
 >>> come up with a reasonable explanation for what's going on.  It might be
 >>> worth running blktrace and seekwatcher to see what the io patterns on
 >>> the underlying disk look like in each case.  Maybe something unexpected
 >>> is going on.
 >>>
 >>> Mark
 >>>
 >>> On 11/19/2012 02:57 PM, Sébastien Han wrote:
 >>>>
 >>>> Which iodepth did you use for those benchs?
 >>>>
 >>>>
 >>>>> I really don't understand why I can't get more rand read iops with 4K
 >>>>> block ...
 >>>>
 >>>>
 >>>> Me neither, hope to get some clarification from the Inktank guys. It
 >>>> doesn't make any sense to me...
 >>>> --
 >>>> Bien cordialement.
 >>>> Sébastien HAN.
 >>>>
 >>>>
 >>>> On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER
 >>>> <aderumier@xxxxxxxxx <mailto:aderumier@xxxxxxxxx>> wrote:
 >>>>>>>
 >>>>>>> @Alexandre: is it the same for you? or do you always get more IOPS
 >>>>>>> with seq?
 >>>>>
 >>>>>
 >>>>> rand read 4K : 6000 iops
 >>>>> seq read 4K : 3500 iops
 >>>>> seq read 4M : 31iops (1gigabit client bandwith limit)
 >>>>>
 >>>>> rand write 4k: 6000iops  (tmpfs journal)
 >>>>> seq write 4k: 1600iops
 >>>>> seq write 4M : 31iops (1gigabit client bandwith limit)
 >>>>>
 >>>>>
 >>>>> I really don't understand why I can't get more rand read iops with 4K
 >>>>> block ...
 >>>>>
 >>>>> I try with high end cpu for client, it doesn't change nothing.
 >>>>> But test cluster use  old 8 cores E5420  @ 2.50GHZ (But cpu is around
 >>>>> 15% on cluster during read bench)
 >>>>>
 >>>>>
 >>>>> ----- Mail original -----
 >>>>>
 >>>>> De: "Sébastien Han" <han.sebastien@xxxxxxxxx
<mailto:han.sebastien@xxxxxxxxx>>
 >>>>> À: "Mark Kampe" <mark.kampe@xxxxxxxxxxx
<mailto:mark.kampe@xxxxxxxxxxx>>
 >>>>> Cc: "Alexandre DERUMIER" <aderumier@xxxxxxxxx
<mailto:aderumier@xxxxxxxxx>>, "ceph-devel"
 >>>>> <ceph-devel@xxxxxxxxxxxxxxx <mailto:ceph-devel@xxxxxxxxxxxxxxx>>
 >>>>> Envoyé: Lundi 19 Novembre 2012 19:03:40
 >>>>> Objet: Re: RBD fio Performance concerns
 >>>>>
 >>>>> @Sage, thanks for the info :)
 >>>>> @Mark:
 >>>>>
 >>>>>> If you want to do sequential I/O, you should do it buffered
 >>>>>> (so that the writes can be aggregated) or with a 4M block size
 >>>>>> (very efficient and avoiding object serialization).
 >>>>>
 >>>>>
 >>>>> The original benchmark has been performed with 4M block size. And as
 >>>>> you can see I still get more IOPS with rand than seq... I just tried
 >>>>> with 4M without direct I/O, still the same. I can print fio
results if
 >>>>> it's needed.
 >>>>>
 >>>>>> We do direct writes for benchmarking, not because it is a reasonable
 >>>>>> way to do I/O, but because it bypasses the buffer cache and enables
 >>>>>> us to directly measure cluster I/O throughput (which is what we are
 >>>>>> trying to optimize). Applications should usually do buffered I/O,
 >>>>>> to get the (very significant) benefits of caching and write
 >>>>>> aggregation.
 >>>>>
 >>>>>
 >>>>> I know why I use direct I/O. It's synthetic benchmarks, it's far away
 >>>>> from a real life scenario and how common applications works. I just
 >>>>> try to see the maximum I/O throughput that I can get from my RBD. All
 >>>>> my applications use buffered I/O.
 >>>>>
 >>>>> @Alexandre: is it the same for you? or do you always get more IOPS
 >>>>> with seq?
 >>>>>
 >>>>> Thanks to all of you..
 >>>>>
 >>>>>
 >>>>> On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe
<mark.kampe@xxxxxxxxxxx <mailto:mark.kampe@xxxxxxxxxxx>>
 >>>>> wrote:
 >>>>>>
 >>>>>> Recall:
 >>>>>> 1. RBD volumes are striped (4M wide) across RADOS objects
 >>>>>> 2. distinct writes to a single RADOS object are serialized
 >>>>>>
 >>>>>> Your sequential 4K writes are direct, depth=256, so there are
 >>>>>> (at all times) 256 writes queued to the same object. All of
 >>>>>> your writes are waiting through a very long line, which is adding
 >>>>>> horrendous latency.
 >>>>>>
 >>>>>> If you want to do sequential I/O, you should do it buffered
 >>>>>> (so that the writes can be aggregated) or with a 4M block size
 >>>>>> (very efficient and avoiding object serialization).
 >>>>>>
 >>>>>> We do direct writes for benchmarking, not because it is a reasonable
 >>>>>> way to do I/O, but because it bypasses the buffer cache and enables
 >>>>>> us to directly measure cluster I/O throughput (which is what we are
 >>>>>> trying to optimize). Applications should usually do buffered I/O,
 >>>>>> to get the (very significant) benefits of caching and write
 >>>>>> aggregation.
 >>>>>>
 >>>>>>
 >>>>>>> That's correct for some of the benchmarks. However even with 4K for
 >>>>>>> seq, I still get less IOPS. See below my last fio:
 >>>>>>>
 >>>>>>> # fio rbd-bench.fio
 >>>>>>> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio,
 >>>>>>> iodepth=256
 >>>>>>> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
 >>>>>>> iodepth=256
 >>>>>>> seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio,
 >>>>>>> iodepth=256
 >>>>>>> rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
 >>>>>>> iodepth=256
 >>>>>>> fio 1.59
 >>>>>>> Starting 4 processes
 >>>>>>> Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta
 >>>>>>> 02m:59s]
 >>>>>>> seq-read: (groupid=0, jobs=1): err= 0: pid=15096
 >>>>>>> read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
 >>>>>>> slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
 >>>>>>> clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
 >>>>>>> lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
 >>>>>>> bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24,
 >>>>>>> stdev=6239.06
 >>>>>>> cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
 >>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
 >>>>>>>>
 >>>>>>>> =64=100.0%
 >>>>>>>
 >>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 >>>>>>>>
 >>>>>>>> =64=0.0%
 >>>>>>>
 >>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 >>>>>>>>
 >>>>>>>> =64=0.1%
 >>>>>>>
 >>>>>>> issued r/w/d: total=200473/0/0, short=0/0/0
 >>>>>>>
 >>>>>>> lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
 >>>>>>> rand-read: (groupid=1, jobs=1): err= 0: pid=16846
 >>>>>>> read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
 >>>>>>> slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
 >>>>>>> clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
 >>>>>>> lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
 >>>>>>> bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48,
 >>>>>>> stdev=648.62
 >>>>>>> cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
 >>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
 >>>>>>>>
 >>>>>>>> =64=100.0%
 >>>>>>>
 >>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 >>>>>>>>
 >>>>>>>> =64=0.0%
 >>>>>>>
 >>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 >>>>>>>>
 >>>>>>>> =64=0.1%
 >>>>>>>
 >>>>>>> issued r/w/d: total=1632349/0/0, short=0/0/0
 >>>>>>>
 >>>>>>> lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
 >>>>>>> seq-write: (groupid=2, jobs=1): err= 0: pid=18653
 >>>>>>> write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
 >>>>>>> slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
 >>>>>>> clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
 >>>>>>> lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
 >>>>>>> bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65,
 >>>>>>> stdev=353.97
 >>>>>>> cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
 >>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%,
 >>>>>>>>
 >>>>>>>> =64=99.4%
 >>>>>>>
 >>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 >>>>>>>>
 >>>>>>>> =64=0.0%
 >>>>>>>
 >>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 >>>>>>>>
 >>>>>>>> =64=0.1%
 >>>>>>>
 >>>>>>> issued r/w/d: total=0/11171/0, short=0/0/0
 >>>>>>>
 >>>>>>> lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
 >>>>>>> lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
 >>>>>>> rand-write: (groupid=3, jobs=1): err= 0: pid=20446
 >>>>>>> write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
 >>>>>>> slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
 >>>>>>> clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
 >>>>>>> lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
 >>>>>>> bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21,
 >>>>>>> stdev=2000.45
 >>>>>>> cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
 >>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
 >>>>>>>>
 >>>>>>>> =64=99.9%
 >>>>>>>
 >>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 >>>>>>>>
 >>>>>>>> =64=0.0%
 >>>>>>>
 >>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 >>>>>>>>
 >>>>>>>> =64=0.1%
 >>>>>>>
 >>>>>>> issued r/w/d: total=0/52147/0, short=0/0/0
 >>>>>>>
 >>>>>>> lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
 >>>>>>> lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%
 >>>>>>>
 >>>>>>> Run status group 0 (all jobs):
 >>>>>>> READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s,
 >>>>>>> mint=60053msec, maxt=60053msec
 >>>>>>>
 >>>>>>> Run status group 1 (all jobs):
 >>>>>>> READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s,
 >>>>>>> maxb=111425KB/s, mint=60005msec, maxt=60005msec
 >>>>>>>
 >>>>>>> Run status group 2 (all jobs):
 >>>>>>> WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s,
 >>>>>>> mint=60725msec, maxt=60725msec
 >>>>>>>
 >>>>>>> Run status group 3 (all jobs):
 >>>>>>> WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s,
 >>>>>>> mint=60822msec, maxt=60822msec
 >>>>>>>
 >>>>>>> Disk stats (read/write):
 >>>>>>> rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132,
 >>>>>>> in_queue=33434120, util=99.79%
 >>>>
 >>>> --
 >>>> To unsubscribe from this list: send the line "unsubscribe
ceph-devel" in
 >>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
<mailto:majordomo@xxxxxxxxxxxxxxx>
 >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
 >>>>
 >>>
 >>
 >> --
 >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
 >> the body of a message to majordomo@xxxxxxxxxxxxxxx
<mailto:majordomo@xxxxxxxxxxxxxxx>
 >> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Mark Nelson
Performance Engineer
Inktank
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html