Re: RBD fio Performance concerns

On Friday, November 23, 2012 at 5:36 AM, Chen, Xiaoxi wrote:
> Hi Han,
> I have a cluster of 8 nodes (each node has 1 SSD as a journal and 3 7200 rpm SATA disks as data disks); each OSD consists of 1 SATA disk plus one 30G partition from the SSD, so in total I have 24 OSDs.
> My test method is to start 24 VMs and 24 RBD volumes, with the VMs and volumes paired 1:1. Aiostress is then used as the test tool.
> In total, I get ~1000 IOPS per volume for sequential 4K writes and ~60 IOPS for random 4K writes.
> But there is still something strange on my cluster that I cannot explain: if I clean the page cache on the Ceph nodes BEFORE the test, performance drops by half. I don't understand why the old page cache has any connection with write performance.
> Xiaoxi

That's because when you drop the page cache you're also clearing all of the OSD's data directory inodes from cache, so it has to do a bunch of random disk seeks to read them back in. Normally they'd be in memory, since there aren't that many of them and they're accessed pretty frequently. ;)
-Greg
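
For reference, "cleaning the page cache" here is typically the Linux drop_caches sysctl, which evicts dentries and inodes along with page data. A minimal sketch of the drop, and of re-warming the metadata cache before a run (the OSD data path is an assumption; substitute your own layout):

    # flush dirty pages, then drop page cache plus dentries and inodes
    sync
    echo 3 > /proc/sys/vm/drop_caches

    # re-warm the dentry/inode cache by stat()ing everything under the
    # OSD data directories (hypothetical path -- adjust to your deployment)
    find /var/lib/ceph/osd -xdev -print0 | xargs -0 stat > /dev/null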

  
>  
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sébastien Han
> Sent: November 22, 2012 5:47
> To: Mark Nelson
> Cc: Alexandre DERUMIER; ceph-devel; Mark Kampe
> Subject: Re: RBD fio Performance concerns
>  
> Hi Mark,
>  
> Well the most concerning thing is that I have 2 Ceph clusters and both of them show better rand than seq...
> I don't have enough background to argue about your assumptions, but I could try to shrink my test platform to a single OSD and see how it performs. We'll keep in touch on that one.
>  
> But it seems that Alexandre and I have the same results (more rand than seq); he has (at least) one cluster and I have 2. Thus I'm starting to think this is not an isolated issue.
>  
> Is it different for you? Do you usually get more seq IOPS from an RBD than rand?
>  
>  
> On Wed, Nov 21, 2012 at 5:34 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx (mailto:mark.nelson@xxxxxxxxxxx)> wrote:
> > Responding to my own message. :)
> >  
> > Talked to Sage a bit offline about this. I think there are two
> > opposing forces:
> >  
> > On one hand, random IO may be spreading reads/writes out across more  
> > OSDs than sequential IO that presumably would be hitting a single OSD  
> > more regularly.
> >  
> > On the other hand, you'd expect that sequential writes would be  
> > getting coalesced either at the RBD layer or on the OSD, and that the  
> > drive/controller/filesystem underneath the OSD would be doing some  
> > kind of readahead or prefetching.
> >  
> > On the third hand, maybe coalescing/prefetching is in fact happening  
> > but we are IOP limited by some per-osd limitation.
> >  
> > It could be interesting to do the test with a single OSD and see what  
> > happens.
> >  
> > Mark
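
A quick way to run the single-OSD experiment Mark suggests is a throwaway vstart.sh cluster from a built Ceph source tree. A sketch, assuming the source-tree layout (variable names, config path, and the default pool may differ by version):

    # one monitor, one OSD, no MDS -- a fresh dev cluster in the source tree
    CEPH_NUM_MON=1 CEPH_NUM_OSD=1 CEPH_NUM_MDS=0 ./vstart.sh -n -x

    # 60 seconds of 4K writes at queue depth 256 against the lone OSD
    ./rados -c ./ceph.conf -p rbd bench 60 write -b 4096 -t 256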
> >  
> >  
> > On 11/21/2012 09:52 AM, Mark Nelson wrote:
> > >  
> > > Hi Guys,
> > >  
> > > I'm late to this thread but thought I'd chime in. Crazy that you are  
> > > getting higher performance with random reads/writes vs sequential!  
> > > It would be interesting to see what kind of throughput smalliobench  
> > > reports (should be packaged in bobtail) and also see if this behavior  
> > > happens with cephfs. It's still too early in the morning for me  
> > > right now to come up with a reasonable explanation for what's going  
> > > on. It might be worth running blktrace and seekwatcher to see what  
> > > the io patterns on the underlying disk look like in each case. Maybe  
> > > something unexpected is going on.
> > >  
> > > Mark
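
For the blktrace/seekwatcher run Mark mentions, something along these lines captures one OSD's data disk during a test (the device and output names are placeholders):

    # trace 60 seconds of block-level IO on the disk backing one OSD
    blktrace -d /dev/sdb -o osd-disk -w 60

    # render the trace as a seek/throughput chart
    seekwatcher -t osd-disk -o osd-disk.png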
> > >  
> > > On 11/19/2012 02:57 PM, Sébastien Han wrote:
> > > >  
> > > > Which iodepth did you use for those benchmarks?
> > > >  
> > > >  
> > > > > I really don't understand why I can't get more rand read iops with  
> > > > > 4K block ...
> > > >  
> > > >  
> > > >  
> > > >  
> > > > Me neither, hope to get some clarification from the Inktank guys. It  
> > > > doesn't make any sense to me...
> > > > --
> > > > Bien cordialement.
> > > > Sébastien HAN.
> > > >  
> > > >  
> > > > On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER  
> > > > <aderumier@xxxxxxxxx (mailto:aderumier@xxxxxxxxx)> wrote:
> > > > > > >  
> > > > > > > @Alexandre: is it the same for you? or do you always get more  
> > > > > > > IOPS with seq?
> > > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > > rand read 4K : 6000 iops
> > > > > seq read 4K : 3500 iops
> > > > > seq read 4M : 31 iops (1 gigabit client bandwidth limit)
> > > > >
> > > > > rand write 4K : 6000 iops (tmpfs journal)
> > > > > seq write 4K : 1600 iops
> > > > > seq write 4M : 31 iops (1 gigabit client bandwidth limit)
> > > > >  
> > > > >  
> > > > > I really don't understand why I can't get more rand read iops with  
> > > > > 4K block ...
> > > > >  
> > > > > I tried a high-end CPU for the client; it doesn't change anything.
> > > > > The test cluster uses old 8-core E5420s @ 2.50GHz (but CPU usage is
> > > > > around 15% on the cluster during the read bench).
> > > > >  
> > > > >  
> > > > > ----- Original Message -----
> > > > >
> > > > > From: "Sébastien Han" <han.sebastien@xxxxxxxxx (mailto:han.sebastien@xxxxxxxxx)>
> > > > > To: "Mark Kampe" <mark.kampe@xxxxxxxxxxx (mailto:mark.kampe@xxxxxxxxxxx)>
> > > > > Cc: "Alexandre DERUMIER" <aderumier@xxxxxxxxx (mailto:aderumier@xxxxxxxxx)>, "ceph-devel"
> > > > > <ceph-devel@xxxxxxxxxxxxxxx (mailto:ceph-devel@xxxxxxxxxxxxxxx)>
> > > > > Sent: Monday, November 19, 2012 19:03:40
> > > > > Subject: Re: RBD fio Performance concerns
> > > > >  
> > > > > @Sage, thanks for the info :)
> > > > > @Mark:
> > > > >  
> > > > > > If you want to do sequential I/O, you should do it buffered (so  
> > > > > > that the writes can be aggregated) or with a 4M block size (very  
> > > > > > efficient and avoiding object serialization).
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > > The original benchmark was performed with a 4M block size, and as
> > > > > you can see I still get more IOPS with rand than seq... I just
> > > > > tried 4M without direct I/O: still the same. I can post the fio
> > > > > results if needed.
> > > > >  
> > > > > > We do direct writes for benchmarking, not because it is a  
> > > > > > reasonable way to do I/O, but because it bypasses the buffer cache  
> > > > > > and enables us to directly measure cluster I/O throughput (which  
> > > > > > is what we are trying to optimize). Applications should usually do  
> > > > > > buffered I/O, to get the (very significant) benefits of caching  
> > > > > > and write aggregation.
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > > I know why I use direct I/O. These are synthetic benchmarks, far
> > > > > from a real-life scenario and from how common applications work. I
> > > > > just want to see the maximum I/O throughput I can get from my
> > > > > RBD. All my applications use buffered I/O.
> > > > >  
> > > > > @Alexandre: is it the same for you? or do you always get more IOPS  
> > > > > with seq?
> > > > >  
> > > > > Thanks to all of you.
> > > > >  
> > > > >  
> > > > > On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe  
> > > > > <mark.kampe@xxxxxxxxxxx (mailto:mark.kampe@xxxxxxxxxxx)>
> > > > > wrote:
> > > > > >  
> > > > > > Recall:
> > > > > > 1. RBD volumes are striped (4M wide) across RADOS objects
> > > > > > 2. distinct writes to a single RADOS object are serialized
> > > > > >  
> > > > > > Your sequential 4K writes are direct, depth=256, so there are (at  
> > > > > > all times) 256 writes queued to the same object. All of your  
> > > > > > writes are waiting through a very long line, which is adding  
> > > > > > horrendous latency.
> > > > > >  
> > > > > > If you want to do sequential I/O, you should do it buffered (so  
> > > > > > that the writes can be aggregated) or with a 4M block size (very  
> > > > > > efficient and avoiding object serialization).
> > > > > >  
> > > > > > We do direct writes for benchmarking, not because it is a  
> > > > > > reasonable way to do I/O, but because it bypasses the buffer cache  
> > > > > > and enables us to directly measure cluster I/O throughput (which  
> > > > > > is what we are trying to optimize). Applications should usually do  
> > > > > > buffered I/O, to get the (very significant) benefits of caching  
> > > > > > and write aggregation.
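
Mark's two suggested modes map directly onto fio options; a sketch of both, with the device path as a placeholder:

    # buffered sequential writes: the page cache can aggregate the 4K IOs
    fio --name=seq-buffered --filename=/dev/rbd1 --rw=write --bs=4k --size=1g

    # direct writes at the 4M stripe width: one IO per RADOS object,
    # avoiding the per-object serialization penalty
    fio --name=seq-4m --filename=/dev/rbd1 --rw=write --bs=4m --direct=1 --size=1g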
> > > > > >  
> > > > > >  
> > > > > > > That's correct for some of the benchmarks. However even with 4K  
> > > > > > > for seq, I still get less IOPS. See below my last fio:
> > > > > > >  
> > > > > > > # fio rbd-bench.fio
> > > > > > > seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
> > > > > > > rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
> > > > > > > seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
> > > > > > > rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
> > > > > > > fio 1.59
> > > > > > > Starting 4 processes
> > > > > > > Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 02m:59s]
> > > > > > > seq-read: (groupid=0, jobs=1): err= 0: pid=15096
> > > > > > >   read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
> > > > > > >     slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
> > > > > > >     clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
> > > > > > >      lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
> > > > > > >     bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
> > > > > > >   cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
> > > > > > >   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> > > > > > >      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > > > >      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> > > > > > >      issued r/w/d: total=200473/0/0, short=0/0/0
> > > > > > >      lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
> > > > > > > rand-read: (groupid=1, jobs=1): err= 0: pid=16846
> > > > > > >   read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
> > > > > > >     slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
> > > > > > >     clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
> > > > > > >      lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
> > > > > > >     bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62
> > > > > > >   cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
> > > > > > >   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> > > > > > >      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > > > >      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> > > > > > >      issued r/w/d: total=1632349/0/0, short=0/0/0
> > > > > > >      lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
> > > > > > > seq-write: (groupid=2, jobs=1): err= 0: pid=18653
> > > > > > >   write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
> > > > > > >     slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
> > > > > > >     clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
> > > > > > >      lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
> > > > > > >     bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65, stdev=353.97
> > > > > > >   cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
> > > > > > >   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
> > > > > > >      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > > > >      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> > > > > > >      issued r/w/d: total=0/11171/0, short=0/0/0
> > > > > > >      lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
> > > > > > >      lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
> > > > > > > rand-write: (groupid=3, jobs=1): err= 0: pid=20446
> > > > > > >   write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
> > > > > > >     slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
> > > > > > >     clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
> > > > > > >      lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
> > > > > > >     bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21, stdev=2000.45
> > > > > > >   cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
> > > > > > >   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
> > > > > > >      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > > > >      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> > > > > > >      issued r/w/d: total=0/52147/0, short=0/0/0
> > > > > > >      lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
> > > > > > >      lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%
> > > > > > >
> > > > > > > Run status group 0 (all jobs):
> > > > > > >    READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s, mint=60053msec, maxt=60053msec
> > > > > > >
> > > > > > > Run status group 1 (all jobs):
> > > > > > >    READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s, maxb=111425KB/s, mint=60005msec, maxt=60005msec
> > > > > > >
> > > > > > > Run status group 2 (all jobs):
> > > > > > >   WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s, mint=60725msec, maxt=60725msec
> > > > > > >
> > > > > > > Run status group 3 (all jobs):
> > > > > > >   WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s, mint=60822msec, maxt=60822msec
> > > > > > >
> > > > > > > Disk stats (read/write):
> > > > > > >   rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132, in_queue=33434120, util=99.79%
> > > > > >  
> > > > >  
> > > >  
> > > >  
> > > >  
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe  
> > > > ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx (mailto:majordomo@xxxxxxxxxxxxxxx)  
> > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >  
> >  
>  
>  
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

