RE: RBD fio Performance concerns

Hi Han,
      I have a cluster with 8 nodes (each node with 1 SSD as journal and 3 7200 rpm SATA disks as data disks); each OSD consists of 1 SATA disk together with one 30G partition of the SSD, so in total I have 24 OSDs.
	My test method is to start 24 VMs and 24 RBD volumes, paired 1:1, and then run aio-stress as the test tool.
	In total I get ~1000 IOPS per volume for sequential 4K writes and ~60 IOPS for random 4K writes.
	But there is still one strange thing on my cluster that I cannot explain: if I clean the page cache on the Ceph nodes BEFORE the test, performance drops by half. I don't understand why the old page cache should have any connection with write performance.
                                                                                   Xiaoxi
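
(For reference, "cleaning the page cache" on a node usually means the standard Linux drop_caches mechanism; a minimal sketch, run as root on each Ceph node -- the exact procedure used here may differ:

    sync                                # flush dirty pages to disk first
    echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes
)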

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sébastien Han
Sent: 22 November 2012 5:47
To: Mark Nelson
Cc: Alexandre DERUMIER; ceph-devel; Mark Kampe
Subject: Re: RBD fio Performance concerns

Hi Mark,

Well, the most concerning thing is that I have 2 Ceph clusters and both of them show better rand than seq...
I don't have enough background to argue with your assumptions, but I could try to shrink my test platform to a single OSD and see how it performs. We'll keep in touch on that one.

But it seems that Alexandre and I have the same results (more rand than seq); he has (at least) one cluster and I have 2. Thus I'm starting to think it's not an isolated issue.

Is it different for you? Do you usually get more seq IOPS than rand from an RBD?


On Wed, Nov 21, 2012 at 5:34 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
> Responding to my own message. :)
>
> Talked to Sage a bit offline about this.  I think there are two
> opposing forces:
>
> On one hand, random IO may be spreading reads/writes out across more 
> OSDs than sequential IO that presumably would be hitting a single OSD 
> more regularly.
>
> On the other hand, you'd expect that sequential writes would be 
> getting coalesced either at the RBD layer or on the OSD, and that the 
> drive/controller/filesystem underneath the OSD would be doing some 
> kind of readahead or prefetching.
>
> On the third hand, maybe coalescing/prefetching is in fact happening 
> but we are IOP limited by some per-osd limitation.
>
> It could be interesting to do the test with a single OSD and see what 
> happens.
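
A quick way to try that (a sketch only; vstart.sh lives in the src/ directory of a built Ceph tree, and flag details may vary between releases) is to stand up a throwaway one-OSD cluster and point rados bench, or an RBD fio run, at it:

    # throwaway local cluster with 1 mon and 1 osd, cephx on, debug logging
    MON=1 OSD=1 MDS=0 ./vstart.sh -n -x -d

    # 60 seconds of 4K writes against the default 'rbd' pool on that single OSD
    # (-c ceph.conf picks up the vstart-generated config in the current dir)
    ./rados -c ceph.conf -p rbd bench 60 write -b 4096 -t 16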
>
> Mark
>
>
> On 11/21/2012 09:52 AM, Mark Nelson wrote:
>>
>> Hi Guys,
>>
>> I'm late to this thread but thought I'd chime in.  Crazy that you are 
>> getting higher performance with random reads/writes vs sequential!  
>> It would be interesting to see what kind of throughput smalliobench 
>> reports (should be packaged in bobtail) and also see if this behavior 
>> happens with cephfs.  It's still too early in the morning for me 
>> right now to come up with a reasonable explanation for what's going 
>> on.  It might be worth running blktrace and seekwatcher to see what 
>> the io patterns on the underlying disk look like in each case.  Maybe 
>> something unexpected is going on.
>>
>> Mark
>>
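
For the blktrace/seekwatcher suggestion above, a minimal sketch (device path and file names are placeholders; run on the OSD host while the benchmark is active):

    # trace the OSD's data disk for 60 seconds
    blktrace -d /dev/sdb -o osd-seq-4k -w 60

    # turn the trace into seek/throughput/IOPS graphs
    seekwatcher -t osd-seq-4k -o osd-seq-4k.png

Comparing the graphs for the sequential and the random run should show whether the IO actually reaching the disk looks sequential in both cases.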
>> On 11/19/2012 02:57 PM, Sébastien Han wrote:
>>>
>>> Which iodepth did you use for those benchmarks?
>>>
>>>
>>>> I really don't understand why I can't get more rand read iops with 
>>>> 4K block ...
>>>
>>>
>>> Me neither, hope to get some clarification from the Inktank guys. It 
>>> doesn't make any sense to me...
>>> --
>>> Best regards.
>>> Sébastien HAN.
>>>
>>>
>>> On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER 
>>> <aderumier@xxxxxxxxx> wrote:
>>>>>>
>>>>>> @Alexandre: is it the same for you? or do you always get more 
>>>>>> IOPS with seq?
>>>>
>>>>
>>>> rand read 4K : 6000 iops
>>>> seq read 4K : 3500 iops
>>>> seq read 4M : 31 iops (1 gigabit client bandwidth limit)
>>>>
>>>> rand write 4K : 6000 iops (tmpfs journal)
>>>> seq write 4K : 1600 iops
>>>> seq write 4M : 31 iops (1 gigabit client bandwidth limit)
>>>>
>>>>
>>>> I really don't understand why I can't get more rand read iops with 
>>>> 4K block ...
>>>>
>>>> I tried with a high-end CPU for the client; it doesn't change anything.
>>>> The test cluster uses old 8-core E5420s @ 2.50GHz (but CPU usage is
>>>> around 15% on the cluster during the read bench).
>>>>
>>>>
>>>> ----- Original Message -----
>>>>
>>>> From: "Sébastien Han" <han.sebastien@xxxxxxxxx>
>>>> To: "Mark Kampe" <mark.kampe@xxxxxxxxxxx>
>>>> Cc: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>, "ceph-devel"
>>>> <ceph-devel@xxxxxxxxxxxxxxx>
>>>> Sent: Monday, 19 November 2012 19:03:40
>>>> Subject: Re: RBD fio Performance concerns
>>>>
>>>> @Sage, thanks for the info :)
>>>> @Mark:
>>>>
>>>>> If you want to do sequential I/O, you should do it buffered (so 
>>>>> that the writes can be aggregated) or with a 4M block size (very 
>>>>> efficient and avoiding object serialization).
>>>>
>>>>
>>>> The original benchmark was performed with a 4M block size, and as
>>>> you can see I still get more IOPS with rand than seq... I just
>>>> tried 4M without direct I/O, still the same. I can paste the fio
>>>> results if needed.
>>>>
>>>>> We do direct writes for benchmarking, not because it is a 
>>>>> reasonable way to do I/O, but because it bypasses the buffer cache 
>>>>> and enables us to directly measure cluster I/O throughput (which 
>>>>> is what we are trying to optimize). Applications should usually do 
>>>>> buffered I/O, to get the (very significant) benefits of caching 
>>>>> and write aggregation.
>>>>
>>>>
>>>> I know why I use direct I/O. These are synthetic benchmarks, far
>>>> away from a real-life scenario and from how common applications
>>>> work. I'm just trying to see the maximum I/O throughput that I can
>>>> get from my RBD. All my applications use buffered I/O.
>>>>
>>>> @Alexandre: is it the same for you? or do you always get more IOPS 
>>>> with seq?
>>>>
>>>> Thanks to all of you..
>>>>
>>>>
>>>> On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe 
>>>> <mark.kampe@xxxxxxxxxxx>
>>>> wrote:
>>>>>
>>>>> Recall:
>>>>> 1. RBD volumes are striped (4M wide) across RADOS objects
>>>>> 2. distinct writes to a single RADOS object are serialized
>>>>>
>>>>> Your sequential 4K writes are direct, depth=256, so there are (at 
>>>>> all times) 256 writes queued to the same object. All of your 
>>>>> writes are waiting through a very long line, which is adding 
>>>>> horrendous latency.
>>>>>
>>>>> If you want to do sequential I/O, you should do it buffered (so 
>>>>> that the writes can be aggregated) or with a 4M block size (very 
>>>>> efficient and avoiding object serialization).
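
To make that concrete, the two suggested variants could look roughly like this on the fio command line (a sketch, not the job file used in this thread; /dev/rbd1 and the sizes are placeholders):

    # buffered sequential 4K writes: the page cache can aggregate them
    # before they are sent to RADOS
    fio --name=seq-write-buffered --filename=/dev/rbd1 --rw=write \
        --bs=4k --size=1G --ioengine=libaio --iodepth=64 --direct=0

    # direct sequential writes with a 4M block size: each write maps to a
    # whole RADOS object, so object-level serialization no longer stacks
    # hundreds of small writes behind one another
    fio --name=seq-write-4m --filename=/dev/rbd1 --rw=write \
        --bs=4M --size=4G --ioengine=libaio --iodepth=16 --direct=1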
>>>>>
>>>>> We do direct writes for benchmarking, not because it is a 
>>>>> reasonable way to do I/O, but because it bypasses the buffer cache 
>>>>> and enables us to directly measure cluster I/O throughput (which 
>>>>> is what we are trying to optimize). Applications should usually do 
>>>>> buffered I/O, to get the (very significant) benefits of caching 
>>>>> and write aggregation.
>>>>>
>>>>>
>>>>>> That's correct for some of the benchmarks. However, even with 4K
>>>>>> for seq I still get fewer IOPS. See my latest fio results below:
>>>>>>
>>>>>> # fio rbd-bench.fio
>>>>>> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>>>> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>>>> seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>>>> rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>>>> fio 1.59
>>>>>> Starting 4 processes
>>>>>> Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 02m:59s]
>>>>>> seq-read: (groupid=0, jobs=1): err= 0: pid=15096
>>>>>>   read : io=801892KB, bw=13353KB/s, iops=3338, runt= 60053msec
>>>>>>     slat (usec): min=8, max=45921, avg=296.69, stdev=1584.90
>>>>>>     clat (msec): min=18, max=133, avg=76.37, stdev=16.63
>>>>>>      lat (msec): min=18, max=133, avg=76.67, stdev=16.62
>>>>>>     bw (KB/s) : min=0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
>>>>>>   cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
>>>>>>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>>>>>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>>      issued r/w/d: total=200473/0/0, short=0/0/0
>>>>>>      lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
>>>>>> rand-read: (groupid=1, jobs=1): err= 0: pid=16846
>>>>>>   read : io=6376.4MB, bw=108814KB/s, iops=27203, runt= 60005msec
>>>>>>     slat (usec): min=8, max=12723, avg=33.54, stdev=59.87
>>>>>>     clat (usec): min=4642, max=55760, avg=9374.10, stdev=970.40
>>>>>>      lat (usec): min=4671, max=55788, avg=9408.00, stdev=971.21
>>>>>>     bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62
>>>>>>   cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
>>>>>>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>>>>>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>>      issued r/w/d: total=1632349/0/0, short=0/0/0
>>>>>>      lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
>>>>>> seq-write: (groupid=2, jobs=1): err= 0: pid=18653
>>>>>>   write: io=44684KB, bw=753502 B/s, iops=183, runt= 60725msec
>>>>>>     slat (usec): min=8, max=1246.8K, avg=5402.76, stdev=40024.97
>>>>>>     clat (msec): min=25, max=4868, avg=1384.22, stdev=470.19
>>>>>>      lat (msec): min=25, max=4868, avg=1389.62, stdev=470.17
>>>>>>     bw (KB/s) : min=7, max=2165, per=104.03%, avg=764.65, stdev=353.97
>>>>>>   cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
>>>>>>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
>>>>>>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>>      issued r/w/d: total=0/11171/0, short=0/0/0
>>>>>>      lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
>>>>>>      lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
>>>>>> rand-write: (groupid=3, jobs=1): err= 0: pid=20446
>>>>>>   write: io=208588KB, bw=3429.5KB/s, iops=857, runt= 60822msec
>>>>>>     slat (usec): min=10, max=1693.9K, avg=1148.15, stdev=15210.37
>>>>>>     clat (msec): min=22, max=5639, avg=297.37, stdev=430.27
>>>>>>      lat (msec): min=22, max=5639, avg=298.52, stdev=430.84
>>>>>>     bw (KB/s) : min=0, max=7728, per=31.44%, avg=1078.21, stdev=2000.45
>>>>>>   cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
>>>>>>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
>>>>>>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>>      issued r/w/d: total=0/52147/0, short=0/0/0
>>>>>>      lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
>>>>>>      lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%
>>>>>>
>>>>>> Run status group 0 (all jobs):
>>>>>>   READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s, mint=60053msec, maxt=60053msec
>>>>>>
>>>>>> Run status group 1 (all jobs):
>>>>>>   READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s, maxb=111425KB/s, mint=60005msec, maxt=60005msec
>>>>>>
>>>>>> Run status group 2 (all jobs):
>>>>>>   WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s, mint=60725msec, maxt=60725msec
>>>>>>
>>>>>> Run status group 3 (all jobs):
>>>>>>   WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s, mint=60822msec, maxt=60822msec
>>>>>>
>>>>>> Disk stats (read/write):
>>>>>>   rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132, in_queue=33434120, util=99.79%
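
As a sanity check on the serialization explanation above: with iodepth=256 kept full and an average completion latency of ~1.39 s in the seq-write group, Little's law predicts

    256 outstanding IOs / 1.39 s average latency ≈ 184 IOPS

which matches the reported iops=183 almost exactly, i.e. the 4K sequential writes spend nearly all of their time waiting in a queue rather than on the disk.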
>>>
>>>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at  http://vger.kernel.org/majordomo-info.html