performance tests

On 10/07/14 09:18, Christian Balzer wrote:
> On Thu, 10 Jul 2014 08:57:56 +0200 Xabier Elkano wrote:
>
>> On 09/07/14 16:53, Christian Balzer wrote:
>>> On Wed, 09 Jul 2014 07:07:50 -0500 Mark Nelson wrote:
>>>
>>>> On 07/09/2014 06:52 AM, Xabier Elkano wrote:
>>>>> On 09/07/14 13:10, Mark Nelson wrote:
>>>>>> On 07/09/2014 05:57 AM, Xabier Elkano wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I was doing some tests in my cluster with the fio tool: one fio
>>>>>>> instance with 70 jobs, each job writing 1GB of random data with a
>>>>>>> 4K block size. I ran this test with 3 variations:
>>>>>>>
>>>>>>> 1- Creating 70 images, 60GB each, in the pool. Using the rbd kernel
>>>>>>> module, format and mount each image as ext4. Each fio job writes
>>>>>>> to a separate image/directory. (ioengine=libaio, queue_depth=4,
>>>>>>> direct=1)
>>>>>>>
>>>>>>>      IOPS: 6542
>>>>>>>      AVG LAT: 41ms
>>>>>>>
>>>>>>> 2- Creating one large 4.2TB image in the pool. Using the rbd kernel
>>>>>>> module, format and mount the image as ext4. Each fio job writes
>>>>>>> to a separate file in the same directory. (ioengine=libaio,
>>>>>>> queue_depth=4, direct=1)
>>>>>>>
>>>>>>>     IOPS: 5899
>>>>>>>     AVG LAT:  47ms
>>>>>>>
>>>>>>> 3- Creating one large 4.2TB image in the pool. Using the rbd ioengine
>>>>>>> in fio to access the image through librados. (ioengine=rbd,
>>>>>>> queue_depth=4, direct=1)
>>>>>>>
>>>>>>>     IOPS: 2638
>>>>>>>     AVG LAT: 96ms
>>>>>>>
>>>>>>> Do these results make sense? From the Ceph perspective, is it better
>>>>>>> to have many small images than one large one? What is the best
>>>>>>> approach to simulate the workload of 70 VMs?
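
(For completeness, the parameters above correspond to a fio job file along
these lines. This is a simplified sketch rather than the exact file: the
runtime is only inferred from the ~100 s runs further down, and in the real
case-1 run the 70 jobs are spread over the 70 mount points listed below
rather than over a single directory.)

    [global]
    ioengine=libaio
    iodepth=4
    direct=1
    rw=randwrite
    bs=4k
    size=1g
    ; runtime inferred from runt=~100s in the output below
    runtime=100
    numjobs=70
    group_reporting

    [rand-write-4k]
    ; case 1: one mount point per job in the real run
    ; case 2: all jobs write files in the single big ext4 mount
    directory=/mnt/fiotest/vtest0
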
>>>>>> I'm not sure the difference between the first two cases is enough to
>>>>>> say much yet.  You may need to repeat the test a couple of times to
>>>>>> ensure that the difference is more than noise.  Having said that, if
>>>>>> we are seeing an effect, it would be interesting to know what the
>>>>>> latency distribution is like.  Is it consistently worse in the 2nd
>>>>>> case or do we see higher spikes at specific times?
>>>>>>
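(One way to look at the shape over time would be to have fio dump per-I/O
latency logs and graph them, e.g. by adding something like the two options
below to the job file. The "rw4k" prefix is just an arbitrary file name
prefix, not anything from the runs above.)

    ; write latency samples to rw4k_*.log files,
    ; averaged over 1-second windows so the logs stay small
    write_lat_log=rw4k
    log_avg_msec=1000
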
>>>>> I've repeated the tests with similar results. Each test is done with
>>>>> a clean new rbd image, first removing any existing images in the
>>>>> pool and then creating the new image. Between tests I am running:
>>>>>
>>>>>   echo 3 > /proc/sys/vm/drop_caches
>>>>>
>>>>> - In the first test I've created 70 images (60G) and mounted them:
>>>>>
>>>>> /dev/rbd1 on /mnt/fiotest/vtest0
>>>>> /dev/rbd2 on /mnt/fiotest/vtest1
>>>>> ..
>>>>> /dev/rbd70 on /mnt/fiotest/vtest69
>>>>>
>>>>> fio output:
>>>>>
>>>>> rand-write-4k: (groupid=0, jobs=70): err= 0: pid=21852: Tue Jul  8 14:52:56 2014
>>>>>   write: io=2559.5MB, bw=26179KB/s, iops=6542, runt=100116msec
>>>>>     slat (usec): min=18, max=512646, avg=4002.62, stdev=13754.33
>>>>>     clat (usec): min=867, max=579715, avg=37581.64, stdev=55954.19
>>>>>      lat (usec): min=903, max=586022, avg=41957.74, stdev=59276.40
>>>>>     clat percentiles (msec):
>>>>>      |  1.00th=[    5],  5.00th=[   10], 10.00th=[   13], 20.00th=[   18],
>>>>>      | 30.00th=[   21], 40.00th=[   26], 50.00th=[   31], 60.00th=[   34],
>>>>>      | 70.00th=[   37], 80.00th=[   41], 90.00th=[   48], 95.00th=[   61],
>>>>>      | 99.00th=[  404], 99.50th=[  445], 99.90th=[  494], 99.95th=[  515],
>>>>>      | 99.99th=[  553]
>>>>>     bw (KB  /s): min=    0, max=  694, per=1.46%, avg=383.29, stdev=148.01
>>>>>     lat (usec) : 1000=0.01%
>>>>>     lat (msec) : 2=0.12%, 4=0.63%, 10=4.82%, 20=22.33%, 50=63.97%
>>>>>     lat (msec) : 100=5.61%, 250=0.47%, 500=2.01%, 750=0.08%
>>>>>   cpu          : usr=0.69%, sys=2.57%, ctx=1525021, majf=0, minf=2405
>>>>>   IO depths    : 1=1.1%, 2=0.6%, 4=335.8%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>      issued    : total=r=0/w=655015/d=0, short=r=0/w=0/d=0
>>>>>      latency   : target=0, window=0, percentile=100.00%, depth=4
>>>>>
>>>>> Run status group 0 (all jobs):
>>>>>   WRITE: io=2559.5MB, aggrb=26178KB/s, minb=26178KB/s, maxb=26178KB/s, mint=100116msec, maxt=100116msec
>>>>>
>>>>> Disk stats (read/write):
>>>>>   rbd1: ios=0/2408612, merge=0/979004, ticks=0/39436432, in_queue=39459720, util=99.68%
>>>>>
>>>>> - In the second test I only created one large image (4.2T)
>>>>>
>>>>> /dev/rbd1 on /mnt/fiotest/vtest0 type ext4
>>>>> (rw,noatime,nodiratime,data=ordered)
>>>>>
>>>>> fio output:
>>>>>
>>>>> rand-write-4k: (groupid=0, jobs=70): err= 0: pid=8907: Wed Jul  9 13:38:14 2014
>>>>>   write: io=2264.6MB, bw=23143KB/s, iops=5783, runt=100198msec
>>>>>     slat (usec): min=0, max=3099.8K, avg=4131.91, stdev=21388.98
>>>>>     clat (usec): min=850, max=3133.1K, avg=43337.56, stdev=93830.42
>>>>>      lat (usec): min=930, max=3147.5K, avg=48253.22, stdev=100642.53
>>>>>     clat percentiles (msec):
>>>>>      |  1.00th=[    5],  5.00th=[   11], 10.00th=[   14], 20.00th=[   19],
>>>>>      | 30.00th=[   24], 40.00th=[   29], 50.00th=[   33], 60.00th=[   36],
>>>>>      | 70.00th=[   39], 80.00th=[   43], 90.00th=[   51], 95.00th=[   68],
>>>>>      | 99.00th=[  506], 99.50th=[  553], 99.90th=[  717], 99.95th=[  783],
>>>>>      | 99.99th=[ 3130]
>>>>>     bw (KB  /s): min=    0, max=  680, per=1.54%, avg=355.39, stdev=156.10
>>>>>     lat (usec) : 1000=0.01%
>>>>>     lat (msec) : 2=0.12%, 4=0.66%, 10=4.21%, 20=17.82%, 50=66.95%
>>>>>     lat (msec) : 100=7.34%, 250=0.78%, 500=1.10%, 750=0.99%, 1000=0.02%
>>>>>     lat (msec) : >=2000=0.04%
>>>>>   cpu          : usr=0.65%, sys=2.45%, ctx=1434322, majf=0, minf=2399
>>>>>   IO depths    : 1=0.2%, 2=0.1%, 4=365.4%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>      issued    : total=r=0/w=579510/d=0, short=r=0/w=0/d=0
>>>>>      latency   : target=0, window=0, percentile=100.00%, depth=4
>>>>>
>>>>> Run status group 0 (all jobs):
>>>>>   WRITE: io=2264.6MB, aggrb=23142KB/s, minb=23142KB/s, maxb=23142KB/s, mint=100198msec, maxt=100198msec
>>>>>
>>>>> Disk stats (read/write):
>>>>>   rbd1: ios=0/2295106, merge=0/926648, ticks=0/39660664, in_queue=39706288, util=99.80%
>>>>>
>>>>>
>>>>>
>>>>> It seems that latency is more stable in the first case.
>>>> So I guess what comes to mind is that when you have all of the fio
>>>> processes writing to files on a single file system, there's now
>>>> another whole layer of locks and contention.  Not sure how likely
>>>> this is, though.  Josh might be able to chime in if there's something
>>>> on the RBD side that could slow this kind of use case down.
>>>>
>>>>>> In case 3, do you have multiple fio jobs going or just 1?
>>>>> In all three cases, I am using one fio process with NUMJOBS=70
>>>> Is RBD cache enabled?  It's interesting that librbd is so much slower
>>>> in this case than kernel RBD for you.  If anything I would have
>>>> expected the opposite.
>>>>
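(For what it's worth, the RBD cache only exists in librbd, so it can only
matter for case 3; the kernel client ignores these settings entirely. A
minimal way to enable it explicitly is a [client] section like the one
below. The size value is just the usual default shown for illustration,
not a tuning recommendation.)

    [client]
    rbd cache = true
    ; 32 MB, the usual default size
    rbd cache size = 33554432
    rbd cache writethrough until flush = true
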
>>> Come again?
>>> User space RBD with the default values will have little to no impact in
>>> this scenario.
>>>
>>> Whereas kernel space RBD will be able to use every last byte of memory
>>> for page cache, totally ousting user space RBD.
>>>
>>> Regards,
>>>
>>> Christian
>> Hi Christian!
>>
>> I am using "direct=1" with fio in all tests; shouldn't this bypass the
>> page cache?
>>
> It should and will do that inside the VM, but the RBD cache is outside of
> that.
> In the case of kernel space RBD and writeback caching enabled on the VM
> (KVM/qemu), the page cache of the HOST is being used for RBD caching,
> something you should be able to see easily when looking at your memory
> usage (buffers) when testing with large datasets. 
>
> Christian
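
(A quick way to check whether a machine's page cache is absorbing the
writes during a run is simply to watch the buffer/dirty counters, e.g.:)

    watch -n1 'grep -E "^(Buffers|Cached|Dirty|Writeback):" /proc/meminfo'

If the cache were involved, Dirty/Writeback would climb while the test
runs; with direct=1 on a device that really bypasses the cache they should
stay roughly flat.
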
I am using the kernel rbd module inside a KVM VM, but the rbd device is
mapped and mounted inside the VM, so the HOST is not aware of the IOPS
generated by the VM: the VM talks directly to the OSDs, and the only page
cache that could be involved is the VM's own, which should be bypassed by
direct=1.

I think you assumed I was running the test against a VM disk backed by an
rbd device on the HOST, but that is not the case. This is why I don't
understand these differences between the rbd kernel client and librados
with fio.
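
In case it is useful for reproducing case 3: the librbd run is driven by
fio's rbd engine, with a job file roughly like the one below. The pool,
image and client names here are placeholders, not the real ones.

    [global]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=fiotest
    iodepth=4
    direct=1
    rw=randwrite
    bs=4k
    size=1g
    runtime=100
    numjobs=70
    group_reporting

    [rand-write-4k]

As far as I understand the rbd engine, each of the 70 jobs opens its own
librbd/librados client session against the same image, which is not quite
the same thing as 70 VMs each writing to its own image.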

BR,
Xabier

>
>> Best Regards,
>> Xabier
>>
>>>>>>> thanks in advance for any help,
>>>>>>> Xabier
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> ceph-users at lists.ceph.com
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-users at lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users at lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>
>


