On 07/10/2014 03:24 AM, Xabier Elkano wrote:
> On 10/07/14 09:18, Christian Balzer wrote:
>> On Thu, 10 Jul 2014 08:57:56 +0200 Xabier Elkano wrote:
>>
>>> On 09/07/14 16:53, Christian Balzer wrote:
>>>> On Wed, 09 Jul 2014 07:07:50 -0500 Mark Nelson wrote:
>>>>
>>>>> On 07/09/2014 06:52 AM, Xabier Elkano wrote:
>>>>>> On 09/07/14 13:10, Mark Nelson wrote:
>>>>>>> On 07/09/2014 05:57 AM, Xabier Elkano wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I was doing some tests in my cluster with the fio tool: one fio
>>>>>>>> instance with 70 jobs, each job writing 1GB of random data with a
>>>>>>>> 4K block size. I ran this test with three variations:
>>>>>>>>
>>>>>>>> 1- Creating 70 images, 60GB each, in the pool. Using the rbd
>>>>>>>> kernel module, format and mount each image as ext4. Each fio job
>>>>>>>> writes to a separate image/directory. (ioengine=libaio,
>>>>>>>> queue_depth=4, direct=1)
>>>>>>>>
>>>>>>>> IOPS: 6542
>>>>>>>> AVG LAT: 41ms
>>>>>>>>
>>>>>>>> 2- Creating one large 4.2TB image in the pool. Using the rbd
>>>>>>>> kernel module, format and mount the image as ext4. Each fio job
>>>>>>>> writes to a separate file in the same directory.
>>>>>>>> (ioengine=libaio, queue_depth=4, direct=1)
>>>>>>>>
>>>>>>>> IOPS: 5899
>>>>>>>> AVG LAT: 47ms
>>>>>>>>
>>>>>>>> 3- Creating one large 4.2TB image in the pool. Using the rbd
>>>>>>>> ioengine in fio to access the image through librados.
>>>>>>>> (ioengine=rbd, queue_depth=4, direct=1)
>>>>>>>>
>>>>>>>> IOPS: 2638
>>>>>>>> AVG LAT: 96ms
>>>>>>>>
>>>>>>>> Do these results make sense? From a Ceph perspective, is it
>>>>>>>> better to have many small images than one large one? What is the
>>>>>>>> best approach to simulate the workload of 70 VMs?
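The job file itself isn't quoted anywhere in the thread, but from the
parameters above it would look roughly like this sketch (directory, pool
and image names are placeholders; runtime and group_reporting are inferred
from the ~100 s runs and the single reporting group; iodepth is fio's name
for the queue_depth mentioned above):

[global]
rw=randwrite
bs=4k
size=1G
iodepth=4
direct=1
numjobs=70
runtime=100
group_reporting

; cases 1 and 2: libaio against the mounted ext4 filesystem(s); case 1
; would point each job at its own mount (e.g. one job section per mount)
[rand-write-4k]
ioengine=libaio
directory=/mnt/fiotest/vtest0

; case 3: the rbd engine opens the image through librbd/librados instead
; of a mounted filesystem
;[rand-write-4k-rbd]
;ioengine=rbd
;clientname=admin
;pool=rbd
;rbdname=vtest-big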
>>>>>>> I'm not sure the difference between the first two cases is enough
>>>>>>> to say much yet. You may need to repeat the test a couple of times
>>>>>>> to ensure that the difference is more than noise. Having said that,
>>>>>>> if we are seeing an effect, it would be interesting to know what
>>>>>>> the latency distribution is like. Is it consistently worse in the
>>>>>>> 2nd case or do we see higher spikes at specific times?
>>>>>>>
>>>>>> I've repeated the tests with similar results. Each test is done with
>>>>>> a clean new rbd image, first removing any existing images in the
>>>>>> pool and then creating the new image. Between tests I am running:
>>>>>>
>>>>>> echo 3 > /proc/sys/vm/drop_caches
>>>>>>
>>>>>> - In the first test I created 70 images (60G) and mounted them:
>>>>>>
>>>>>> /dev/rbd1 on /mnt/fiotest/vtest0
>>>>>> /dev/rbd2 on /mnt/fiotest/vtest1
>>>>>> ..
>>>>>> /dev/rbd70 on /mnt/fiotest/vtest69
>>>>>>
>>>>>> fio output:
>>>>>>
>>>>>> rand-write-4k: (groupid=0, jobs=70): err= 0: pid=21852: Tue Jul 8 14:52:56 2014
>>>>>>   write: io=2559.5MB, bw=26179KB/s, iops=6542, runt=100116msec
>>>>>>     slat (usec): min=18, max=512646, avg=4002.62, stdev=13754.33
>>>>>>     clat (usec): min=867, max=579715, avg=37581.64, stdev=55954.19
>>>>>>      lat (usec): min=903, max=586022, avg=41957.74, stdev=59276.40
>>>>>>     clat percentiles (msec):
>>>>>>      |  1.00th=[    5],  5.00th=[   10], 10.00th=[   13], 20.00th=[   18],
>>>>>>      | 30.00th=[   21], 40.00th=[   26], 50.00th=[   31], 60.00th=[   34],
>>>>>>      | 70.00th=[   37], 80.00th=[   41], 90.00th=[   48], 95.00th=[   61],
>>>>>>      | 99.00th=[  404], 99.50th=[  445], 99.90th=[  494], 99.95th=[  515],
>>>>>>      | 99.99th=[  553]
>>>>>>     bw (KB /s): min=    0, max=  694, per=1.46%, avg=383.29, stdev=148.01
>>>>>>     lat (usec) : 1000=0.01%
>>>>>>     lat (msec) : 2=0.12%, 4=0.63%, 10=4.82%, 20=22.33%, 50=63.97%
>>>>>>     lat (msec) : 100=5.61%, 250=0.47%, 500=2.01%, 750=0.08%
>>>>>>   cpu          : usr=0.69%, sys=2.57%, ctx=1525021, majf=0, minf=2405
>>>>>>   IO depths    : 1=1.1%, 2=0.6%, 4=335.8%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      issued    : total=r=0/w=655015/d=0, short=r=0/w=0/d=0
>>>>>>      latency   : target=0, window=0, percentile=100.00%, depth=4
>>>>>>
>>>>>> Run status group 0 (all jobs):
>>>>>>   WRITE: io=2559.5MB, aggrb=26178KB/s, minb=26178KB/s,
>>>>>>   maxb=26178KB/s, mint=100116msec, maxt=100116msec
>>>>>>
>>>>>> Disk stats (read/write):
>>>>>>   rbd1: ios=0/2408612, merge=0/979004, ticks=0/39436432,
>>>>>>   in_queue=39459720, util=99.68%
>>>>>>
>>>>>> - In the second test I only created one large image (4.2T):
>>>>>>
>>>>>> /dev/rbd1 on /mnt/fiotest/vtest0 type ext4
>>>>>> (rw,noatime,nodiratime,data=ordered)
>>>>>>
>>>>>> fio output:
>>>>>>
>>>>>> rand-write-4k: (groupid=0, jobs=70): err= 0: pid=8907: Wed Jul 9 13:38:14 2014
>>>>>>   write: io=2264.6MB, bw=23143KB/s, iops=5783, runt=100198msec
>>>>>>     slat (usec): min=0, max=3099.8K, avg=4131.91, stdev=21388.98
>>>>>>     clat (usec): min=850, max=3133.1K, avg=43337.56, stdev=93830.42
>>>>>>      lat (usec): min=930, max=3147.5K, avg=48253.22, stdev=100642.53
>>>>>>     clat percentiles (msec):
>>>>>>      |  1.00th=[    5],  5.00th=[   11], 10.00th=[   14], 20.00th=[   19],
>>>>>>      | 30.00th=[   24], 40.00th=[   29], 50.00th=[   33], 60.00th=[   36],
>>>>>>      | 70.00th=[   39], 80.00th=[   43], 90.00th=[   51], 95.00th=[   68],
>>>>>>      | 99.00th=[  506], 99.50th=[  553], 99.90th=[  717], 99.95th=[  783],
>>>>>>      | 99.99th=[ 3130]
>>>>>>     bw (KB /s): min=    0, max=  680, per=1.54%, avg=355.39, stdev=156.10
>>>>>>     lat (usec) : 1000=0.01%
>>>>>>     lat (msec) : 2=0.12%, 4=0.66%, 10=4.21%, 20=17.82%, 50=66.95%
>>>>>>     lat (msec) : 100=7.34%, 250=0.78%, 500=1.10%, 750=0.99%, 1000=0.02%
>>>>>>     lat (msec) : >=2000=0.04%
>>>>>>   cpu          : usr=0.65%, sys=2.45%, ctx=1434322, majf=0, minf=2399
>>>>>>   IO depths    : 1=0.2%, 2=0.1%, 4=365.4%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      issued    : total=r=0/w=579510/d=0, short=r=0/w=0/d=0
>>>>>>      latency   : target=0, window=0, percentile=100.00%, depth=4
>>>>>>
>>>>>> Run status group 0 (all jobs):
>>>>>>   WRITE: io=2264.6MB, aggrb=23142KB/s, minb=23142KB/s,
>>>>>>   maxb=23142KB/s, mint=100198msec, maxt=100198msec
>>>>>>
>>>>>> Disk stats (read/write):
>>>>>>   rbd1: ios=0/2295106, merge=0/926648, ticks=0/39660664,
>>>>>>   in_queue=39706288, util=99.80%
>>>>>>
>>>>>> It seems that latency is more stable in the first case.
>>>>>>
>>>>> So I guess what comes to mind is that when you have all of the fio
>>>>> processes writing to files on a single file system, there's now
>>>>> another whole layer of locks and contention. Not sure how likely
>>>>> this is, though. Josh might be able to chime in if there's something
>>>>> on the RBD side that could slow this kind of use case down.
>>>>>
>>>>>>> In case 3, do you have multiple fio jobs going or just 1?
>>>>>>
>>>>>> In all three cases, I am using one fio process with NUMJOBS=70.
>>>>>
>>>>> Is RBD cache enabled? It's interesting that librbd is so much slower
>>>>> in this case than kernel RBD for you. If anything I would have
>>>>> expected the opposite.
>>>>>
>>>> Come again?
>>>> User space RBD with the default values will have little to no impact
>>>> in this scenario.
>>>>
>>>> Whereas kernel space RBD will be able to use every last byte of memory
>>>> for page cache, totally ousting user space RBD.
>>>>
>>>> Regards,
>>>>
>>>> Christian
>>>
>>> Hi Christian!
>>>
>>> I am using "direct=1" with fio in all tests; shouldn't that bypass the
>>> page cache?
>>>
>> It should and will do that inside the VM, but the RBD cache is outside
>> of that.
>> In the case of kernel space RBD and writeback caching enabled on the VM
>> (KVM/qemu), the page cache of the HOST is being used for RBD caching,
>> something you should be able to see easily when looking at your memory
>> usage (buffers) when testing with large datasets.
>>
>> Christian
>
> I am using the rbd kernel module in a KVM VM, but the rbd device is
> mounted inside the VM, so the HOST is not aware of the IOPS generated by
> the VM: the VM is talking directly to the OSDs, and the only page cache
> that could be involved is the one inside the VM, which should be
> bypassed.
>
> I think you assumed that I was running the test against a VM disk backed
> by an rbd device on the HOST, but that is not the case. That is why I
> don't understand these differences between the rbd kernel module and
> librados with fio.

The write path and code involved is different, especially if you are
using RBD cache. You might try disabling it just to see what happens. I
also wonder if you might want to test the same filesystem mounted on a
volume attached through qemu/kvm with librbd. Perhaps there is something
else going on that we don't understand yet.
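For the cache experiment, the client-side librbd cache is normally toggled
in the [client] section of ceph.conf on the machine running fio (or on the
VM host). A minimal sketch with the standard option names; the values shown
are just the two obvious settings to compare:

[client]
; rule librbd caching out entirely for the librbd/fio comparison:
rbd cache = false

; or, for the qemu/kvm + librbd test, leave the cache on:
;rbd cache = true
;rbd cache writethrough until flush = true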
> BR,
> Xabier
>
>>> Best Regards,
>>> Xabier
>>>
>>>>>>>> thanks in advance for any help,
>>>>>>>> Xabier
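For the qemu/kvm + librbd comparison suggested above, the image would be
attached to the guest as a virtio disk instead of being mapped with the
kernel client inside it, roughly along these lines (pool/image name and
cache mode are placeholders), with the same mkfs.ext4/mount and libaio job
then run inside the guest:

qemu-system-x86_64 ... \
    -drive file=rbd:rbd/vtest-big,format=raw,if=virtio,cache=writeback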