On 07/10/2014 03:24 AM, Xabier Elkano wrote:
> On 10/07/14 09:18, Christian Balzer wrote:
>> On Thu, 10 Jul 2014 08:57:56 +0200 Xabier Elkano wrote:
>>
>>> On 09/07/14 16:53, Christian Balzer wrote:
>>>> On Wed, 09 Jul 2014 07:07:50 -0500 Mark Nelson wrote:
>>>>
>>>>> On 07/09/2014 06:52 AM, Xabier Elkano wrote:
>>>>>> On 09/07/14 13:10, Mark Nelson wrote:
>>>>>>> On 07/09/2014 05:57 AM, Xabier Elkano wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I was doing some tests in my cluster with the fio tool: one fio
>>>>>>>> instance with 70 jobs, each job writing 1GB of random data with a
>>>>>>>> 4K block size. I ran this test with three variations:
>>>>>>>>
>>>>>>>> 1- Creating 70 images, 60GB each, in the pool. Using the rbd
>>>>>>>> kernel module, format and mount each image as ext4. Each fio job
>>>>>>>> writes to a separate image/directory. (ioengine=libaio,
>>>>>>>> queue_depth=4, direct=1)
>>>>>>>>
>>>>>>>> IOPS: 6542
>>>>>>>> AVG LAT: 41ms
>>>>>>>>
>>>>>>>> 2- Creating one large 4.2TB image in the pool. Using the rbd
>>>>>>>> kernel module, format and mount the image as ext4. Each fio job
>>>>>>>> writes to a separate file in the same directory.
>>>>>>>> (ioengine=libaio, queue_depth=4, direct=1)
>>>>>>>>
>>>>>>>> IOPS: 5899
>>>>>>>> AVG LAT: 47ms
>>>>>>>>
>>>>>>>> 3- Creating one large 4.2TB image in the pool. Using the rbd
>>>>>>>> ioengine in fio to access the image through librados.
>>>>>>>> (ioengine=rbd, queue_depth=4, direct=1)
>>>>>>>>
>>>>>>>> IOPS: 2638
>>>>>>>> AVG LAT: 96ms
>>>>>>>>
>>>>>>>> Do these results make sense? From a Ceph perspective, is it
>>>>>>>> better to have many small images than one large one? What is the
>>>>>>>> best approach to simulate the workload of 70 VMs?
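The job file itself isn't quoted anywhere in the thread, but from the
parameters above it would look roughly like this sketch (directory, pool
and image names are placeholders; runtime and group_reporting are inferred
from the ~100 s runs and the single reporting group; iodepth is fio's name
for the queue_depth mentioned above):

[global]
rw=randwrite
bs=4k
size=1G
iodepth=4
direct=1
numjobs=70
runtime=100
group_reporting

; cases 1 and 2: libaio against the mounted ext4 filesystem(s); case 1
; would point each job at its own mount (e.g. one job section per mount)
[rand-write-4k]
ioengine=libaio
directory=/mnt/fiotest/vtest0

; case 3: the rbd engine opens the image through librbd/librados instead
; of a mounted filesystem
;[rand-write-4k-rbd]
;ioengine=rbd
;clientname=admin
;pool=rbd
;rbdname=vtest-big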
>>>>>>> I'm not sure the difference between the first two cases is enough
>>>>>>> to say much yet. You may need to repeat the test a couple of times
>>>>>>> to ensure that the difference is more than noise. Having said that,
>>>>>>> if we are seeing an effect, it would be interesting to know what
>>>>>>> the latency distribution is like. Is it consistently worse in the
>>>>>>> 2nd case or do we see higher spikes at specific times?
>>>>>>>
>>>>>> I've repeated the tests with similar results. Each test is done with
>>>>>> a clean new rbd image, first removing any existing images in the
>>>>>> pool and then creating the new image. Between tests I am running:
>>>>>>
>>>>>> echo 3 > /proc/sys/vm/drop_caches
>>>>>>
>>>>>> - In the first test I created 70 images (60G) and mounted them:
>>>>>>
>>>>>> /dev/rbd1 on /mnt/fiotest/vtest0
>>>>>> /dev/rbd2 on /mnt/fiotest/vtest1
>>>>>> ..
>>>>>> /dev/rbd70 on /mnt/fiotest/vtest69
>>>>>>
>>>>>> fio output:
>>>>>>
>>>>>> rand-write-4k: (groupid=0, jobs=70): err= 0: pid=21852: Tue Jul 8 14:52:56 2014
>>>>>>   write: io=2559.5MB, bw=26179KB/s, iops=6542, runt=100116msec
>>>>>>     slat (usec): min=18, max=512646, avg=4002.62, stdev=13754.33
>>>>>>     clat (usec): min=867, max=579715, avg=37581.64, stdev=55954.19
>>>>>>      lat (usec): min=903, max=586022, avg=41957.74, stdev=59276.40
>>>>>>     clat percentiles (msec):
>>>>>>      |  1.00th=[    5],  5.00th=[   10], 10.00th=[   13], 20.00th=[   18],
>>>>>>      | 30.00th=[   21], 40.00th=[   26], 50.00th=[   31], 60.00th=[   34],
>>>>>>      | 70.00th=[   37], 80.00th=[   41], 90.00th=[   48], 95.00th=[   61],
>>>>>>      | 99.00th=[  404], 99.50th=[  445], 99.90th=[  494], 99.95th=[  515],
>>>>>>      | 99.99th=[  553]
>>>>>>     bw (KB /s): min=    0, max=  694, per=1.46%, avg=383.29, stdev=148.01
>>>>>>     lat (usec) : 1000=0.01%
>>>>>>     lat (msec) : 2=0.12%, 4=0.63%, 10=4.82%, 20=22.33%, 50=63.97%
>>>>>>     lat (msec) : 100=5.61%, 250=0.47%, 500=2.01%, 750=0.08%
>>>>>>   cpu          : usr=0.69%, sys=2.57%, ctx=1525021, majf=0, minf=2405
>>>>>>   IO depths    : 1=1.1%, 2=0.6%, 4=335.8%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      issued    : total=r=0/w=655015/d=0, short=r=0/w=0/d=0
>>>>>>      latency   : target=0, window=0, percentile=100.00%, depth=4
>>>>>>
>>>>>> Run status group 0 (all jobs):
>>>>>>   WRITE: io=2559.5MB, aggrb=26178KB/s, minb=26178KB/s,
>>>>>>   maxb=26178KB/s, mint=100116msec, maxt=100116msec
>>>>>>
>>>>>> Disk stats (read/write):
>>>>>>   rbd1: ios=0/2408612, merge=0/979004, ticks=0/39436432,
>>>>>>   in_queue=39459720, util=99.68%
>>>>>>
>>>>>> - In the second test I only created one large image (4.2T):
>>>>>>
>>>>>> /dev/rbd1 on /mnt/fiotest/vtest0 type ext4
>>>>>> (rw,noatime,nodiratime,data=ordered)
>>>>>>
>>>>>> fio output:
>>>>>>
>>>>>> rand-write-4k: (groupid=0, jobs=70): err= 0: pid=8907: Wed Jul 9 13:38:14 2014
>>>>>>   write: io=2264.6MB, bw=23143KB/s, iops=5783, runt=100198msec
>>>>>>     slat (usec): min=0, max=3099.8K, avg=4131.91, stdev=21388.98
>>>>>>     clat (usec): min=850, max=3133.1K, avg=43337.56, stdev=93830.42
>>>>>>      lat (usec): min=930, max=3147.5K, avg=48253.22, stdev=100642.53
>>>>>>     clat percentiles (msec):
>>>>>>      |  1.00th=[    5],  5.00th=[   11], 10.00th=[   14], 20.00th=[   19],
>>>>>>      | 30.00th=[   24], 40.00th=[   29], 50.00th=[   33], 60.00th=[   36],
>>>>>>      | 70.00th=[   39], 80.00th=[   43], 90.00th=[   51], 95.00th=[   68],
>>>>>>      | 99.00th=[  506], 99.50th=[  553], 99.90th=[  717], 99.95th=[  783],
>>>>>>      | 99.99th=[ 3130]
>>>>>>     bw (KB /s): min=    0, max=  680, per=1.54%, avg=355.39, stdev=156.10
>>>>>>     lat (usec) : 1000=0.01%
>>>>>>     lat (msec) : 2=0.12%, 4=0.66%, 10=4.21%, 20=17.82%, 50=66.95%
>>>>>>     lat (msec) : 100=7.34%, 250=0.78%, 500=1.10%, 750=0.99%, 1000=0.02%
>>>>>>     lat (msec) : >=2000=0.04%
>>>>>>   cpu          : usr=0.65%, sys=2.45%, ctx=1434322, majf=0, minf=2399
>>>>>>   IO depths    : 1=0.2%, 2=0.1%, 4=365.4%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      issued    : total=r=0/w=579510/d=0, short=r=0/w=0/d=0
>>>>>>      latency   : target=0, window=0, percentile=100.00%, depth=4
>>>>>>
>>>>>> Run status group 0 (all jobs):
>>>>>>   WRITE: io=2264.6MB, aggrb=23142KB/s, minb=23142KB/s,
>>>>>>   maxb=23142KB/s, mint=100198msec, maxt=100198msec
>>>>>>
>>>>>> Disk stats (read/write):
>>>>>>   rbd1: ios=0/2295106, merge=0/926648, ticks=0/39660664,
>>>>>>   in_queue=39706288, util=99.80%
>>>>>>
>>>>>> It seems that latency is more stable in the first case.
>>>>>>
>>>>> So I guess what comes to mind is that when you have all of the fio
>>>>> processes writing to files on a single file system, there's now
>>>>> another whole layer of locks and contention. Not sure how likely
>>>>> this is, though. Josh might be able to chime in if there's something
>>>>> on the RBD side that could slow this kind of use case down.
>>>>>
>>>>>>> In case 3, do you have multiple fio jobs going or just 1?
>>>>>>
>>>>>> In all three cases, I am using one fio process with NUMJOBS=70.
>>>>>
>>>>> Is RBD cache enabled? It's interesting that librbd is so much slower
>>>>> in this case than kernel RBD for you. If anything I would have
>>>>> expected the opposite.
>>>>>
>>>> Come again?
>>>> User space RBD with the default values will have little to no impact
>>>> in this scenario.
>>>>
>>>> Whereas kernel space RBD will be able to use every last byte of memory
>>>> for page cache, totally ousting user space RBD.
>>>>
>>>> Regards,
>>>>
>>>> Christian
>>>
>>> Hi Christian!
>>>
>>> I am using "direct=1" with fio in all tests; shouldn't that bypass the
>>> page cache?
>>>
>> It should and will do that inside the VM, but the RBD cache is outside
>> of that.
>> In the case of kernel space RBD and writeback caching enabled on the VM
>> (KVM/qemu), the page cache of the HOST is being used for RBD caching,
>> something you should be able to see easily when looking at your memory
>> usage (buffers) when testing with large datasets.
>>
>> Christian
>
> I am using the rbd kernel module in a KVM VM, but the rbd device is
> mounted inside the VM, so the HOST is not aware of the IOPS generated by
> the VM: the VM is talking directly to the OSDs, and the only page cache
> that could be involved is the one inside the VM, which should be
> bypassed.
>
> I think you assumed that I was running the test against a VM disk backed
> by an rbd device on the HOST, but that is not the case. That is why I
> don't understand these differences between the rbd kernel module and
> librados with fio.

The write path and code involved is different, especially if you are
using RBD cache. You might try disabling it just to see what happens. I
also wonder if you might want to test the same filesystem mounted on a
volume attached through qemu/kvm with librbd. Perhaps there is something
else going on that we don't understand yet.
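For the cache experiment, the client-side librbd cache is normally toggled
in the [client] section of ceph.conf on the machine running fio (or on the
VM host). A minimal sketch with the standard option names; the values shown
are just the two obvious settings to compare:

[client]
; rule librbd caching out entirely for the librbd/fio comparison:
rbd cache = false

; or, for the qemu/kvm + librbd test, leave the cache on:
;rbd cache = true
;rbd cache writethrough until flush = true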
> BR,
> Xabier
>
>>> Best Regards,
>>> Xabier
>>>
>>>>>>>> thanks in advance for any help,
>>>>>>>> Xabier
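For the qemu/kvm + librbd comparison suggested above, the image would be
attached to the guest as a virtio disk instead of being mapped with the
kernel client inside it, roughly along these lines (pool/image name and
cache mode are placeholders), with the same mkfs.ext4/mount and libaio job
then run inside the guest:

qemu-system-x86_64 ... \
    -drive file=rbd:rbd/vtest-big,format=raw,if=virtio,cache=writeback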