On 09/07/14 14:07, Mark Nelson wrote:
> On 07/09/2014 06:52 AM, Xabier Elkano wrote:
>> On 09/07/14 13:10, Mark Nelson wrote:
>>> On 07/09/2014 05:57 AM, Xabier Elkano wrote:
>>>>
>>>> Hi,
>>>>
>>>> I was doing some tests in my cluster with the fio tool: one fio
>>>> instance with 70 jobs, each job writing 1GB of random data with a 4K
>>>> block size. I ran this test with 3 variations:
>>>>
>>>> 1- Create 70 images, 60GB each, in the pool. Using the rbd kernel
>>>> module, format and mount each image as ext4. Each fio job writes to
>>>> a separate image/directory. (ioengine=libaio, queue_depth=4, direct=1)
>>>>
>>>> IOPS: 6542
>>>> AVG LAT: 41ms
>>>>
>>>> 2- Create 1 large image, 4.2TB, in the pool. Using the rbd kernel
>>>> module, format and mount the image as ext4. Each fio job writes to a
>>>> separate file in the same directory. (ioengine=libaio, queue_depth=4,
>>>> direct=1)
>>>>
>>>> IOPS: 5899
>>>> AVG LAT: 47ms
>>>>
>>>> 3- Create 1 large image, 4.2TB, in the pool. Use the rbd ioengine in
>>>> fio to access the image through librados. (ioengine=rbd,
>>>> queue_depth=4, direct=1)
>>>>
>>>> IOPS: 2638
>>>> AVG LAT: 96ms
>>>>
>>>> Do these results make sense? From the Ceph perspective, is it better
>>>> to have many small images than one large one? What is the best
>>>> approach to simulate the workload of 70 VMs?
>>>
>>> I'm not sure the difference between the first two cases is enough to
>>> say much yet. You may need to repeat the test a couple of times to
>>> ensure that the difference is more than noise. Having said that, if
>>> we are seeing an effect, it would be interesting to know what the
>>> latency distribution is like. Is it consistently worse in the 2nd
>>> case, or do we see higher spikes at specific times?
>>>
>> I've repeated the tests with similar results. Each test is done with a
>> clean new rbd image, first removing any existing images in the pool and
>> then creating the new one. Between tests I run:
>>
>> echo 3 > /proc/sys/vm/drop_caches
>>
>> - In the first test I created 70 images (60G) and mounted them:
>>
>> /dev/rbd1 on /mnt/fiotest/vtest0
>> /dev/rbd2 on /mnt/fiotest/vtest1
>> ..
>> /dev/rbd70 on /mnt/fiotest/vtest69
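>>
>> The images were created and mounted with a loop roughly like this (a
>> minimal sketch; the pool name "fiopool" is illustrative, and the
>> device numbering is simply what I observed on this host):
>>
>> for i in $(seq 0 69); do
>>     rbd create fiopool/vtest$i --size 61440    # 60GB; --size is in MB
>>     rbd map fiopool/vtest$i                    # verify with "rbd showmapped"
>>     mkfs.ext4 -q /dev/rbd$((i + 1))            # images mapped as rbd1..rbd70
>>     mkdir -p /mnt/fiotest/vtest$i
>>     mount -o noatime,nodiratime /dev/rbd$((i + 1)) /mnt/fiotest/vtest$i
>> done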
>>
>> fio output:
>>
>> rand-write-4k: (groupid=0, jobs=70): err= 0: pid=21852: Tue Jul  8 14:52:56 2014
>>   write: io=2559.5MB, bw=26179KB/s, iops=6542, runt=100116msec
>>     slat (usec): min=18, max=512646, avg=4002.62, stdev=13754.33
>>     clat (usec): min=867, max=579715, avg=37581.64, stdev=55954.19
>>      lat (usec): min=903, max=586022, avg=41957.74, stdev=59276.40
>>     clat percentiles (msec):
>>      |  1.00th=[    5],  5.00th=[   10], 10.00th=[   13], 20.00th=[   18],
>>      | 30.00th=[   21], 40.00th=[   26], 50.00th=[   31], 60.00th=[   34],
>>      | 70.00th=[   37], 80.00th=[   41], 90.00th=[   48], 95.00th=[   61],
>>      | 99.00th=[  404], 99.50th=[  445], 99.90th=[  494], 99.95th=[  515],
>>      | 99.99th=[  553]
>>     bw (KB /s): min=    0, max=  694, per=1.46%, avg=383.29, stdev=148.01
>>     lat (usec) : 1000=0.01%
>>     lat (msec) : 2=0.12%, 4=0.63%, 10=4.82%, 20=22.33%, 50=63.97%
>>     lat (msec) : 100=5.61%, 250=0.47%, 500=2.01%, 750=0.08%
>>   cpu          : usr=0.69%, sys=2.57%, ctx=1525021, majf=0, minf=2405
>>   IO depths    : 1=1.1%, 2=0.6%, 4=335.8%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued    : total=r=0/w=655015/d=0, short=r=0/w=0/d=0
>>      latency   : target=0, window=0, percentile=100.00%, depth=4
>>
>> Run status group 0 (all jobs):
>>   WRITE: io=2559.5MB, aggrb=26178KB/s, minb=26178KB/s, maxb=26178KB/s,
>>          mint=100116msec, maxt=100116msec
>>
>> Disk stats (read/write):
>>   rbd1: ios=0/2408612, merge=0/979004, ticks=0/39436432,
>>         in_queue=39459720, util=99.68%
>>
>> - In the second test I created only one large image (4.2T):
>>
>> /dev/rbd1 on /mnt/fiotest/vtest0 type ext4
>> (rw,noatime,nodiratime,data=ordered)
>>
>> fio output:
>>
>> rand-write-4k: (groupid=0, jobs=70): err= 0: pid=8907: Wed Jul  9 13:38:14 2014
>>   write: io=2264.6MB, bw=23143KB/s, iops=5783, runt=100198msec
>>     slat (usec): min=0, max=3099.8K, avg=4131.91, stdev=21388.98
>>     clat (usec): min=850, max=3133.1K, avg=43337.56, stdev=93830.42
>>      lat (usec): min=930, max=3147.5K, avg=48253.22, stdev=100642.53
>>     clat percentiles (msec):
>>      |  1.00th=[    5],  5.00th=[   11], 10.00th=[   14], 20.00th=[   19],
>>      | 30.00th=[   24], 40.00th=[   29], 50.00th=[   33], 60.00th=[   36],
>>      | 70.00th=[   39], 80.00th=[   43], 90.00th=[   51], 95.00th=[   68],
>>      | 99.00th=[  506], 99.50th=[  553], 99.90th=[  717], 99.95th=[  783],
>>      | 99.99th=[ 3130]
>>     bw (KB /s): min=    0, max=  680, per=1.54%, avg=355.39, stdev=156.10
>>     lat (usec) : 1000=0.01%
>>     lat (msec) : 2=0.12%, 4=0.66%, 10=4.21%, 20=17.82%, 50=66.95%
>>     lat (msec) : 100=7.34%, 250=0.78%, 500=1.10%, 750=0.99%, 1000=0.02%
>>     lat (msec) : >=2000=0.04%
>>   cpu          : usr=0.65%, sys=2.45%, ctx=1434322, majf=0, minf=2399
>>   IO depths    : 1=0.2%, 2=0.1%, 4=365.4%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued    : total=r=0/w=579510/d=0, short=r=0/w=0/d=0
>>      latency   : target=0, window=0, percentile=100.00%, depth=4
>>
>> Run status group 0 (all jobs):
>>   WRITE: io=2264.6MB, aggrb=23142KB/s, minb=23142KB/s, maxb=23142KB/s,
>>          mint=100198msec, maxt=100198msec
>>
>> Disk stats (read/write):
>>   rbd1: ios=0/2295106, merge=0/926648, ticks=0/39660664,
>>         in_queue=39706288, util=99.80%
>>
>> It seems that latency is more stable in the first case.
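>>
>> The job file for both runs was along these lines (a minimal sketch
>> reconstructed from the options above; runtime and group_reporting are
>> assumptions on my part):
>>
>> [global]
>> ioengine=libaio
>> direct=1
>> rw=randwrite
>> bs=4k
>> size=1g
>> iodepth=4
>> numjobs=70
>> runtime=100        ; inferred from runt being ~100s in the output
>> group_reporting    ; matches the aggregated (jobs=70) reporting
>>
>> [rand-write-4k]
>> ; first test: fio distributes a colon-separated directory list evenly
>> ; across the numjobs clones, one mount point per job (list abbreviated)
>> directory=/mnt/fiotest/vtest0:/mnt/fiotest/vtest1:...:/mnt/fiotest/vtest69
>> ; second test: all 70 clones write files under the single mount point
>> ; directory=/mnt/fiotest/vtest0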
>
> So I guess what comes to mind is that when you have all of the fio
> processes writing to files on a single filesystem, there's now another
> whole layer of locks and contention. Not sure how likely this is,
> though. Josh might be able to chime in if there's something on the
> RBD side that could slow this kind of use case down.

Yes, that's true: in the first case there are 70 filesystems with one
thread on each, and in the second only one FS accessed by 70 threads. I
had not thought about the FS layer, and it may well impose a penalty.
Thanks for this ;-)

>
>>
>>> In case 3, do you have multiple fio jobs going or just 1?
>>
>> In all three cases, I am using one fio process with NUMJOBS=70.
>
> Is RBD cache enabled? It's interesting that librbd is so much slower
> in this case than kernel RBD for you. If anything I would have
> expected the opposite.

Yes, the rbd cache is enabled with default values:

[client]
rbd cache = true
#rbd cache size =
#rbd cache max dirty =
#rbd cache target dirty =
#rbd cache max dirty age =
rbd cache writethrough until flush = true

>>>>
>>>> thanks in advance for any help,
>>>> Xabier
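
PS: for completeness, the job file for the third case was roughly the
following (a minimal sketch patterned on fio's rbd engine example; the
pool and image names are illustrative):

[global]
ioengine=rbd
clientname=admin
pool=fiopool
rbdname=vtest0
invalidate=0       ; recommended in fio's rbd example job file
rw=randwrite
bs=4k
iodepth=4
direct=1
numjobs=70
runtime=100
group_reporting

[rand-write-4k]

The effective rbd cache values can also be double-checked through a
client admin socket (assuming "admin socket = ..." is set under
[client]), e.g.:

ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok config show | grep rbd_cache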