That's probably because the krbd version you are using doesn't have the TCP_NODELAY patch. We have submitted it (and you can build it from the latest rbd source), but I am not sure when it will land in the Linux mainline kernel.

Thanks & Regards
Somnath
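(For context: TCP_NODELAY disables Nagle's algorithm so small writes are sent out immediately instead of being coalesced, which matters for small, latency-sensitive I/O such as 4k RBD writes. A minimal userspace sketch of the socket option in question follows; this is only an illustration of the option, not the krbd kernel patch itself, and the function name is just a placeholder.)

/* Illustration only: set TCP_NODELAY on an already-connected TCP socket. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int enable_nodelay(int sock_fd)
{
    int one = 1;
    /* Returns 0 on success, -1 on failure with errno set. */
    return setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}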
From: Rafael Lopez [mailto:rafael.lopez@xxxxxxxxxx]

Ok, I ran the two tests again with direct=1, a smaller block size (4k) and a smaller total io (100m), and disabled the cache on the client side by adding the following to ceph.conf:

[client]
rbd cache = false
rbd cache max dirty = 0
rbd cache size = 0
rbd cache target dirty = 0

The result has swapped around: now the librbd job is running ~50% faster than the krbd job!

####### krbd job:

[root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=16
fio-2.2.8
Starting 1 process
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/571KB/0KB /s] [0/142/0 iops] [eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=29095: Fri Sep 11 14:48:21 2015
  write: io=102400KB, bw=647137B/s, iops=157, runt=162033msec
    clat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
     lat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
    clat percentiles (usec):
     |  1.00th=[ 2896],  5.00th=[ 4320], 10.00th=[ 4768], 20.00th=[ 5536],
     | 30.00th=[ 5920], 40.00th=[ 6176], 50.00th=[ 6432], 60.00th=[ 6624],
     | 70.00th=[ 6816], 80.00th=[ 7136], 90.00th=[ 7584], 95.00th=[ 7968],
     | 99.00th=[ 9024], 99.50th=[ 9664], 99.90th=[15808], 99.95th=[17536],
     | 99.99th=[19328]
    bw (KB /s): min=  506, max= 1171, per=100.00%, avg=632.22, stdev=104.77
    lat (msec) : 4=2.88%, 10=96.69%, 20=0.43%, 50=0.01%
  cpu          : usr=0.17%, sys=0.71%, ctx=25634, majf=0, minf=35
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=25600/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=631KB/s, minb=631KB/s, maxb=631KB/s, mint=162033msec, maxt=162033msec

Disk stats (read/write):
  rbd0: ios=0/25638, merge=0/32, ticks=0/160765, in_queue=160745, util=99.11%
[root@rcprsdc1r72-01-ac rafaell]#

###### librbd job:

[root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_test
job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=16
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/703KB/0KB /s] [0/175/0 iops] [eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=30568: Fri Sep 11 14:50:24 2015
  write: io=102400KB, bw=950141B/s, iops=231, runt=110360msec
    slat (usec): min=70, max=992, avg=115.05, stdev=30.07
    clat (msec): min=13, max=117, avg=67.91, stdev=24.93
     lat (msec): min=13, max=117, avg=68.03, stdev=24.93
    clat percentiles (msec):
     |  1.00th=[   19],  5.00th=[   26], 10.00th=[   38], 20.00th=[   40],
     | 30.00th=[   46], 40.00th=[   62], 50.00th=[   77], 60.00th=[   85],
     | 70.00th=[   88], 80.00th=[   91], 90.00th=[   95], 95.00th=[   99],
     | 99.00th=[  105], 99.50th=[  110], 99.90th=[  116], 99.95th=[  117],
     | 99.99th=[  118]
    bw (KB /s): min=  565, max= 3174, per=100.00%, avg=935.74, stdev=407.67
    lat (msec) : 20=2.41%, 50=29.85%, 100=64.46%, 250=3.29%
  cpu          : usr=2.43%, sys=0.29%, ctx=7847, majf=0, minf=2750
  IO depths    : 1=6.2%, 2=12.5%, 4=25.0%, 8=50.0%, 16=6.2%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=94.1%, 8=0.0%, 16=5.9%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=25600/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=927KB/s, minb=927KB/s, maxb=927KB/s, mint=110360msec, maxt=110360msec

Disk stats (read/write):
    dm-1: ios=240/369, merge=0/0, ticks=742/40, in_queue=782, util=0.38%, aggrios=240/379, aggrmerge=0/19, aggrticks=742/41, aggrin_queue=783, aggrutil=0.39%
  sda: ios=240/379, merge=0/19, ticks=742/41, in_queue=783, util=0.39%
[root@rcprsdc1r72-01-ac rafaell]#

Confirmed the speed (at least for krbd) using dd:

[root@rcprsdc1r72-01-ac rafaell]# dd if=/mnt/ssd/random100g of=/mnt/rbd/dd_io_test bs=4k count=10000 oflag=direct
10000+0 records in
10000+0 records out
40960000 bytes (41 MB) copied, 64.9799 s, 630 kB/s
[root@rcprsdc1r72-01-ac rafaell]#

Back to fio: the gap is even wider at 1M block size (librbd is about ~100% faster).

1M librbd:
Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=112641KB/s, minb=112641KB/s, maxb=112641KB/s, mint=9309msec, maxt=9309msec

1M krbd:
Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=49939KB/s, minb=49939KB/s, maxb=49939KB/s, mint=20997msec, maxt=20997msec

Raf

On 11 September 2015 at 14:33, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:

Only changing the client-side ceph.conf and rerunning the tests is sufficient.

Thanks & Regards
Somnath
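(The modified job files for this rerun are not shown in the thread. Based on the parameters described above — direct=1, bs=4k, size=100m — and the original job files further down, the krbd variant was presumably something like the sketch below; the librbd variant would additionally carry ioengine=rbd, rbdname=nas1-rds-stg31 and pool=rbd.)

; -- hypothetical reconstruction of the 4k direct krbd job, not quoted verbatim in the thread --
[global]
rw=rw
size=100m
filename=/mnt/rbd/fio_test_file_ext4
rwmixread=0
rwmixwrite=100
percentage_random=0
bs=4k
direct=1
iodepth=16
numjobs=1
[job1]
; -- end job file --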
From: Rafael Lopez [mailto:rafael.lopez@xxxxxxxxxx]

Thanks for the quick reply Somnath, will give this a try. In order to change the rbd cache settings, is it just a matter of updating the ceph.conf file on the client before running the test, or do I need to inject args to all OSDs?

Raf

On 11 September 2015 at 13:39, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:

It may be the effect of the rbd cache. Try the following: run your test with direct=1 in both cases and with rbd_cache = false (disable all the other rbd cache options as well). This should give you results similar to krbd. In the direct=1 case we saw ~10-20% degradation with rbd_cache = true, but in the direct=0 case it could be more, as you are seeing. I think there is a delta (or a need to tune properly) if you want to use the rbd cache.

Thanks & Regards
Somnath
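(For concreteness, fully disabling the cache on the client means a ceph.conf stanza along these lines — the same options Rafael reports adding in his follow-up above; adjust to taste, this is only a sketch:)

[client]
rbd cache = false
rbd cache max dirty = 0
rbd cache size = 0
rbd cache target dirty = 0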
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Rafael Lopez

Hi all,

I am seeing a big discrepancy between librbd and kRBD/ext4 performance using FIO with a single RBD image. The RBD images come from the same RBD pool, with the same size and settings for both. The librbd results are quite poor by comparison, and if I scale up the kRBD FIO job with more jobs/threads it increases to 3-4x the results below, while librbd doesn't seem to scale much at all. I figured it should at least be close to the kRBD result for a single job/thread, before parallelism comes into play. RBD cache settings are all default. I can see some obvious differences in the FIO output, but not being well versed with FIO I'm not sure what to make of them or where to start diagnosing the discrepancy. I've hunted around
but haven't found anything useful, any suggestions/insights would be appreciated.

RBD cache settings:

[root@rcmktdc1r72-09-ac rafaell]# ceph --admin-daemon /var/run/ceph/ceph-osd.659.asok config show | grep rbd_cache
    "rbd_cache": "true",
    "rbd_cache_writethrough_until_flush": "true",
    "rbd_cache_size": "33554432",
    "rbd_cache_max_dirty": "25165824",
    "rbd_cache_target_dirty": "16777216",
    "rbd_cache_max_dirty_age": "1",
    "rbd_cache_max_dirty_object": "0",
    "rbd_cache_block_writes_upfront": "false",
[root@rcmktdc1r72-09-ac rafaell]#

This is the FIO job file for the kRBD job:

[root@rcprsdc1r72-01-ac rafaell]# cat ext4_test
; -- start job file --
[global]
rw=rw
size=100g
filename=/mnt/rbd/fio_test_file_ext4
rwmixread=0
rwmixwrite=100
percentage_random=0
bs=1024k
direct=0
iodepth=16
thread=1
numjobs=1
[job1]
; -- end job file --
[root@rcprsdc1r72-01-ac rafaell]#

This is the FIO job file for the librbd job:

[root@rcprsdc1r72-01-ac rafaell]# cat fio_rbd_test
; -- start job file --
[global]
rw=rw
size=100g
rwmixread=0
rwmixwrite=100
percentage_random=0
bs=1024k
direct=0
iodepth=16
thread=1
numjobs=1
ioengine=rbd
rbdname=nas1-rds-stg31
pool=rbd
[job1]
; -- end job file --
[root@rcprsdc1r72-01-ac rafaell]#

Here are the results:

[root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
job1: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=sync, iodepth=16
fio-2.2.8
Starting 1 thread
job1: Laying out IO file(s) (1 file(s) / 102400MB)
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/321.7MB/0KB /s] [0/321/0 iops] [eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=37981: Fri Sep 11 12:33:13 2015
  write: io=102400MB, bw=399741KB/s, iops=390, runt=262314msec
    clat (usec): min=411, max=574082, avg=2492.91, stdev=7316.96
     lat (usec): min=418, max=574113, avg=2520.12, stdev=7318.53
    clat percentiles (usec):
     |  1.00th=[  446],  5.00th=[  458], 10.00th=[  474], 20.00th=[  510],
     | 30.00th=[ 1064], 40.00th=[ 1096], 50.00th=[ 1160], 60.00th=[ 1320],
     | 70.00th=[ 1592], 80.00th=[ 2448], 90.00th=[ 7712], 95.00th=[ 7904],
     | 99.00th=[11072], 99.50th=[11712], 99.90th=[13120], 99.95th=[73216],
     | 99.99th=[464896]
    bw (KB /s): min=  264, max=2156544, per=100.00%, avg=412986.27, stdev=375092.66
    lat (usec) : 500=18.68%, 750=7.43%, 1000=2.11%
    lat (msec) : 2=48.89%, 4=4.35%, 10=16.79%, 20=1.67%, 50=0.03%
    lat (msec) : 100=0.03%, 250=0.02%, 500=0.01%, 750=0.01%
  cpu          : usr=1.24%, sys=45.38%, ctx=19298, majf=0, minf=974
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=102400/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=102400MB, aggrb=399740KB/s, minb=399740KB/s, maxb=399740KB/s, mint=262314msec, maxt=262314msec

Disk stats (read/write):
  rbd0: ios=0/150890, merge=0/49, ticks=0/36117700, in_queue=36145277, util=96.97%
[root@rcprsdc1r72-01-ac rafaell]#

[root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_test
job1: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=rbd, iodepth=16
fio-2.2.8
Starting 1 thread
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/65405KB/0KB /s] [0/63/0 iops] [eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=43960: Fri Sep 11 12:54:25 2015
  write: io=102400MB, bw=121882KB/s, iops=119, runt=860318msec
    slat (usec): min=355, max=7300, avg=908.97, stdev=361.02
    clat (msec): min=11, max=1468, avg=129.59, stdev=130.68
     lat (msec): min=12, max=1468, avg=130.50, stdev=130.69
    clat percentiles (msec):
     |  1.00th=[   21],  5.00th=[   26], 10.00th=[   29], 20.00th=[   34],
     | 30.00th=[   37], 40.00th=[   40], 50.00th=[   44], 60.00th=[   63],
     | 70.00th=[  233], 80.00th=[  241], 90.00th=[  269], 95.00th=[  367],
     | 99.00th=[  553], 99.50th=[  652], 99.90th=[  832], 99.95th=[  848],
     | 99.99th=[ 1369]
    bw (KB /s): min=20363, max=248543, per=100.00%, avg=124381.19, stdev=42313.29
    lat (msec) : 20=0.95%, 50=55.27%, 100=5.55%, 250=24.83%, 500=12.28%
    lat (msec) : 750=0.89%, 1000=0.21%, 2000=0.01%
  cpu          : usr=9.58%, sys=1.15%, ctx=23883, majf=0, minf=2751023
  IO depths    : 1=1.2%, 2=3.0%, 4=9.7%, 8=68.3%, 16=17.8%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=92.5%, 8=4.3%, 16=3.2%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=102400/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=102400MB, aggrb=121882KB/s, minb=121882KB/s, maxb=121882KB/s, mint=860318msec, maxt=860318msec

Disk stats (read/write):
    dm-1: ios=0/2072, merge=0/0, ticks=0/233, in_queue=233, util=0.01%, aggrios=1/2249, aggrmerge=7/559, aggrticks=9/254, aggrin_queue=261, aggrutil=0.01%
  sda: ios=1/2249, merge=7/559, ticks=9/254, in_queue=261, util=0.01%
[root@rcprsdc1r72-01-ac rafaell]#

Cheers,
Raf
--
Rafael Lopez
Data Storage Administrator
+61 3 990 59118
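(The thread doesn't show how the kRBD target was prepared. Given filename=/mnt/rbd/fio_test_file_ext4 in the ext4_test job above, the setup was presumably something like the commands below; the image name is only an illustrative placeholder, not taken from the thread.)

# presumed krbd-side setup -- image name is illustrative
rbd map rbd/krbd-test-image     # maps the image to a block device, e.g. /dev/rbd0
mkfs.ext4 /dev/rbd0             # the kRBD job writes to a file on an ext4 filesystem
mount /dev/rbd0 /mnt/rbd        # mount point used by filename= in ext4_test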