If necessary, there are RPM files for CentOS 7:
2015-06-17 8:01 GMT+03:00 Alexandre DERUMIER <aderumier@xxxxxxxxx>:
Hi,
I finally fixed it with tcmalloc, using:
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" qemu ...
I got almost the same result as jemalloc in this case, maybe a little bit faster.
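
For reference, a minimal wrapper sketch of that invocation (the qemu binary path, library path and guest options below are placeholders for whatever your setup uses):

#!/bin/sh
# sketch: preload tcmalloc_minimal and raise its total thread-cache limit
# to 256 MiB (268435456 bytes) before exec'ing the guest
export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=$((256 * 1024 * 1024))
export LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4
exec /usr/bin/qemu-system-x86_64 "$@"
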
Here are the iops results for 1 qemu VM with one iothread per disk (iodepth=32, 4k randread, nocache):
qemu randread 4k nocache libc6 iops
1 disk 29052
2 disks 55878
4 disks 127899
8 disks 240566
15 disks 269976
qemu randread 4k nocache jemalloc iops
1 disk 41278
2 disks 75781
4 disks 195351
8 disks 294241
15 disks 298199
qemu randread 4k nocache tcmalloc 16M cache iops
1 disk 37911
2 disks 67698
4 disks 41076
8 disks 43312
15 disks 37569
qemu randread 4k nocache tcmalloc patched 256M iops
1 disk no-iothread
1 disk 42160
2 disks 83135
4 disks 194591
8 disks 306038
15 disks 302278
----- Original Message -----
From: "aderumier" <aderumier@xxxxxxxxx>
To: "Mark Nelson" <mnelson@xxxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Tuesday, June 16, 2015 20:27:54
Subject: Re: rbd_cache, limiting read on high iops around 40k
>>I forgot to ask, is this with the patched version of tcmalloc that
>>theoretically fixes the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES issue?
Yes, the patched version of tcmalloc, and also the latest version from gperftools git.
(I'm talking about qemu here, not osds).
I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES, but it doesn't help.
For OSDs, increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES does help.
(Benchmarks are still running; I'm trying to load them as much as possible.)
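
(For the osd side, a minimal sketch of what I mean, assuming you start the daemon by hand; with an init script you would export the variable in the daemon's environment instead, and the 128 MiB value is just an example:)

TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=$((128 * 1024 * 1024)) ceph-osd -i 0
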
----- Original Message -----
From: "Mark Nelson" <mnelson@xxxxxxxxxx>
To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Tuesday, June 16, 2015 19:04:27
Subject: Re: rbd_cache, limiting read on high iops around 40k
I forgot to ask, is this with the patched version of tcmalloc that
theoretically fixes the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES issue?
Mark
On 06/16/2015 11:46 AM, Mark Nelson wrote:
> Hi Alexandre,
>
> Excellent find! Have you also informed the QEMU developers of your
> discovery?
>
> Mark
>
> On 06/16/2015 11:38 AM, Alexandre DERUMIER wrote:
>> Hi,
>>
>> some news about qemu with tcmalloc vs jemalloc.
>>
>> I'm testing with multiple disks (with iothreads) in 1 qemu guest.
>>
>> And while tcmalloc is a little faster than jemalloc,
>>
>> I have hit the
>> tcmalloc::ThreadCache::ReleaseToCentralCache bug many times.
>>
>> increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES doesn't help.
>>
>>
>> with multiple disks, I'm around 200k iops with tcmalloc (before hitting
>> the bug) and 350k iops with jemalloc.
>>
>> The problem is that when I hit the malloc bug, I'm around 4000-10000 iops,
>> and the only way to fix it is to restart qemu ...
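>>
>> A hypothetical way to confirm where the time goes, assuming perf is
>> available on the host and the guest process is qemu-system-x86_64
>> (look for tcmalloc::ThreadCache::ReleaseToCentralCache near the top of
>> the profile):
>>
>> perf top -g -p "$(pidof qemu-system-x86_64)"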
>>
>>
>>
>> ----- Original Message -----
>> From: "pushpesh sharma" <pushpesh.eck@xxxxxxxxx>
>> To: "aderumier" <aderumier@xxxxxxxxx>
>> Cc: "Somnath Roy" <Somnath.Roy@xxxxxxxxxxx>, "Irek Fasikhov"
>> <malmyzh@xxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>,
>> "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
>> Sent: Friday, June 12, 2015 08:58:21
>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>
>> Thanks, I posted the question on the openstack list. Hopefully I will get
>> some expert opinion.
>>
>> On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER
>> <aderumier@xxxxxxxxx> wrote:
>>> Hi,
>>>
>>> here is a libvirt xml sample from the libvirt src
>>>
>>> (you need to define the <iothreads> count, then assign them to the disks).
>>>
>>> I don't use openstack, so I really don't know how it works with it.
>>>
>>>
>>> <domain type='qemu'>
>>> <name>QEMUGuest1</name>
>>> <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
>>> <memory unit='KiB'>219136</memory>
>>> <currentMemory unit='KiB'>219136</currentMemory>
>>> <vcpu placement='static'>2</vcpu>
>>> <iothreads>2</iothreads>
>>> <os>
>>> <type arch='i686' machine='pc'>hvm</type>
>>> <boot dev='hd'/>
>>> </os>
>>> <clock offset='utc'/>
>>> <on_poweroff>destroy</on_poweroff>
>>> <on_reboot>restart</on_reboot>
>>> <on_crash>destroy</on_crash>
>>> <devices>
>>> <emulator>/usr/bin/qemu</emulator>
>>> <disk type='file' device='disk'>
>>> <driver name='qemu' type='raw' iothread='1'/>
>>> <source file='/var/lib/libvirt/images/iothrtest1.img'/>
>>> <target dev='vdb' bus='virtio'/>
>>> <address type='pci' domain='0x0000' bus='0x00' slot='0x04'
>>> function='0x0'/>
>>> </disk>
>>> <disk type='file' device='disk'>
>>> <driver name='qemu' type='raw' iothread='2'/>
>>> <source file='/var/lib/libvirt/images/iothrtest2.img'/>
>>> <target dev='vdc' bus='virtio'/>
>>> </disk>
>>> <controller type='usb' index='0'/>
>>> <controller type='ide' index='0'/>
>>> <controller type='pci' index='0' model='pci-root'/>
>>> <memballoon model='none'/>
>>> </devices>
>>> </domain>
>>>
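>>> For reference, a rough qemu command-line equivalent of the iothread part
>>> (a sketch only; the drive paths, ids and the other guest options are
>>> placeholders):
>>>
>>> qemu-system-x86_64 [other guest options] \
>>>   -object iothread,id=iothread1 \
>>>   -object iothread,id=iothread2 \
>>>   -drive file=/var/lib/libvirt/images/iothrtest1.img,format=raw,if=none,id=drive-vdb \
>>>   -device virtio-blk-pci,drive=drive-vdb,iothread=iothread1 \
>>>   -drive file=/var/lib/libvirt/images/iothrtest2.img,format=raw,if=none,id=drive-vdc \
>>>   -device virtio-blk-pci,drive=drive-vdc,iothread=iothread2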
>>>
>>> ----- Original Message -----
>>> From: "pushpesh sharma" <pushpesh.eck@xxxxxxxxx>
>>> To: "aderumier" <aderumier@xxxxxxxxx>
>>> Cc: "Somnath Roy" <Somnath.Roy@xxxxxxxxxxx>, "Irek Fasikhov"
>>> <malmyzh@xxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>,
>>> "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
>>> Sent: Friday, June 12, 2015 07:52:41
>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>>
>>> Hi Alexandre,
>>>
>>> I agree with your rationale of one iothread per disk. CPU consumed in
>>> iowait is pretty high in each VM. But I am not finding a way to set
>>> the same on a nova instance. I am using openstack Juno with QEMU+KVM.
>>> As per the libvirt documentation for setting iothreads, I can edit
>>> domain.xml directly and achieve the same effect. However, in an
>>> openstack env the domain xml is created by nova with some additional
>>> metadata, so editing the domain xml using 'virsh edit' does not seem
>>> to work (I agree it is not a very cloud way of doing things, but a
>>> hack). Changes made there vanish after saving them, because libvirt
>>> validation fails on the same.
>>>
>>> #virsh dumpxml instance-000000c5 > vm.xml
>>> #virt-xml-validate vm.xml
>>> Relax-NG validity error : Extra element cpu in interleave
>>> vm.xml:1: element domain: Relax-NG validity error : Element domain
>>> failed to validate content
>>> vm.xml fails to validate
>>>
>>> The second approach I took was setting QoS in volume types. But there
>>> is no option to set iothreads per volume; the parameters there are
>>> related to max read/write ops/bytes.
>>>
>>> Thirdly, editing the Nova flavor and providing extra specs like
>>> hw:cpu_socket/thread/core can change the guest CPU topology, but again
>>> there is no way to set iothreads. It does accept hw_disk_iothreads (no
>>> type check in place, I believe), but it cannot pass the same into domain.xml.
>>>
>>> Could you suggest a way to set the same?
>>>
>>> -Pushpesh
>>>
>>> On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER
>>> <aderumier@xxxxxxxxx> wrote:
>>>>>> I need to try out the performance on qemu soon and may come back
>>>>>> to you if I need some qemu setting trick :-)
>>>>
>>>> Sure no problem.
>>>>
>>>> (BTW, I can reach around 200k iops in 1 qemu vm with 5 virtio disks
>>>> with 1 iothread per disk)
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Somnath Roy" <Somnath.Roy@xxxxxxxxxxx>
>>>> To: "aderumier" <aderumier@xxxxxxxxx>, "Irek Fasikhov"
>>>> <malmyzh@xxxxxxxxx>
>>>> Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>, "pushpesh sharma"
>>>> <pushpesh.eck@xxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
>>>> Sent: Wednesday, June 10, 2015 09:06:32
>>>> Subject: RE: rbd_cache, limiting read on high iops around 40k
>>>>
>>>> Hi Alexandre,
>>>> Thanks for sharing the data.
>>>> I need to try out the performance on qemu soon and may come back to
>>>> you if I need some qemu setting trick :-)
>>>>
>>>> Regards
>>>> Somnath
>>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
>>>> Behalf Of Alexandre DERUMIER
>>>> Sent: Tuesday, June 09, 2015 10:42 PM
>>>> To: Irek Fasikhov
>>>> Cc: ceph-devel; pushpesh sharma; ceph-users
>>>> Subject: Re: rbd_cache, limiting read on high iops
>>>> around 40k
>>>>
>>>>>> Very good work!
>>>>>> Do you have an rpm file?
>>>>>> Thanks.
>>>> no, sorry, I have compiled it manually (and I'm using debian jessie
>>>> as the client)
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Irek Fasikhov" <malmyzh@xxxxxxxxx>
>>>> To: "aderumier" <aderumier@xxxxxxxxx>
>>>> Cc: "Robert LeBlanc" <robert@xxxxxxxxxxxxx>, "ceph-devel"
>>>> <ceph-devel@xxxxxxxxxxxxxxx>, "pushpesh sharma"
>>>> <pushpesh.eck@xxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
>>>> Sent: Wednesday, June 10, 2015 07:21:42
>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>>>
>>>> Hi, Alexandre.
>>>>
>>>> Very good work!
>>>> Do you have an rpm file?
>>>> Thanks.
>>>>
>>>> 2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderumier@xxxxxxxxx > :
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I have tested qemu with the latest tcmalloc 2.4, and the improvement is
>>>> huge with an iothread: 50k iops (+45%)!
>>>>
>>>>
>>>>
>>>> qemu : no iothread : glibc : iops=33395
>>>> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
>>>> qemu : no-iothread : jemalloc : iops=42226 (+26%)
>>>> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
>>>>
>>>>
>>>> qemu : iothread : glibc : iops=34516
>>>> qemu : iothread : tcmalloc : iops=38676 (+12%)
>>>> qemu : iothread : jemalloc : iops=28023 (-19%)
>>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
>>>> ------------------------------------------------------
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=libaio, iodepth=32
>>>> fio-2.1.11
>>>> Starting 1 process
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0
>>>> iops] [eta 00m:00s]
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10
>>>> 05:54:24 2015 read : io=5120.0MB, bw=201108KB/s, iops=50276, runt=
>>>> 26070msec slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58 clat
>>>> (usec): min=128, max=6262, avg=631.41, stdev=197.71 lat (usec):
>>>> min=149, max=6265, avg=635.27, stdev=197.40 clat percentiles (usec):
>>>> | 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474],
>>>> | 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652],
>>>> | 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980],
>>>> | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896],
>>>> | 99.99th=[ 3760]
>>>> bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00,
>>>> stdev=21718.87 lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%,
>>>> 1000=16.63% lat (msec) : 2=4.46%, 4=0.03%, 10=0.01% cpu : usr=9.73%,
>>>> sys=24.93%, ctx=66417, majf=0, minf=38 IO depths : 1=0.1%, 2=0.1%,
>>>> 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% submit : 0=0.0%,
>>>> 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete :
>>>> 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0 latency :
>>>> target=0, window=0, percentile=100.00%, depth=32
>>>>
>>>> Run status group 0 (all jobs):
>>>> READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s,
>>>> maxb=201107KB/s, mint=26070msec, maxt=26070msec
>>>>
>>>> Disk stats (read/write):
>>>> vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840,
>>>> util=99.73%
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=libaio, iodepth=32
>>>> fio-2.1.11
>>>> Starting 1 process
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0
>>>> iops] [eta 00m:00s]
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10
>>>> 06:05:06 2015 read : io=5120.0MB, bw=143897KB/s, iops=35974, runt=
>>>> 36435msec slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35 clat
>>>> (usec): min=191, max=4740, avg=884.66, stdev=315.65 lat (usec):
>>>> min=289, max=4743, avg=888.31, stdev=315.51 clat percentiles (usec):
>>>> | 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596],
>>>> | 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940],
>>>> | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416],
>>>> | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640],
>>>> | 99.99th=[ 3632]
>>>> bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11,
>>>> stdev=21782.39 lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%,
>>>> 1000=30.01% lat (msec) : 2=29.74%, 4=1.07%, 10=0.01% cpu :
>>>> usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38 IO depths :
>>>> 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% submit
>>>> : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>>>> >=64=0.0% issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>>
>>>> Run status group 0 (all jobs):
>>>> READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s,
>>>> maxb=143896KB/s, mint=36435msec, maxt=36435msec
>>>>
>>>> Disk stats (read/write):
>>>> vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716,
>>>> util=99.85%
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "aderumier" < aderumier@xxxxxxxxx >
>>>> To: "Robert LeBlanc" < robert@xxxxxxxxxxxxx >
>>>> Cc: "Mark Nelson" < mnelson@xxxxxxxxxx >, "ceph-devel" <
>>>> ceph-devel@xxxxxxxxxxxxxxx >, "pushpesh sharma" <
>>>> pushpesh.eck@xxxxxxxxx >, "ceph-users" < ceph-users@xxxxxxxxxxxxxx >
>>>> Sent: Tuesday, June 9, 2015 18:47:27
>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>>>
>>>> Hi Robert,
>>>>
>>>>>> What I found was that Ceph OSDs performed well with either
>>>>>> tcmalloc or
>>>>>> jemalloc (except when RocksDB was built with jemalloc instead of
>>>>>> tcmalloc, I'm still working to dig into why that might be the case).
>>>> yes, from my tests, for the osd tcmalloc is a little faster (but only
>>>> very little) than jemalloc.
>>>>
>>>>
>>>>
>>>>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to
>>>>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
>>>>>> better for QEMU/KVM in the tests that we ran. [1]
>>>>
>>>>
>>>> I have just done a qemu test (4k randread - rbd_cache=off); I don't see
>>>> a speed regression with tcmalloc.
>>>> with a qemu iothread, tcmalloc has a speed increase over glibc
>>>> with a qemu iothread, jemalloc has a speed decrease
>>>>
>>>> without an iothread, jemalloc has a big speed increase
>>>>
>>>> this is with
>>>> -qemu 2.3
>>>> -tcmalloc 2.2.1
>>>> -jemalloc 3.6
>>>> -libc6 2.19
>>>>
>>>>
>>>> qemu : no iothread : glibc : iops=33395
>>>> qemu : no-iothread : tcmalloc : iops=34516 (+3%)
>>>> qemu : no-iothread : jemalloc : iops=42226 (+26%)
>>>>
>>>> qemu : iothread : glibc : iops=34516
>>>> qemu : iothread : tcmalloc : iops=38676 (+12%)
>>>> qemu : iothread : jemalloc : iops=28023 (-19%)
>>>>
>>>>
>>>> (The benefit of iothreads is that we can scale with more disks in 1vm)
>>>>
>>>>
>>>> fio results:
>>>> ------------
>>>>
>>>> qemu : iothread : tcmalloc : iops=38676
>>>> -----------------------------------------
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=libaio, iodepth=32
>>>> fio-2.1.11
>>>> Starting 1 process
>>>> Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0
>>>> iops] [eta 00m:00s]
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9
>>>> 18:16:53 2015
>>>> read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec
>>>> slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42
>>>> clat (usec): min=152, max=5736, avg=822.12, stdev=289.34
>>>> lat (usec): min=231, max=5740, avg=826.10, stdev=289.08
>>>> clat percentiles (usec):
>>>> | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572],
>>>> | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852],
>>>> | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352],
>>>> | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448],
>>>> | 99.99th=[ 3888]
>>>> bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40,
>>>> stdev=16978.03
>>>> lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63%
>>>> lat (msec) : 2=21.65%, 4=0.37%, 10=0.01%
>>>> cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
>>>> >=64=0.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>>>> >=64=0.0%
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>>
>>>> Run status group 0 (all jobs):
>>>> READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s,
>>>> maxb=154707KB/s, mint=33889msec, maxt=33889msec
>>>>
>>>> Disk stats (read/write):
>>>> vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096,
>>>> util=99.77%
>>>>
>>>>
>>>>
>>>> qemu : no-iothread : tcmalloc : iops=34516
>>>> ---------------------------------------------
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0
>>>> iops] [eta 00m:00s]
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9
>>>> 18:19:08 2015
>>>> read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec
>>>> slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57
>>>> clat (usec): min=208, max=11858, avg=921.43, stdev=333.61
>>>> lat (usec): min=266, max=11862, avg=925.77, stdev=333.40
>>>> clat percentiles (usec):
>>>> | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652],
>>>> | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940],
>>>> | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576],
>>>> | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248],
>>>> | 99.99th=[ 4320]
>>>> bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88,
>>>> stdev=16883.77
>>>> lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60%
>>>> lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01%
>>>> cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
>>>> >=64=0.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>>>> >=64=0.0%
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>>
>>>> Run status group 0 (all jobs):
>>>> READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s,
>>>> maxb=138064KB/s, mint=37974msec, maxt=37974msec
>>>>
>>>> Disk stats (read/write):
>>>> vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396,
>>>> util=99.86%
>>>>
>>>>
>>>>
>>>> qemu : iothread : glibc : iops=34516
>>>> -------------------------------------
>>>>
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=libaio, iodepth=32
>>>> fio-2.1.11
>>>> Starting 1 process
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0
>>>> iops] [eta 00m:00s]
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9
>>>> 18:24:01 2015
>>>> read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec
>>>> slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66
>>>> clat (usec): min=283, max=7515, avg=923.34, stdev=300.28
>>>> lat (usec): min=286, max=7519, avg=927.58, stdev=300.02
>>>> clat percentiles (usec):
>>>> | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
>>>> | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964],
>>>> | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448],
>>>> | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640],
>>>> | 99.99th=[ 3984]
>>>> bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78,
>>>> stdev=15521.30
>>>> lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82%
>>>> lat (msec) : 2=34.65%, 4=0.71%, 10=0.01%
>>>> cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
>>>> >=64=0.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>>>> >=64=0.0%
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>>
>>>> Run status group 0 (all jobs):
>>>> READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s,
>>>> maxb=137785KB/s, mint=38051msec, maxt=38051msec
>>>>
>>>> Disk stats (read/write):
>>>> vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972,
>>>> util=99.85%
>>>>
>>>>
>>>>
>>>> qemu : no iothread : glibc : iops=33395
>>>> -----------------------------------------
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=libaio, iodepth=32
>>>> fio-2.1.11
>>>> Starting 1 process
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0
>>>> iops] [eta 00m:00s]
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9
>>>> 18:27:18 2015
>>>> read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec
>>>> slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29
>>>> clat (usec): min=139, max=12635, avg=952.85, stdev=335.51
>>>> lat (usec): min=303, max=12638, avg=957.01, stdev=335.29
>>>> clat percentiles (usec):
>>>> | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
>>>> | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996],
>>>> | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528],
>>>> | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928],
>>>> | 99.99th=[ 4832]
>>>> bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64,
>>>> stdev=19121.91
>>>> lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98%
>>>> lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01%
>>>> cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
>>>> >=64=0.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>>>> >=64=0.0%
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>>
>>>> Run status group 0 (all jobs):
>>>> READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s,
>>>> maxb=133583KB/s, mint=39248msec, maxt=39248msec
>>>>
>>>> Disk stats (read/write):
>>>> vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536,
>>>> util=99.84%
>>>>
>>>>
>>>>
>>>> qemu : iothread : jemalloc : iops=28023
>>>> ----------------------------------------
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=libaio, iodepth=32
>>>> fio-2.1.11
>>>> Starting 1 process
>>>> Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0
>>>> iops] [eta 00m:01s]
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9
>>>> 18:30:26 2015
>>>> read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec
>>>> slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77
>>>> clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55
>>>> lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22
>>>> clat percentiles (usec):
>>>> | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820],
>>>> | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224],
>>>> | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688],
>>>> | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832],
>>>> | 99.99th=[ 3760]
>>>> bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27,
>>>> stdev=17381.70
>>>> lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33%
>>>> lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01%
>>>> cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
>>>> >=64=0.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>>>> >=64=0.0%
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>>
>>>> Run status group 0 (all jobs):
>>>> READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s,
>>>> maxb=112094KB/s, mint=46772msec, maxt=46772msec
>>>>
>>>> Disk stats (read/write):
>>>> vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376,
>>>> util=98.68%
>>>>
>>>>
>>>>
>>>> qemu : no-iothread : jemalloc : iops=42226
>>>> --------------------------------------------
>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=libaio, iodepth=32
>>>> fio-2.1.11
>>>> Starting 1 process
>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0
>>>> iops] [eta 00m:00s]
>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9
>>>> 18:34:11 2015
>>>> read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec
>>>> slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74
>>>> clat (usec): min=174, max=3841, avg=717.08, stdev=237.53
>>>> lat (usec): min=210, max=3844, avg=721.23, stdev=237.22
>>>> clat percentiles (usec):
>>>> | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516],
>>>> | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740],
>>>> | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128],
>>>> | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064],
>>>> | 99.99th=[ 2608]
>>>> bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14,
>>>> stdev=23440.79
>>>> lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46%
>>>> lat (msec) : 2=10.30%, 4=0.07%
>>>> cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
>>>> >=64=0.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>>>> >=64=0.0%
>>>> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>>
>>>> Run status group 0 (all jobs):
>>>> READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s,
>>>> maxb=177130KB/s, mint=29599msec, maxt=29599msec
>>>>
>>>> Disk stats (read/write):
>>>> vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636,
>>>> util=99.80%
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Robert LeBlanc" < robert@xxxxxxxxxxxxx >
>>>> To: "aderumier" < aderumier@xxxxxxxxx >
>>>> Cc: "Mark Nelson" < mnelson@xxxxxxxxxx >, "ceph-devel" <
>>>> ceph-devel@xxxxxxxxxxxxxxx >, "pushpesh sharma" <
>>>> pushpesh.eck@xxxxxxxxx >, "ceph-users" < ceph-users@xxxxxxxxxxxxxx >
>>>> Sent: Tuesday, June 9, 2015 18:00:29
>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>>>
>>>>
>>>> I also saw a similar performance increase by using alternative memory
>>>> allocators. What I found was that Ceph OSDs performed well with either
>>>> tcmalloc or jemalloc (except when RocksDB was built with jemalloc
>>>> instead of tcmalloc, I'm still working to dig into why that might be
>>>> the case).
>>>>
>>>> However, I found that tcmalloc with QEMU/KVM was very detrimental to
>>>> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
>>>> better for QEMU/KVM in the tests that we ran. [1]
>>>>
>>>> I'm currently looking into I/O bottlenecks around the 16KB range and
>>>> I'm seeing a lot of time in thread creation and destruction, the
>>>> memory allocators are quite a bit down the list (both fio with
>>>> ioengine rbd and on the OSDs). I wonder what the difference can be.
>>>> I've tried using the async messenger but there wasn't a huge
>>>> difference. [2]
>>>>
>>>> Further down the rabbit hole....
>>>>
>>>> [1]
>>>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg20197.html
>>>> [2]
>>>> https://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg23982.html
>>>> ----------------
>>>> Robert LeBlanc
>>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>>>
>>>>
>>>> On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER <
>>>> aderumier@xxxxxxxxx > wrote:
>>>>>>> Frankly, I'm a little impressed that without RBD cache we can hit
>>>>>>> 80K
>>>>>>> IOPS from 1 VM!
>>>>>
>>>>> Note that these results are not in a vm (fio-rbd on the host), so in a
>>>>> vm we'll have overhead.
>>>>> (I'm planning to send results in qemu soon)
>>>>>
>>>>>>> How fast are the SSDs in those 3 OSDs?
>>>>>
>>>>> These results are with data in the buffer memory of the osd nodes.
>>>>>
>>>>> When reading fully from ssd (intel s3500),
>>>>>
>>>>> For 1 client,
>>>>>
>>>>> I'm around 33k iops without cache and 32k iops with cache, with 1 osd.
>>>>> I'm around 55k iops without cache and 38k iops with cache, with 3 osd.
>>>>>
>>>>> with multiple client jobs, I can reach around 70k iops per osd, and
>>>>> 250k iops per osd when data is in the buffer.
>>>>>
>>>>> (server/client cpus are 2x 10-core 3.1GHz e5 xeons)
>>>>>
>>>>>
>>>>>
>>>>> small tip :
>>>>> I'm using tcmalloc for fio-rbd or rados bench to improve latencies
>>>>> by around 20%
>>>>>
>>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ...
>>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ...
>>>>>
>>>>> as a lot of time is spent in malloc/free
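>>>>>
>>>>> A minimal sketch of such a run (the pool, image and client names here are
>>>>> only examples):
>>>>>
>>>>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 \
>>>>> fio --name=rbd_iodepth32-test --ioengine=rbd --clientname=admin \
>>>>>     --pool=rbd --rbdname=fio-test --rw=randread --bs=4k \
>>>>>     --iodepth=32 --numjobs=1 --size=10G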
>>>>>
>>>>>
>>>>> (qemu has also supported tcmalloc for some months now; I'll bench it too:
>>>>> https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html )
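>>>>>
>>>>> (A sketch of the build-time switch, assuming a qemu tree where that
>>>>> tcmalloc option has been merged:)
>>>>>
>>>>> ./configure --target-list=x86_64-softmmu --enable-tcmalloc
>>>>> make -j"$(nproc)"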
>>>>>
>>>>>
>>>>>
>>>>> I'll try to send full bench results soon, from 1 to 18 ssd osd.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Mark Nelson" < mnelson@xxxxxxxxxx >
>>>>> To: "aderumier" < aderumier@xxxxxxxxx >, "pushpesh sharma" <
>>>>> pushpesh.eck@xxxxxxxxx >
>>>>> Cc: "ceph-devel" < ceph-devel@xxxxxxxxxxxxxxx >, "ceph-users" <
>>>>> ceph-users@xxxxxxxxxxxxxx >
>>>>> Sent: Tuesday, June 9, 2015 13:36:31
>>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>>>>
>>>>> Hi All,
>>>>>
>>>>> In the past we've hit some performance issues with RBD cache that
>>>>> we've
>>>>> fixed, but we've never really tried pushing a single VM beyond 40+K
>>>>> read
>>>>> IOPS in testing (or at least I never have). I suspect there's a couple
>>>>> of possibilities as to why it might be slower, but perhaps joshd can
>>>>> chime in as he's more familiar with what that code looks like.
>>>>>
>>>>> Frankly, I'm a little impressed that without RBD cache we can hit 80K
>>>>> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs?
>>>>>
>>>>> Mark
>>>>>
>>>>> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote:
>>>>>> It seems that the limit mainly shows up at high queue depth (roughly
>>>>>> >16).
>>>>>>
>>>>>> Here are the results in iops with 1 client - 4k randread - 3 osd - with
>>>>>> different queue depth sizes.
>>>>>> rbd_cache is almost the same as without cache with queue depth <16
>>>>>>
>>>>>>
>>>>>> cache
>>>>>> -----
>>>>>> qd1: 1651
>>>>>> qd2: 3482
>>>>>> qd4: 7958
>>>>>> qd8: 17912
>>>>>> qd16: 36020
>>>>>> qd32: 42765
>>>>>> qd64: 46169
>>>>>>
>>>>>> no cache
>>>>>> --------
>>>>>> qd1: 1748
>>>>>> qd2: 3570
>>>>>> qd4: 8356
>>>>>> qd8: 17732
>>>>>> qd16: 41396
>>>>>> qd32: 78633
>>>>>> qd64: 79063
>>>>>> qd128: 79550
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "aderumier" < aderumier@xxxxxxxxx >
>>>>>> To: "pushpesh sharma" < pushpesh.eck@xxxxxxxxx >
>>>>>> Cc: "ceph-devel" < ceph-devel@xxxxxxxxxxxxxxx >, "ceph-users" <
>>>>>> ceph-users@xxxxxxxxxxxxxx >
>>>>>> Sent: Tuesday, June 9, 2015 09:28:21
>>>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>>> We tried adding more RBDs to single VM, but no luck.
>>>>>>
>>>>>> If you want to scale with more disks in a single qemu vm, you need
>>>>>> to use the iothread feature from qemu and assign 1 iothread per disk
>>>>>> (works with virtio-blk).
>>>>>> It works for me; I can scale by adding more disks.
>>>>>>
>>>>>>
>>>>>> My benchmarks here are done with fio-rbd on the host.
>>>>>> I can scale up to 400k iops with 10 clients - rbd_cache=off on a
>>>>>> single host, and around 250k iops with 10 clients - rbd_cache=on.
>>>>>>
>>>>>>
>>>>>> I just wonder why I don't see the performance decrease around 30k
>>>>>> iops with 1 osd.
>>>>>>
>>>>>> I'm going to see if this tracker
>>>>>> http://tracker.ceph.com/issues/11056
>>>>>>
>>>>>> could be the cause.
>>>>>>
>>>>>> (My master build was done some weeks ago)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "pushpesh sharma" < pushpesh.eck@xxxxxxxxx >
>>>>>> To: "aderumier" < aderumier@xxxxxxxxx >
>>>>>> Cc: "ceph-devel" < ceph-devel@xxxxxxxxxxxxxxx >, "ceph-users" <
>>>>>> ceph-users@xxxxxxxxxxxxxx >
>>>>>> Sent: Tuesday, June 9, 2015 09:21:04
>>>>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>>>>>
>>>>>> Hi Alexandre,
>>>>>>
>>>>>> We have also seen something very similar on Hammer (0.94-1). We
>>>>>> were doing some benchmarking for VMs hosted on a hypervisor
>>>>>> (QEMU-KVM, openstack-juno). Each Ubuntu VM has one RBD as the root disk
>>>>>> and 1 RBD as additional storage. For some strange reason we were
>>>>>> not able to scale 4K RR iops on any VM beyond 35-40k. We tried
>>>>>> adding more RBDs to a single VM, but no luck. However, increasing the
>>>>>> number of VMs to 4 on a single hypervisor did scale to some
>>>>>> extent. Beyond that there was not much benefit from adding
>>>>>> more VMs.
>>>>>>
>>>>>> Here is the trend we have seen; the x-axis is the number of hypervisors,
>>>>>> each hypervisor has 4 VMs, each VM has 1 RBD:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> VDbench was used as the benchmarking tool. We were not saturating the
>>>>>> network or the CPUs at the OSD nodes. We were also not able to saturate
>>>>>> the CPUs at the hypervisors, which is where we suspected some
>>>>>> throttling effect. However, we haven't set any such limits from
>>>>>> the nova or kvm end. We tried some CPU pinning and other KVM-related
>>>>>> tuning as well, but no luck.
>>>>>>
>>>>>> We tried the same experiment on bare metal. There, 4K RR IOPs
>>>>>> scaled from 40K (1 RBD) to 180K (4 RBDs). But after that, rather
>>>>>> than scaling further, the numbers were actually degrading.
>>>>>> (Single pipe, more congestion effect.)
>>>>>>
>>>>>> We never suspected that enabling the rbd cache could be detrimental to
>>>>>> performance. It would be nice to root-cause the problem if that is
>>>>>> the case.
>>>>>>
>>>>>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER <
>>>>>> aderumier@xxxxxxxxx > wrote:
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm doing benchmarks (ceph master branch) with randread 4k qdepth=32,
>>>>>> and rbd_cache=true seems to limit the iops to around 40k.
>>>>>>
>>>>>>
>>>>>> no cache
>>>>>> --------
>>>>>> 1 client - rbd_cache=false - 1osd : 38300 iops
>>>>>> 1 client - rbd_cache=false - 2osd : 69073 iops
>>>>>> 1 client - rbd_cache=false - 3osd : 78292 iops
>>>>>>
>>>>>>
>>>>>> cache
>>>>>> -----
>>>>>> 1 client - rbd_cache=true - 1osd : 38100 iops
>>>>>> 1 client - rbd_cache=true - 2osd : 42457 iops
>>>>>> 1 client - rbd_cache=true - 3osd : 45823 iops
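>>>>>>
>>>>>> (For reference, the cache on/off toggling is done on the client side in
>>>>>> ceph.conf, roughly as in the sketch below; the path is the default one
>>>>>> and may differ on your setup:)
>>>>>>
>>>>>> cat >> /etc/ceph/ceph.conf <<'EOF'
>>>>>> [client]
>>>>>> # set to false for the "no cache" runs
>>>>>> rbd cache = true
>>>>>> EOF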
>>>>>>
>>>>>>
>>>>>>
>>>>>> Is this expected?
>>>>>>
>>>>>>
>>>>>>
>>>>>> fio result rbd_cache=false 3 osd
>>>>>> --------------------------------
>>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>>>> ioengine=rbd, iodepth=32
>>>>>> fio-2.1.11
>>>>>> Starting 1 process
>>>>>> rbd engine: RBD version: 0.1.9
>>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s]
>>>>>> [78.8K/0/0 iops] [eta 00m:00s]
>>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue
>>>>>> Jun 9 07:48:42 2015
>>>>>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec
>>>>>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77
>>>>>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82
>>>>>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49
>>>>>> clat percentiles (usec):
>>>>>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262],
>>>>>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346],
>>>>>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506],
>>>>>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948],
>>>>>> | 99.99th=[ 1176]
>>>>>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34,
>>>>>> stdev=25196.21
>>>>>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23%
>>>>>> lat (msec) : 2=0.03%, 4=0.01%
>>>>>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%,
>>>>>> >=64=0.0%
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>>>> >=64=0.0%
>>>>>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%,
>>>>>> >=64=0.0%
>>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
>>>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>>>>
>>>>>> Run status group 0 (all jobs):
>>>>>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s,
>>>>>> maxb=313169KB/s, mint=32698msec, maxt=32698msec
>>>>>>
>>>>>> Disk stats (read/write):
>>>>>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
>>>>>> aggrios=0/24, aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0,
>>>>>> aggrutil=0.00%
>>>>>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00%
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> fio result rbd_cache=true 3osd
>>>>>> ------------------------------
>>>>>>
>>>>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>>>> ioengine=rbd, iodepth=32
>>>>>> fio-2.1.11
>>>>>> Starting 1 process
>>>>>> rbd engine: RBD version: 0.1.9
>>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s]
>>>>>> [43.1K/0/0 iops] [eta 00m:00s]
>>>>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue
>>>>>> Jun 9 07:47:30 2015
>>>>>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec
>>>>>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84
>>>>>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73
>>>>>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03
>>>>>> clat percentiles (usec):
>>>>>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350],
>>>>>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506],
>>>>>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724],
>>>>>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288],
>>>>>> | 99.99th=[ 2192]
>>>>>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10,
>>>>>> stdev=15079.93
>>>>>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28%
>>>>>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01%
>>>>>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%,
>>>>>> >=64=0.0%
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>>>> >=64=0.0%
>>>>>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%,
>>>>>> >=64=0.0%
>>>>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
>>>>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>>>>
>>>>>> Run status group 0 (all jobs):
>>>>>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s,
>>>>>> maxb=183295KB/s, mint=55866msec, maxt=55866msec
>>>>>>
>>>>>> Disk stats (read/write):
>>>>>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%,
>>>>>> aggrios=0/29, aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8,
>>>>>> aggrutil=0.01%
>>>>>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01%
>>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best regards, Irek Fasikhov
>>>> Mob.: +79229045757
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> -Pushpesh
>>>
>>>
>>>
>>
>>
>>
Best regards, Irek Fasikhov
Mob.: +79229045757