[Single OSD performance on SSD] Can't go over 3,2K IOPS

We haven't tried Giant yet...

Thanks
Jian


-----Original Message-----
From: Sebastien Han [mailto:sebastien.han@xxxxxxxxxxxx] 
Sent: Tuesday, September 23, 2014 11:42 PM
To: Zhang, Jian
Cc: Alexandre DERUMIER; ceph-users at lists.ceph.com
Subject: Re: [Single OSD performance on SSD] Can't go over 3,2K IOPS

What about writes with Giant?

On 18 Sep 2014, at 08:12, Zhang, Jian <jian.zhang at intel.com> wrote:

> Has anyone ever tested multi-volume performance on a *FULL* SSD setup?
> We are able to get ~18K IOPS for 4K random read on a single volume with fio (with the rbd engine) on a 12x DC3700 setup, but only ~23K (peak) IOPS even with multiple volumes. 
> It seems the maximum random write performance we can get on the entire cluster is quite close to the single-volume performance. 
> 
> Thanks
> Jian
> 
> 
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf Of Sebastien Han
> Sent: Tuesday, September 16, 2014 9:33 PM
> To: Alexandre DERUMIER
> Cc: ceph-users at lists.ceph.com
> Subject: Re: [Single OSD performance on SSD] Can't go over 3,2K IOPS
> 
> Hi,
> 
> Thanks for keeping us updated on this subject.
> dsync is definitely killing the ssd.
> 
> I don't have much to add. I'm just surprised that you're only getting 5299 IOPS with 0.85, since I've been able to get 6,4K; then again, I was using the 200GB model, which might explain the difference.
> 
> 
> On 12 Sep 2014, at 16:32, Alexandre DERUMIER <aderumier at odiso.com> wrote:
> 
>> here the results for the intel s3500
>> ------------------------------------
>> max performance is with ceph 0.85 + optracker disabled.
>> the intel s3500 doesn't have the d_sync problem the crucial has
>> 
>> %util shows almost 100% for read and write, so maybe the ssd disk performance is the limit.
>> 
>> I have some stec zeusram 8GB in stock (I used them for zfs zil), I'll try to bench them next week.
>> 
>> 
>> 
>> 
>> 
>> 
>> INTEL s3500
>> -----------
>> raw disk
>> --------
>> 
>> randread: fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
>> bw=288207KB/s, iops=72051
>> 
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sdb               0,00     0,00 73454,00    0,00 293816,00     0,00     8,00    30,96    0,42    0,42    0,00   0,01  99,90
>> 
>> randwrite: fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio --sync=1
>> bw=48131KB/s, iops=12032
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sdb               0,00     0,00    0,00 24120,00     0,00 48240,00     4,00     2,08    0,09    0,00    0,09   0,04 100,00
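As a quick consistency check on the raw-disk figures above (a sketch; the numbers are copied from the run shown, not re-measured), fio's bandwidth and IOPS should agree through the block size, and iostat's avgrq-sz, reported in 512-byte sectors, should match the 4k block size:

```python
# Sanity-check the randread run: bandwidth / block size ~= IOPS.
bw_kb_s = 288207           # fio-reported bandwidth, KB/s
iops = 72051               # fio-reported IOPS
bs_kb = 4                  # fio block size, KB
assert abs(bw_kb_s / bs_kb - iops) < iops * 0.01

# iostat's avgrq-sz is in 512-byte sectors, so 8.00 sectors is the 4 KiB bs.
avgrq_sz_sectors = 8.00
print(avgrq_sz_sectors * 512)   # 4096.0 bytes per request
```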
>> 
>> 
>> ceph 0.80
>> ---------
>> randread: no tuning:  bw=24578KB/s, iops=6144
>> 
>> 
>> randwrite: bw=10358KB/s, iops=2589
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sdb               0,00   373,00    0,00 8878,00     0,00 34012,50     7,66     1,63    0,18    0,00    0,18   0,06  50,90
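One thing worth reading out of the 0.80 randwrite numbers above: with journal and data on the same disk, the device sees far more write bandwidth than the client gets. A rough calculation (figures copied from the run above, nothing re-measured) gives the on-disk write amplification:

```python
# Client-side vs device-side write bandwidth for the ceph 0.80 randwrite run.
client_kb_s = 10358.0      # fio-reported bandwidth (client view)
device_wkb_s = 34012.50    # iostat wkB/s (device view)
print(round(device_wkb_s / client_kb_s, 2))   # ~3.28x written to disk per client byte
```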
>> 
>> 
>> ceph 0.85 :
>> ---------
>> 
>> randread :  bw=41406KB/s, iops=10351
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sdb               2,00     0,00 10425,00    0,00 41816,00     0,00     8,02     1,36    0,13    0,13    0,00   0,07  75,90
>> 
>> randwrite : bw=17204KB/s, iops=4301
>> 
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sdb               0,00   333,00    0,00 9788,00     0,00 57909,00    11,83     1,46    0,15    0,00    0,15   0,07  67,80
>> 
>> 
>> ceph 0.85 tuning op_tracker=false
>> ----------------
>> 
>> randread :  bw=86537KB/s, iops=21634
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sdb              25,00     0,00 21428,00    0,00 86444,00     0,00     8,07     3,13    0,15    0,15    0,00   0,05  98,00
>> 
>> randwrite:  bw=21199KB/s, iops=5299
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sdb               0,00  1563,00    0,00 9880,00     0,00 75223,50    15,23     2,09    0,21    0,00    0,21   0,07  80,00
>> 
>> 
>> ----- Original Message -----
>> 
>> From: "Alexandre DERUMIER" <aderumier at odiso.com>
>> To: "Cedric Lemarchand" <cedric at yipikai.org>
>> Cc: ceph-users at lists.ceph.com
>> Sent: Friday, September 12, 2014 08:15:08
>> Subject: Re: [Single OSD performance on SSD] Can't go over 3,2K IOPS
>> 
>> results of fio on rbd with kernel patch
>> 
>> 
>> 
>> fio rbd crucial m550 1 osd 0.85 (osd_enable_op_tracker true or false, same result):
>> ---------------------------
>> bw=12327KB/s, iops=3081
>> 
>> So not much better than before, but this time iostat shows only 15% 
>> util, and latencies are lower
>> 
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sdb               0,00    29,00    0,00 3075,00     0,00 36748,50    23,90     0,29    0,10    0,00    0,10   0,05  15,20
>> 
>> 
>> So, the write bottleneck seems to be in ceph.
>> 
>> 
>> 
>> I will send the s3500 results today
>> 
>> ----- Original Message -----
>> 
>> From: "Alexandre DERUMIER" <aderumier at odiso.com>
>> To: "Cedric Lemarchand" <cedric at yipikai.org>
>> Cc: ceph-users at lists.ceph.com
>> Sent: Friday, September 12, 2014 07:58:05
>> Subject: Re: [Single OSD performance on SSD] Can't go over 3,2K IOPS
>> 
>>>> For crucial, I'll try to apply the patch from stefan priebe, to 
>>>> ignore flushes (as the crucial m550 has supercaps):
>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html
>> Here are the results with cache flush disabled:
>> 
>> crucial m550
>> ------------
>> #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=177575KB/s, iops=44393
>> 
>> 
>> ----- Original Message -----
>> 
>> From: "Alexandre DERUMIER" <aderumier at odiso.com>
>> To: "Cedric Lemarchand" <cedric at yipikai.org>
>> Cc: ceph-users at lists.ceph.com
>> Sent: Friday, September 12, 2014 04:55:21
>> Subject: Re: [Single OSD performance on SSD] Can't go over 3,2K IOPS
>> 
>> Hi,
>> it seems that the intel s3500 performs a lot better with o_dsync
>> 
>> crucial m550
>> ------------
>> #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=1249.9KB/s, iops=312
>> 
>> intel s3500
>> -----------
>> fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=41794KB/s, iops=10448
>> 
>> ok, so 30x faster.
>> 
>> 
>> 
>> For crucial, I have tried to apply the patch from stefan priebe, to 
>> ignore flushes (as the crucial m550 has supercaps):
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html
>> Coming from zfs, this sounds like "zfs_nocacheflush".
>> 
>> Now results:
>> 
>> crucial m550
>> ------------
>> #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=177575KB/s, iops=44393
>> 
>> 
>> ----- Original Message -----
>> 
>> From: "Cedric Lemarchand" <cedric at yipikai.org>
>> To: ceph-users at lists.ceph.com
>> Sent: Thursday, September 11, 2014 21:23:23
>> Subject: Re: [Single OSD performance on SSD] Can't go over 3,2K IOPS
>> 
>> 
>> On 11/09/2014 19:33, Cedric Lemarchand wrote:
>>> On 11/09/2014 08:20, Alexandre DERUMIER wrote:
>>>> Hi Sebastien,
>>>> 
>>>> here are my first results with the crucial m550 (I'll send results with the intel s3500 later):
>>>> 
>>>> - 3 nodes
>>>> - dell r620 without expander backplane
>>>> - sas controller : lsi LSI 9207 (no hardware raid or cache)
>>>> - 2 x E5-2603v2 1.8GHz (4cores)
>>>> - 32GB ram
>>>> - network : 2xgigabit link lacp + 2xgigabit lacp for cluster replication.
>>>> 
>>>> -os : debian wheezy, with kernel 3.10
>>>> 
>>>> os + ceph mon : 2x intel s3500 100gb linux soft raid
>>>> osd : crucial m550 (1TB).
>>>> 
>>>> 
>>>> 3 mons in the ceph cluster,
>>>> and 1 osd (journal and data on the same disk)
>>>> 
>>>> 
>>>> ceph.conf
>>>> ---------
>>>> debug_lockdep = 0/0
>>>> debug_context = 0/0
>>>> debug_crush = 0/0
>>>> debug_buffer = 0/0
>>>> debug_timer = 0/0
>>>> debug_filer = 0/0
>>>> debug_objecter = 0/0
>>>> debug_rados = 0/0
>>>> debug_rbd = 0/0
>>>> debug_journaler = 0/0
>>>> debug_objectcacher = 0/0
>>>> debug_client = 0/0
>>>> debug_osd = 0/0
>>>> debug_optracker = 0/0
>>>> debug_objclass = 0/0
>>>> debug_filestore = 0/0
>>>> debug_journal = 0/0
>>>> debug_ms = 0/0
>>>> debug_monc = 0/0
>>>> debug_tp = 0/0
>>>> debug_auth = 0/0
>>>> debug_finisher = 0/0
>>>> debug_heartbeatmap = 0/0
>>>> debug_perfcounter = 0/0
>>>> debug_asok = 0/0
>>>> debug_throttle = 0/0
>>>> debug_mon = 0/0
>>>> debug_paxos = 0/0
>>>> debug_rgw = 0/0
>>>> osd_op_threads = 5
>>>> filestore_op_threads = 4
>>>> 
>>>> ms_nocrc = true
>>>> cephx sign messages = false
>>>> cephx require signatures = false
>>>> 
>>>> ms_dispatch_throttle_bytes = 0
>>>> 
>>>> #0.85
>>>> throttler_perf_counter = false
>>>> filestore_fd_cache_size = 64
>>>> filestore_fd_cache_shards = 32
>>>> osd_op_num_threads_per_shard = 1
>>>> osd_op_num_shards = 25
>>>> osd_enable_op_tracker = true
>>>> 
>>>> 
>>>> 
>>>> Fio disk 4K benchmark
>>>> ------------------
>>>> rand read 4k : fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
>>>> bw=271755KB/s, iops=67938
>>>> 
>>>> rand write 4k : fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
>>>> bw=228293KB/s, iops=57073
>>>> 
>>>> 
>>>> 
>>>> fio osd benchmark (through librbd)
>>>> ----------------------------------
>>>> [global]
>>>> ioengine=rbd
>>>> clientname=admin
>>>> pool=test
>>>> rbdname=test
>>>> invalidate=0 # mandatory
>>>> rw=randwrite
>>>> rw=randread
>>>> bs=4k
>>>> direct=1
>>>> numjobs=4
>>>> group_reporting=1
>>>> 
>>>> [rbd_iodepth32]
>>>> iodepth=32
>>>> 
>>>> 
>>>> 
>>>> FIREFLY RESULTS
>>>> ----------------
>>>> fio randwrite : bw=5009.6KB/s, iops=1252
>>>> 
>>>> fio randread: bw=37820KB/s, iops=9455
>>>> 
>>>> 
>>>> 
>>>> O.85 RESULTS
>>>> ------------
>>>> 
>>>> fio randwrite : bw=11658KB/s, iops=2914
>>>> 
>>>> fio randread : bw=38642KB/s, iops=9660
>>>> 
>>>> 
>>>> 
>>>> 0.85 + osd_enable_op_tracker=false
>>>> -----------------------------------
>>>> fio randwrite : bw=11630KB/s, iops=2907 fio randread : 
>>>> bw=80606KB/s, iops=20151, (cpu 100% - GREAT !)
>>>> 
>>>> 
>>>> 
>>>> So, for read, it seems that osd_enable_op_tracker is the bottleneck.
>>>> 
>>>> 
>>>> Now for write, I really don't understand why it's so low.
>>>> 
>>>> 
>>>> I have done some iostat:
>>>> 
>>>> 
>>>> FIO directly on /dev/sdb
>>>> bw=228293KB/s, iops=57073
>>>> 
>>>> Device:         rrqm/s   wrqm/s     r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>> sdb               0,00     0,00    0,00 63613,00    0,00 254452,00     8,00    31,24    0,49    0,00    0,49   0,02 100,00
>>>> 
>>>> 
>>>> FIO directly on osd through librbd
>>>> bw=11658KB/s, iops=2914
>>>> 
>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>> sdb               0,00   355,00    0,00 5225,00     0,00 29678,00    11,36    57,63   11,03    0,00   11,03   0,19  99,70
>>>> 
>>>> 
>>>> (I don't understand what exactly %util measures: it's ~100% in both 
>>>> cases, even though ceph is 10x slower)
>>> It would be interesting if you could catch the size of writes on SSD 
>>> during the bench through librbd (I know nmon can do that)
>> Replying to myself ... I answered a bit quickly; we already have 
>> this information (29678 / 5225 = 5,68KB), but this is irrelevant.
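Cedric's figure can be read straight off any iostat sample: the average request size is simply wkB/s divided by w/s. A minimal sketch with the values from the librbd run:

```python
# Average write size hitting the SSD during the librbd bench.
wkb_s = 29678.0    # kilobytes written per second (iostat wkB/s)
w_s = 5225.0       # write requests per second (iostat w/s)
print(round(wkb_s / w_s, 2))   # 5.68 KB per request
```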
>> 
>> Cheers
>> 
>>>> It could be a dsync problem, the results seem pretty poor
>>>> 
>>>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=direct
>>>> 65536+0 records in
>>>> 65536+0 records out
>>>> 268435456 bytes (268 MB) copied, 2,77433 s, 96,8 MB/s
>>>> 
>>>> 
>>>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=dsync,direct
>>>> ^C17228+0 records in
>>>> 17228+0 records out
>>>> 70565888 bytes (71 MB) copied, 70,4098 s, 1,0 MB/s
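The dsync dd above implies a sync-write rate in the same ballpark as the fio result for the m550 (a back-of-the-envelope check using the numbers from the run, nothing re-measured):

```python
# Each 4 KiB dsync write waits for a flush, so throughput / block size
# gives the effective sync-write IOPS for the dd run above.
bytes_copied = 70565888
seconds = 70.4098
bs = 4096
print(round(bytes_copied / seconds / bs))   # ~245 sync IOPS, consistent with fio's 312
```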
>>>> 
>>>> 
>>>> 
>>>> I'll do tests with intel s3500 tomorrow to compare
>>>> 
>>>> ----- Original Message -----
>>>> 
>>>> From: "Sebastien Han" <sebastien.han at enovance.com>
>>>> To: "Warren Wang" <Warren_Wang at cable.comcast.com>
>>>> Cc: ceph-users at lists.ceph.com
>>>> Sent: Monday, September 8, 2014 22:58:25
>>>> Subject: Re: [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>>> 
>>>> They definitely are Warren!
>>>> 
>>>> Thanks for bringing this here :).
>>>> 
>>>> On 05 Sep 2014, at 23:02, Wang, Warren <Warren_Wang at cable.comcast.com> wrote:
>>>> 
>>>>> +1 to what Cedric said.
>>>>> 
>>>>> Anything more than a few minutes of heavy sustained writes tended to get our solid state devices into a state where garbage collection could not keep up. Originally we used small SSDs and did not overprovision the journals by much. Manufacturers publish their SSD stats, and then in very small font, state that the attained IOPS are with empty drives, and the tests are only run for very short amounts of time. Even if the drives are new, it's a good idea to perform an hdparm secure erase on them (so that the SSD knows that the blocks are truly unused), and then overprovision them. You'll know if you have a problem by watching for utilization and wait data on the journals.
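Warren's overprovisioning rule of thumb is easy to put in numbers (a sketch; the 50% figure is his, the 200GB drive size is just an assumption for illustration):

```python
# Capacity left visible after reserving spare area for garbage collection.
drive_gb = 200             # hypothetical drive size
overprovision = 0.50       # Warren's "as much as 50%"
print(drive_gb * (1 - overprovision))   # 100.0 GB exposed; the rest stays unwritten for GC
```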
>>>>> 
>>>>> One of the other interesting performance issues is that the Intel 10GbE NICs + default kernel that we typically use max out around 1 million packets/sec. It's worth tracking this metric to see if you are close.
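To see why that packet ceiling matters, a back-of-the-envelope estimate (both numbers are rough assumptions, not measurements):

```python
# If the NIC/kernel path tops out near 1M packets/s and a small replicated
# write costs on the order of 10 packets end to end, the network path, not
# the SSD, can become the IOPS ceiling.
pkt_ceiling = 1_000_000    # packets/s, as stated above
pkts_per_io = 10           # rough assumption for a replicated 4k write
print(pkt_ceiling // pkts_per_io)   # ~100000 IOPS ceiling per NIC
```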
>>>>> 
>>>>> I know these aren't necessarily relevant to the test parameters you gave below, but they're worth keeping in mind.
>>>>> 
>>>>> --
>>>>> Warren Wang
>>>>> Comcast Cloud (OpenStack)
>>>>> 
>>>>> 
>>>>> From: Cedric Lemarchand <cedric at yipikai.org>
>>>>> Date: Wednesday, September 3, 2014 at 5:14 PM
>>>>> To: "ceph-users at lists.ceph.com" <ceph-users at lists.ceph.com>
>>>>> Subject: Re: [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>>>> 
>>>>> 
>>>>> On 03/09/2014 22:11, Sebastien Han wrote:
>>>>>> Hi Warren,
>>>>>> 
>>>>>> What do you mean exactly by secure erase? At the firmware level, with vendor software?
>>>>>> The SSDs were pretty new, so I don't think we hit that sort of thing. I believe only aged SSDs show this behaviour, but I might be wrong.
>>>>>> 
>>>>> Sorry, I forgot to reply to the real question ;-) So yes, it only 
>>>>> comes into play after some time; in your case, if the SSD still delivers the write IOPS specified by the manufacturer, it won't help in any way.
>>>>> 
>>>>> But it seems this practice is increasingly common nowadays.
>>>>> 
>>>>> Cheers
>>>>>> On 02 Sep 2014, at 18:23, Wang, Warren <Warren_Wang at cable.comcast.com> wrote:
>>>>>> 
>>>>>> 
>>>>>>> Hi Sebastien,
>>>>>>> 
>>>>>>> Something I didn't see in the thread so far, did you secure erase the SSDs before they got used? I assume these were probably repurposed for this test. We have seen some pretty significant garbage collection issue on various SSD and other forms of solid state storage to the point where we are overprovisioning pretty much every solid state device now. By as much as 50% to handle sustained write operations. Especially important for the journals, as we've found.
>>>>>>> 
>>>>>>> Maybe not an issue on the short fio run below, but certainly evident on longer runs or lots of historical data on the drives. The max transaction time looks pretty good for your test. Something to consider though.
>>>>>>> 
>>>>>>> Warren
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf Of Sebastien Han
>>>>>>> Sent: Thursday, August 28, 2014 12:12 PM
>>>>>>> To: ceph-users
>>>>>>> Cc: Mark Nelson
>>>>>>> Subject: [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>>>>>> 
>>>>>>> Hey all,
>>>>>>> 
>>>>>>> It has been a while since the last performance-related thread on the ML :p I've been running some experiments to see how much I can get from an SSD on a Ceph cluster.
>>>>>>> To achieve that I did something pretty simple:
>>>>>>> 
>>>>>>> * Debian wheezy 7.6
>>>>>>> * kernel from debian 3.14-0.bpo.2-amd64
>>>>>>> * 1 cluster, 3 mons (i'd like to keep this realistic since in a 
>>>>>>> real deployment i'll use 3)
>>>>>>> * 1 OSD backed by an SSD (journal and osd data on the same
>>>>>>> device)
>>>>>>> * 1 replica count of 1
>>>>>>> * partitions are perfectly aligned
>>>>>>> * io scheduler is set to noop but deadline was showing the same 
>>>>>>> results
>>>>>>> * no updatedb running
>>>>>>> 
>>>>>>> About the box:
>>>>>>> 
>>>>>>> * 32GB of RAM
>>>>>>> * 12 cores with HT @ 2,4 GHz
>>>>>>> * WB cache is enabled on the controller
>>>>>>> * 10Gbps network (doesn't help here)
>>>>>>> 
>>>>>>> The SSD is a 200G Intel DC S3700 and is capable of delivering around 29K iops with random 4k writes (my fio results). As a benchmark tool I used fio with the rbd engine (thanks Deutsche Telekom guys!).
>>>>>>> 
>>>>>>> O_DIRECT and O_DSYNC don't seem to be a problem for the SSD:
>>>>>>> 
>>>>>>> # dd if=/dev/urandom of=rand.file bs=4k count=65536
>>>>>>> 65536+0 records in
>>>>>>> 65536+0 records out
>>>>>>> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
>>>>>>> 
>>>>>>> # du -sh rand.file
>>>>>>> 256M rand.file
>>>>>>> 
>>>>>>> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
>>>>>>> 65536+0 records in
>>>>>>> 65536+0 records out
>>>>>>> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
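The dd run above already implies a healthy sync-write rate on the S3700, which backs up the claim that O_DIRECT and O_DSYNC are not the problem here (numbers copied from the run, not re-measured):

```python
# 65536 dsync'd 4 KiB writes completed in 2.73628 s.
writes = 65536
seconds = 2.73628
print(round(writes / seconds))   # roughly 24K sync IOPS -- dsync is not the bottleneck
```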
>>>>>>> 
>>>>>>> See my ceph.conf:
>>>>>>> 
>>>>>>> [global]
>>>>>>> auth cluster required = cephx
>>>>>>> auth service required = cephx
>>>>>>> auth client required = cephx
>>>>>>> fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
>>>>>>> osd pool default pg num = 4096
>>>>>>> osd pool default pgp num = 4096
>>>>>>> osd pool default size = 2
>>>>>>> osd crush chooseleaf type = 0
>>>>>>> 
>>>>>>> debug lockdep = 0/0
>>>>>>> debug context = 0/0
>>>>>>> debug crush = 0/0
>>>>>>> debug buffer = 0/0
>>>>>>> debug timer = 0/0
>>>>>>> debug journaler = 0/0
>>>>>>> debug osd = 0/0
>>>>>>> debug optracker = 0/0
>>>>>>> debug objclass = 0/0
>>>>>>> debug filestore = 0/0
>>>>>>> debug journal = 0/0
>>>>>>> debug ms = 0/0
>>>>>>> debug monc = 0/0
>>>>>>> debug tp = 0/0
>>>>>>> debug auth = 0/0
>>>>>>> debug finisher = 0/0
>>>>>>> debug heartbeatmap = 0/0
>>>>>>> debug perfcounter = 0/0
>>>>>>> debug asok = 0/0
>>>>>>> debug throttle = 0/0
>>>>>>> 
>>>>>>> [mon]
>>>>>>> mon osd down out interval = 600
>>>>>>> mon osd min down reporters = 13
>>>>>>> [mon.ceph-01]
>>>>>>> host = ceph-01
>>>>>>> mon addr = 172.20.20.171
>>>>>>> [mon.ceph-02]
>>>>>>> host = ceph-02
>>>>>>> mon addr = 172.20.20.172
>>>>>>> [mon.ceph-03]
>>>>>>> host = ceph-03
>>>>>>> mon addr = 172.20.20.173
>>>>>>> 
>>>>>>> debug lockdep = 0/0
>>>>>>> debug context = 0/0
>>>>>>> debug crush = 0/0
>>>>>>> debug buffer = 0/0
>>>>>>> debug timer = 0/0
>>>>>>> debug journaler = 0/0
>>>>>>> debug osd = 0/0
>>>>>>> debug optracker = 0/0
>>>>>>> debug objclass = 0/0
>>>>>>> debug filestore = 0/0
>>>>>>> debug journal = 0/0
>>>>>>> debug ms = 0/0
>>>>>>> debug monc = 0/0
>>>>>>> debug tp = 0/0
>>>>>>> debug auth = 0/0
>>>>>>> debug finisher = 0/0
>>>>>>> debug heartbeatmap = 0/0
>>>>>>> debug perfcounter = 0/0
>>>>>>> debug asok = 0/0
>>>>>>> debug throttle = 0/0
>>>>>>> 
>>>>>>> [osd]
>>>>>>> osd mkfs type = xfs
>>>>>>> osd mkfs options xfs = -f -i size=2048
>>>>>>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>>>>>>> osd journal size = 20480
>>>>>>> cluster_network = 172.20.20.0/24
>>>>>>> public_network = 172.20.20.0/24
>>>>>>> osd mon heartbeat interval = 30
>>>>>>> # Performance tuning
>>>>>>> filestore merge threshold = 40
>>>>>>> filestore split multiple = 8
>>>>>>> osd op threads = 8
>>>>>>> # Recovery tuning
>>>>>>> osd recovery max active = 1
>>>>>>> osd max backfills = 1
>>>>>>> osd recovery op priority = 1
>>>>>>> 
>>>>>>> 
>>>>>>> debug lockdep = 0/0
>>>>>>> debug context = 0/0
>>>>>>> debug crush = 0/0
>>>>>>> debug buffer = 0/0
>>>>>>> debug timer = 0/0
>>>>>>> debug journaler = 0/0
>>>>>>> debug osd = 0/0
>>>>>>> debug optracker = 0/0
>>>>>>> debug objclass = 0/0
>>>>>>> debug filestore = 0/0
>>>>>>> debug journal = 0/0
>>>>>>> debug ms = 0/0
>>>>>>> debug monc = 0/0
>>>>>>> debug tp = 0/0
>>>>>>> debug auth = 0/0
>>>>>>> debug finisher = 0/0
>>>>>>> debug heartbeatmap = 0/0
>>>>>>> debug perfcounter = 0/0
>>>>>>> debug asok = 0/0
>>>>>>> debug throttle = 0/0
>>>>>>> 
>>>>>>> Disabling all debugging made me win 200/300 more IOPS.
>>>>>>> 
>>>>>>> See my fio template:
>>>>>>> 
>>>>>>> [global]
>>>>>>> #logging
>>>>>>> #write_iops_log=write_iops_log
>>>>>>> #write_bw_log=write_bw_log
>>>>>>> #write_lat_log=write_lat_lo
>>>>>>> 
>>>>>>> time_based
>>>>>>> runtime=60
>>>>>>> 
>>>>>>> ioengine=rbd
>>>>>>> clientname=admin
>>>>>>> pool=test
>>>>>>> rbdname=fio
>>>>>>> invalidate=0 # mandatory
>>>>>>> #rw=randwrite
>>>>>>> rw=write
>>>>>>> bs=4k
>>>>>>> #bs=32m
>>>>>>> size=5G
>>>>>>> group_reporting
>>>>>>> 
>>>>>>> [rbd_iodepth32]
>>>>>>> iodepth=32
>>>>>>> direct=1
>>>>>>> 
>>>>>>> See my fio output:
>>>>>>> 
>>>>>>> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>>>>>> fio-2.1.11-14-gb74e
>>>>>>> Starting 1 process
>>>>>>> rbd engine: RBD version: 0.1.8
>>>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] [0/3219/0 iops] [eta 00m:00s]
>>>>>>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 00:28:26 2014
>>>>>>>   write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec
>>>>>>>     slat (usec): min=42, max=1578, avg=66.50, stdev=16.96
>>>>>>>     clat (msec): min=1, max=28, avg= 9.85, stdev= 1.48
>>>>>>>      lat (msec): min=1, max=28, avg= 9.92, stdev= 1.47
>>>>>>>     clat percentiles (usec):
>>>>>>>      |  1.00th=[ 6368],  5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9152],
>>>>>>>      | 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], 60.00th=[10048],
>>>>>>>      | 70.00th=[10176], 80.00th=[10560], 90.00th=[10944], 95.00th=[11456],
>>>>>>>      | 99.00th=[13120], 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
>>>>>>>      | 99.99th=[28032]
>>>>>>>     bw (KB /s): min=11864, max=13808, per=100.00%, avg=12864.36, stdev=407.35
>>>>>>>     lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, 50=0.41%
>>>>>>>   cpu          : usr=19.15%, sys=4.69%, ctx=326309, majf=0, minf=426088
>>>>>>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, 32=66.1%, >=64=0.0%
>>>>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>>      complete  : 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
>>>>>>>      issued    : total=r=0/w=192862/d=0, short=r=0/w=0/d=0
>>>>>>>      latency   : target=0, window=0, percentile=100.00%, depth=32
>>>>>>> 
>>>>>>> Run status group 0 (all jobs):
>>>>>>>   WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, maxb=12855KB/s, mint=60010msec, maxt=60010msec
>>>>>>> 
>>>>>>> Disk stats (read/write):
>>>>>>>   dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%, aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, aggrutil=0.01%
>>>>>>>   sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01%
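Little's law ties the fio numbers above together: with a fixed queue depth, throughput is bounded by iodepth divided by mean completion latency, which lands almost exactly on the observed 3213 IOPS (values copied from the output above):

```python
# Throughput bound implied by queue depth and mean completion latency.
iodepth = 32
clat_avg_s = 9.85e-3       # mean clat from the fio output, in seconds
print(round(iodepth / clat_avg_s))   # ~3249 IOPS -- latency-bound, not bandwidth-bound
```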
>>>>>>> 
>>>>>>> I tried to tweak several parameters like:
>>>>>>> 
>>>>>>> filestore_wbthrottle_xfs_ios_start_flusher = 10000
>>>>>>> filestore_wbthrottle_xfs_ios_hard_limit = 10000
>>>>>>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
>>>>>>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
>>>>>>> filestore queue max ops = 2000
>>>>>>> 
>>>>>>> But I didn't see any improvement.
>>>>>>> 
>>>>>>> Then I tried other things:
>>>>>>> 
>>>>>>> * Increasing the io_depth up to 256 or 512 gave me between 50 to 100 more IOPS but it's not a realistic workload anymore and not that significant.
>>>>>>> * adding another SSD for the journal, still getting 3,2K IOPS
>>>>>>> * I tried with rbd bench and I also got 3K IOPS
>>>>>>> * I ran the test on a client machine and then locally on the 
>>>>>>> server, still getting 3,2K IOPS
>>>>>>> * put the journal in memory, still getting 3,2K IOPS
>>>>>>> * with 2 clients running the test in parallel I got a total of 
>>>>>>> 3,6K IOPS but I don't seem to be able to go over
>>>>>>> * I tried to add another OSD to that SSD, so I had 2 OSDs and 2 journals on 1 SSD, got 4,5K IOPS YAY!
>>>>>>> 
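Putting the scaling observation above into a ratio (numbers taken from the experiments listed; a sketch, not a new measurement):

```python
# Two OSD processes on one SSD buy only ~1.4x, which points at a per-OSD
# process ceiling rather than a device limit.
one_osd_iops = 3200
two_osd_iops = 4500
print(round(two_osd_iops / one_osd_iops, 2))   # 1.41
```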
>>>>>>> Given the results of the last time it seems that something is limiting the number of IOPS per OSD process.
>>>>>>> 
>>>>>>> Running the test on a client or locally didn't show any difference.
>>>>>>> So it looks to me that there is some contention within Ceph that might cause this.
>>>>>>> 
>>>>>>> I also ran perf and looked at the output, everything looks decent, but someone might want to have a look at it :).
>>>>>>> 
>>>>>>> We have been able to reproduce this on 3 distinct platforms with some deviations (because of the hardware) but the behaviour is the same.
>>>>>>> Any thoughts will be highly appreciated; only getting 3,2k out of a 29K IOPS SSD is a bit frustrating :).
>>>>>>> 
>>>>>>> Cheers.
>>>>>>> ----
>>>>>>> Sébastien Han
>>>>>>> Cloud Architect
>>>>>>> 
>>>>>>> "Always give 100%. Unless you're giving blood."
>>>>>>> 
>>>>>>> Phone: +33 (0)1 49 70 99 72
>>>>>>> Mail:
>>>>>>> sebastien.han at enovance.com
>>>>>>> 
>>>>>>> Address : 11 bis, rue Roquépine - 75008 Paris Web :
>>>>>>> www.enovance.com
>>>>>>> - Twitter : @enovance
>>>>>>> 
>>>>>>> 
>>>>>> Cheers.
>>>>>> ----
>>>>>> Sébastien Han
>>>>>> Cloud Architect
>>>>>> 
>>>>>> "Always give 100%. Unless you're giving blood."
>>>>>> 
>>>>>> Phone: +33 (0)1 49 70 99 72
>>>>>> Mail:
>>>>>> sebastien.han at enovance.com
>>>>>> 
>>>>>> Address : 11 bis, rue Roquépine - 75008 Paris Web :
>>>>>> www.enovance.com
>>>>>> - Twitter : @enovance
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> 
>>>>>> ceph-users at lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>> --
>>>>> Cédric
>>>>> 
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users at lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> Cheers.
>>>> ----
>>>> Sébastien Han
>>>> Cloud Architect
>>>> 
>>>> "Always give 100%. Unless you're giving blood."
>>>> 
>>>> Phone: +33 (0)1 49 70 99 72
>>>> Mail: sebastien.han at enovance.com
>>>> Address : 11 bis, rue Roquépine - 75008 Paris Web : 
>>>> www.enovance.com
>>>> - Twitter : @enovance
>>>> 
>>>> 
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users at lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> --
>> Cédric
>> 
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> Cheers.
> ----
> Sébastien Han
> Cloud Architect
> 
> "Always give 100%. Unless you're giving blood."
> 
> Phone: +33 (0)1 49 70 99 72
> Mail: sebastien.han at enovance.com
> Address : 11 bis, rue Roquépine - 75008 Paris Web : www.enovance.com - 
> Twitter : @enovance
> 


Cheers.
----
Sébastien Han
Cloud Architect 

"Always give 100%. Unless you're giving blood."

Phone: +33 (0)1 49 70 99 72
Mail: sebastien.han at enovance.com
Address : 11 bis, rue Roquépine - 75008 Paris Web : www.enovance.com - Twitter : @enovance 


