On Wed, 24 Sep 2014 20:49:21 +0200 (CEST) Alexandre DERUMIER wrote:

> >>What about writes with Giant?
>
> I'm around
> - 4k iops (4k random) with 1 osd (1 node - 1 osd)
> - 8k iops (4k random) with 2 osd (1 node - 2 osd)
> - 16K iops (4k random) with 4 osd (2 nodes - 2 osd by node)
> - 22K iops (4k random) with 6 osd (3 nodes - 2 osd by node)
>
> Seem to scale, but I'm cpu bound on node (8 cores E5-2603 v2 @ 1.80GHz
> 100% cpu for 2 osd)
>
You don't even need a full SSD cluster to see that Ceph has a lot of room
for improvement; see my "Slow IOPS on RBD compared to journal and backing
devices" thread in May.

As Dieter asked, what replication level is this? I guess 1.

Now at 3 nodes and 6 OSDs you're getting about the performance of a single
SSD. Food for thought.

Christian

> ----- Original Message -----
>
> From: "Sebastien Han" <sebastien.han at enovance.com>
> To: "Jian Zhang" <jian.zhang at intel.com>
> Cc: "Alexandre DERUMIER" <aderumier at odiso.com>,
> ceph-users at lists.ceph.com
> Sent: Tuesday, 23 September 2014 17:41:38
> Subject: Re: [Single OSD performance on SSD] Can't go over 3, 2K IOPS
>
> What about writes with Giant?
>
> On 18 Sep 2014, at 08:12, Zhang, Jian <jian.zhang at intel.com> wrote:
>
> > Has anyone ever tested multi-volume performance on a *FULL* SSD
> > setup? We are able to get ~18K IOPS for 4K random read on a single
> > volume with fio (with rbd engine) on a 12x DC3700 setup, but are only
> > able to get ~23K (peak) IOPS even with multiple volumes. It seems the
> > maximum random write performance we can get on the entire cluster is
> > quite close to single volume performance.
> > > > Thanks > > Jian > > > > > > -----Original Message----- > > From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf > > Of Sebastien Han Sent: Tuesday, September 16, 2014 9:33 PM > > To: Alexandre DERUMIER > > Cc: ceph-users at lists.ceph.com > > Subject: Re: [Single OSD performance on SSD] Can't go > > over 3, 2K IOPS > > > > Hi, > > > > Thanks for keeping us updated on this subject. > > dsync is definitely killing the ssd. > > > > I don't have much to add, I'm just surprised that you're only getting > > 5299 with 0.85 since I've been able to get 6,4K, well I was using the > > 200GB model, that might explain this. > > > > > > On 12 Sep 2014, at 16:32, Alexandre DERUMIER <aderumier at odiso.com> > > wrote: > > > >> here the results for the intel s3500 > >> ------------------------------------ > >> max performance is with ceph 0.85 + optracker disabled. > >> intel s3500 don't have d_sync problem like crucial > >> > >> %util show almost 100% for read and write, so maybe the ssd disk > >> performance is the limit. > >> > >> I have some stec zeusram 8GB in stock (I used them for zfs zil), I'll > >> try to bench them next week. 
> >> > >> > >> > >> > >> > >> > >> INTEL s3500 > >> ----------- > >> raw disk > >> -------- > >> > >> randread: fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k > >> --iodepth=32 --group_reporting --invalidate=0 --name=abc > >> --ioengine=aio bw=288207KB/s, iops=72051 > >> > >> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await > >> r_await w_await svctm %util sdb 0,00 0,00 73454,00 0,00 293816,00 > >> 0,00 8,00 30,96 0,42 0,42 0,00 0,01 99,90 > >> > >> randwrite: fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k > >> --iodepth=32 --group_reporting --invalidate=0 --name=abc > >> --ioengine=aio --sync=1 bw=48131KB/s, iops=12032 Device: rrqm/s > >> wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await > >> svctm %util sdb 0,00 0,00 0,00 24120,00 0,00 48240,00 4,00 2,08 0,09 > >> 0,00 0,09 0,04 100,00 > >> > >> > >> ceph 0.80 > >> --------- > >> randread: no tuning: bw=24578KB/s, iops=6144 > >> > >> > >> randwrite: bw=10358KB/s, iops=2589 > >> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await > >> r_await w_await svctm %util sdb 0,00 373,00 0,00 8878,00 0,00 > >> 34012,50 7,66 1,63 0,18 0,00 0,18 0,06 50,90 > >> > >> > >> ceph 0.85 : > >> --------- > >> > >> randread : bw=41406KB/s, iops=10351 > >> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await > >> r_await w_await svctm %util sdb 2,00 0,00 10425,00 0,00 41816,00 0,00 > >> 8,02 1,36 0,13 0,13 0,00 0,07 75,90 > >> > >> randwrite : bw=17204KB/s, iops=4301 > >> > >> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await > >> r_await w_await svctm %util sdb 0,00 333,00 0,00 9788,00 0,00 > >> 57909,00 11,83 1,46 0,15 0,00 0,15 0,07 67,80 > >> > >> > >> ceph 0.85 tuning op_tracker=false > >> ---------------- > >> > >> randread : bw=86537KB/s, iops=21634 > >> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await > >> r_await w_await svctm %util sdb 25,00 0,00 21428,00 0,00 86444,00 > >> 0,00 8,07 3,13 0,15 0,15 0,00 0,05 
98,00 > >> > >> randwrite: bw=21199KB/s, iops=5299 > >> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await > >> r_await w_await svctm %util sdb 0,00 1563,00 0,00 9880,00 0,00 > >> 75223,50 15,23 2,09 0,21 0,00 0,21 0,07 80,00 > >> > >> > >> ----- Mail original ----- > >> > >> De: "Alexandre DERUMIER" <aderumier at odiso.com> > >> ?: "Cedric Lemarchand" <cedric at yipikai.org> > >> Cc: ceph-users at lists.ceph.com > >> Envoy?: Vendredi 12 Septembre 2014 08:15:08 > >> Objet: Re: [Single OSD performance on SSD] Can't go over > >> 3, 2K IOPS > >> > >> results of fio on rbd with kernel patch > >> > >> > >> > >> fio rbd crucial m550 1 osd 0.85 (osd_enable_op_tracker true or false, > >> same result): --------------------------- > >> bw=12327KB/s, iops=3081 > >> > >> So no much better than before, but this time, iostat show only 15% > >> utils, and latencies are lower > >> > >> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await > >> r_await w_await svctm %util sdb 0,00 29,00 0,00 3075,00 0,00 36748,50 > >> 23,90 0,29 0,10 0,00 0,10 0,05 15,20 > >> > >> > >> So, the write bottleneck seem to be in ceph. 
> >>
> >> I will send s3500 result today
> >>
> >> ----- Original Message -----
> >>
> >> From: "Alexandre DERUMIER" <aderumier at odiso.com>
> >> To: "Cedric Lemarchand" <cedric at yipikai.org>
> >> Cc: ceph-users at lists.ceph.com
> >> Sent: Friday, 12 September 2014 07:58:05
> >> Subject: Re: [Single OSD performance on SSD] Can't go over 3, 2K IOPS
> >>
> >>>> For crucial, I'll try to apply the patch from stefan priebe, to
> >>>> ignore flushes (as crucial m550 have supercaps)
> >>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html
> >> Here are the results with cache flush disabled
> >>
> >> crucial m550
> >> ------------
> >> #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
> >> --group_reporting --invalidate=0 --name=ab --sync=1
> >> bw=177575KB/s, iops=44393
> >>
> >>
> >> ----- Original Message -----
> >>
> >> From: "Alexandre DERUMIER" <aderumier at odiso.com>
> >> To: "Cedric Lemarchand" <cedric at yipikai.org>
> >> Cc: ceph-users at lists.ceph.com
> >> Sent: Friday, 12 September 2014 04:55:21
> >> Subject: Re: [Single OSD performance on SSD] Can't go over 3, 2K IOPS
> >>
> >> Hi,
> >> it seems that the intel s3500 performs a lot better with o_dsync
> >>
> >> crucial m550
> >> ------------
> >> #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
> >> --group_reporting --invalidate=0 --name=ab --sync=1
> >> bw=1249.9KB/s, iops=312
> >>
> >> intel s3500
> >> -----------
> >> #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
> >> --group_reporting --invalidate=0 --name=ab --sync=1
> >> bw=41794KB/s, iops=10448
> >>
> >> ok, so 30x faster.
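The "30x" figure can be checked against the numbers quoted above. The following is only a rough sketch: the IOPS values are copied from the fio runs in this message, while the dictionary and function names are made up for illustration.

```python
# IOPS from the fio 4k write runs with --sync=1 quoted in this message.
# The names below are illustrative only, not part of any tool.
SYNC_WRITE_IOPS = {
    "crucial m550": 312,
    "intel s3500": 10448,
    "crucial m550 (flushes ignored)": 44393,
}


def speedup(iops: float, baseline_iops: float) -> float:
    """Return how many times faster `iops` is than `baseline_iops`."""
    return iops / baseline_iops


baseline = SYNC_WRITE_IOPS["crucial m550"]
for name, iops in SYNC_WRITE_IOPS.items():
    print(f"{name}: {iops} IOPS, {speedup(iops, baseline):.1f}x the stock m550")
```

By this arithmetic the s3500 is closer to 33x than 30x, and ignoring flushes on the m550 lifts it by roughly two orders of magnitude on the raw device.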
> >> > >> > >> > >> For crucial, I have try to apply the patch from stefan priebe, to > >> ignore flushes (as crucial m550 have supercaps) > >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/0357 > >> 07.html Coming from zfs, this sound like "zfs_nocacheflush" > >> > >> Now results: > >> > >> crucial m550 > >> ------------ > >> #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 > >> --group_reporting --invalidate=0 --name=ab --sync=1 bw=177575KB/s, > >> iops=44393 > >> > >> > >> > >> fio rbd crucial m550 1 osd 0.85 (osd_enable_op_tracker true or false, > >> same result): --------------------------- > >> bw=12327KB/s, iops=3081 > >> > >> So no much better than before, but this time, iostat show only 15% > >> utils, and latencies are lower > >> > >> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await > >> r_await w_await svctm %util sdb 0,00 29,00 0,00 3075,00 0,00 36748,50 > >> 23,90 0,29 0,10 0,00 0,10 0,05 15,20 > >> > >> > >> So, the write bottleneck seem to be in ceph. > >> > >> > >> > >> I will send s3500 result today > >> > >> ----- Mail original ----- > >> > >> De: "Cedric Lemarchand" <cedric at yipikai.org> > >> ?: ceph-users at lists.ceph.com > >> Envoy?: Jeudi 11 Septembre 2014 21:23:23 > >> Objet: Re: [Single OSD performance on SSD] Can't go over > >> 3, 2K IOPS > >> > >> > >> Le 11/09/2014 19:33, Cedric Lemarchand a ?crit : > >>> Le 11/09/2014 08:20, Alexandre DERUMIER a ?crit : > >>>> Hi Sebastien, > >>>> > >>>> here my first results with crucial m550 (I'll send result with > >>>> intel s3500 later): > >>>> > >>>> - 3 nodes > >>>> - dell r620 without expander backplane > >>>> - sas controller : lsi LSI 9207 (no hardware raid or cache) > >>>> - 2 x E5-2603v2 1.8GHz (4cores) > >>>> - 32GB ram > >>>> - network : 2xgigabit link lacp + 2xgigabit lacp for cluster > >>>> replication. 
> >>>> > >>>> -os : debian wheezy, with kernel 3.10 > >>>> > >>>> os + ceph mon : 2x intel s3500 100gb linux soft raid osd : crucial > >>>> m550 (1TB). > >>>> > >>>> > >>>> 3mon in the ceph cluster, > >>>> and 1 osd (journal and datas on same disk) > >>>> > >>>> > >>>> ceph.conf > >>>> --------- > >>>> debug_lockdep = 0/0 > >>>> debug_context = 0/0 > >>>> debug_crush = 0/0 > >>>> debug_buffer = 0/0 > >>>> debug_timer = 0/0 > >>>> debug_filer = 0/0 > >>>> debug_objecter = 0/0 > >>>> debug_rados = 0/0 > >>>> debug_rbd = 0/0 > >>>> debug_journaler = 0/0 > >>>> debug_objectcatcher = 0/0 > >>>> debug_client = 0/0 > >>>> debug_osd = 0/0 > >>>> debug_optracker = 0/0 > >>>> debug_objclass = 0/0 > >>>> debug_filestore = 0/0 > >>>> debug_journal = 0/0 > >>>> debug_ms = 0/0 > >>>> debug_monc = 0/0 > >>>> debug_tp = 0/0 > >>>> debug_auth = 0/0 > >>>> debug_finisher = 0/0 > >>>> debug_heartbeatmap = 0/0 > >>>> debug_perfcounter = 0/0 > >>>> debug_asok = 0/0 > >>>> debug_throttle = 0/0 > >>>> debug_mon = 0/0 > >>>> debug_paxos = 0/0 > >>>> debug_rgw = 0/0 > >>>> osd_op_threads = 5 > >>>> filestore_op_threads = 4 > >>>> > >>>> ms_nocrc = true > >>>> cephx sign messages = false > >>>> cephx require signatures = false > >>>> > >>>> ms_dispatch_throttle_bytes = 0 > >>>> > >>>> #0.85 > >>>> throttler_perf_counter = false > >>>> filestore_fd_cache_size = 64 > >>>> filestore_fd_cache_shards = 32 > >>>> osd_op_num_threads_per_shard = 1 > >>>> osd_op_num_shards = 25 > >>>> osd_enable_op_tracker = true > >>>> > >>>> > >>>> > >>>> Fio disk 4K benchmark > >>>> ------------------ > >>>> rand read 4k : fio --filename=/dev/sdb --direct=1 --rw=randread > >>>> --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc > >>>> --ioengine=aio bw=271755KB/s, iops=67938 > >>>> > >>>> rand write 4k : fio --filename=/dev/sdb --direct=1 --rw=randwrite > >>>> --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc > >>>> --ioengine=aio bw=228293KB/s, iops=57073 > >>>> > >>>> > >>>> > >>>> 
> >>>> fio osd benchmark (through librbd)
> >>>> ----------------------------------
> >>>> [global]
> >>>> ioengine=rbd
> >>>> clientname=admin
> >>>> pool=test
> >>>> rbdname=test
> >>>> invalidate=0 # mandatory
> >>>> rw=randwrite
> >>>> rw=randread
> >>>> bs=4k
> >>>> direct=1
> >>>> numjobs=4
> >>>> group_reporting=1
> >>>>
> >>>> [rbd_iodepth32]
> >>>> iodepth=32
> >>>>
> >>>>
> >>>> FIREFLY RESULTS
> >>>> ----------------
> >>>> fio randwrite : bw=5009.6KB/s, iops=1252
> >>>>
> >>>> fio randread: bw=37820KB/s, iops=9455
> >>>>
> >>>>
> >>>> 0.85 RESULTS
> >>>> ------------
> >>>>
> >>>> fio randwrite : bw=11658KB/s, iops=2914
> >>>>
> >>>> fio randread : bw=38642KB/s, iops=9660
> >>>>
> >>>>
> >>>> 0.85 + osd_enable_op_tracker=false
> >>>> -----------------------------------
> >>>> fio randwrite : bw=11630KB/s, iops=2907
> >>>> fio randread : bw=80606KB/s, iops=20151, (cpu 100% - GREAT !)
> >>>>
> >>>>
> >>>> So, for read, seem that osd_enable_op_tracker is the bottleneck.
> >>>>
> >>>>
> >>>> Now for write, I really don't understand why it's so low.
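A small sketch of the comparison behind that conclusion. The IOPS figures are the librbd fio results quoted above; the table layout and helper name are illustrative assumptions.

```python
# fio-through-librbd results on the crucial m550, copied from the message
# above. The structure and names are illustrative only.
RESULTS = {
    "firefly": {"randwrite": 1252, "randread": 9455},
    "0.85": {"randwrite": 2914, "randread": 9660},
    "0.85, op tracker off": {"randwrite": 2907, "randread": 20151},
}


def gain(after: int, before: int) -> float:
    """Ratio of IOPS after a change to IOPS before it."""
    return after / before


tracked = RESULTS["0.85"]
untracked = RESULTS["0.85, op tracker off"]
for workload in ("randread", "randwrite"):
    ratio = gain(untracked[workload], tracked[workload])
    print(f"{workload}: {ratio:.2f}x with osd_enable_op_tracker=false")
```

Reads roughly double once the op tracker is off, while writes stay flat, which suggests the write limit lies somewhere else.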
> >>>> > >>>> > >>>> I have done some iostat: > >>>> > >>>> > >>>> FIO directly on /dev/sdb > >>>> bw=228293KB/s, iops=57073 > >>>> > >>>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await > >>>> r_await w_await svctm %util sdb 0,00 0,00 0,00 63613,00 0,00 > >>>> 254452,00 8,00 31,24 0,49 0,00 0,49 0,02 100,00 > >>>> > >>>> > >>>> FIO directly on osd through librbd > >>>> bw=11658KB/s, iops=2914 > >>>> > >>>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await > >>>> r_await w_await svctm %util sdb 0,00 355,00 0,00 5225,00 0,00 > >>>> 29678,00 11,36 57,63 11,03 0,00 11,03 0,19 99,70 > >>>> > >>>> > >>>> (I don't understand what exactly is %util, 100% in the 2 cases, > >>>> because 10x slower with ceph) > >>> It would be interesting if you could catch the size of writes on SSD > >>> during the bench through librbd (I know nmon can do that) > >> Replying to myself ... I ask a bit quickly in the way we already have > >> this information (29678 / 5225 = 5,68Ko), but this is irrelevant. > >> > >> Cheers > >> > >>>> It could be a dsync problem, result seem pretty poor > >>>> > >>>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=direct > >>>> 65536+0 enregistrements lus > >>>> 65536+0 enregistrements ?crits > >>>> 268435456 octets (268 MB) copi?s, 2,77433 s, 96,8 MB/s > >>>> > >>>> > >>>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=dsync,direct > >>>> ^C17228+0 enregistrements lus > >>>> 17228+0 enregistrements ?crits > >>>> 70565888 octets (71 MB) copi?s, 70,4098 s, 1,0 MB/s > >>>> > >>>> > >>>> > >>>> I'll do tests with intel s3500 tomorrow to compare > >>>> > >>>> ----- Mail original ----- > >>>> > >>>> De: "Sebastien Han" <sebastien.han at enovance.com> > >>>> ?: "Warren Wang" <Warren_Wang at cable.comcast.com> > >>>> Cc: ceph-users at lists.ceph.com > >>>> Envoy?: Lundi 8 Septembre 2014 22:58:25 > >>>> Objet: Re: [Single OSD performance on SSD] Can't go > >>>> over 3, 2K IOPS > >>>> > >>>> They definitely are Warren! 
> >>>> > >>>> Thanks for bringing this here :). > >>>> > >>>> On 05 Sep 2014, at 23:02, Wang, Warren > >>>> <Warren_Wang at cable.comcast.com> wrote: > >>>> > >>>>> +1 to what Cedric said. > >>>>> > >>>>> Anything more than a few minutes of heavy sustained writes tended > >>>>> to get our solid state devices into a state where garbage > >>>>> collection could not keep up. Originally we used small SSDs and > >>>>> did not overprovision the journals by much. Manufacturers publish > >>>>> their SSD stats, and then in very small font, state that the > >>>>> attained IOPS are with empty drives, and the tests are only run > >>>>> for very short amounts of time. Even if the drives are new, it's a > >>>>> good idea to perform an hdparm secure erase on them (so that the > >>>>> SSD knows that the blocks are truly unused), and then > >>>>> overprovision them. You'll know if you have a problem by watching > >>>>> for utilization and wait data on the journals. > >>>>> > >>>>> One of the other interesting performance issues is that the Intel > >>>>> 10Gbe NICs + default kernel that we typically use max out around > >>>>> 1million packets/sec. It's worth tracking this metric to if you > >>>>> are close. > >>>>> > >>>>> I know these aren't necessarily relevant to the test parameters > >>>>> you gave below, but they're worth keeping in mind. > >>>>> > >>>>> -- > >>>>> Warren Wang > >>>>> Comcast Cloud (OpenStack) > >>>>> > >>>>> > >>>>> From: Cedric Lemarchand <cedric at yipikai.org> > >>>>> Date: Wednesday, September 3, 2014 at 5:14 PM > >>>>> To: "ceph-users at lists.ceph.com" <ceph-users at lists.ceph.com> > >>>>> Subject: Re: [Single OSD performance on SSD] Can't go > >>>>> over 3, 2K IOPS > >>>>> > >>>>> > >>>>> Le 03/09/2014 22:11, Sebastien Han a ?crit : > >>>>>> Hi Warren, > >>>>>> > >>>>>> What do mean exactly by secure erase? At the firmware level with > >>>>>> constructor softwares? SSDs were pretty new so I don't we hit > >>>>>> that sort of things. 
I believe that only aged SSDs have this > >>>>>> behaviour but I might be wrong. > >>>>>> > >>>>> Sorry I forgot to reply to the real question ;-) So yes it only > >>>>> plays after some times, for your case, if the SSD still delivers > >>>>> write IOPS specified by the manufacturer, it will doesn't help in > >>>>> any ways. > >>>>> > >>>>> But it seems this practice is nowadays increasingly used. > >>>>> > >>>>> Cheers > >>>>>> On 02 Sep 2014, at 18:23, Wang, Warren > >>>>>> <Warren_Wang at cable.comcast.com> > >>>>>> wrote: > >>>>>> > >>>>>> > >>>>>>> Hi Sebastien, > >>>>>>> > >>>>>>> Something I didn't see in the thread so far, did you secure > >>>>>>> erase the SSDs before they got used? I assume these were > >>>>>>> probably repurposed for this test. We have seen some pretty > >>>>>>> significant garbage collection issue on various SSD and other > >>>>>>> forms of solid state storage to the point where we are > >>>>>>> overprovisioning pretty much every solid state device now. By as > >>>>>>> much as 50% to handle sustained write operations. Especially > >>>>>>> important for the journals, as we've found. > >>>>>>> > >>>>>>> Maybe not an issue on the short fio run below, but certainly > >>>>>>> evident on longer runs or lots of historical data on the drives. > >>>>>>> The max transaction time looks pretty good for your test. > >>>>>>> Something to consider though. 
> >>>>>>>
> >>>>>>> Warren
> >>>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com]
> >>>>>>> On Behalf Of Sebastien Han
> >>>>>>> Sent: Thursday, August 28, 2014 12:12 PM
> >>>>>>> To: ceph-users
> >>>>>>> Cc: Mark Nelson
> >>>>>>> Subject: [Single OSD performance on SSD] Can't go over 3, 2K IOPS
> >>>>>>>
> >>>>>>> Hey all,
> >>>>>>>
> >>>>>>> It has been a while since the last performance-related thread on
> >>>>>>> the ML :p I've been running some experiments to see how much I
> >>>>>>> can get from an SSD on a Ceph cluster. To achieve that I did
> >>>>>>> something pretty simple:
> >>>>>>>
> >>>>>>> * Debian wheezy 7.6
> >>>>>>> * kernel from debian 3.14-0.bpo.2-amd64
> >>>>>>> * 1 cluster, 3 mons (i'd like to keep this realistic since in a
> >>>>>>> real deployment i'll use 3)
> >>>>>>> * 1 OSD backed by an SSD (journal and osd data on the same
> >>>>>>> device)
> >>>>>>> * replica count of 1
> >>>>>>> * partitions are perfectly aligned
> >>>>>>> * io scheduler is set to noop but deadline was showing the same
> >>>>>>> results
> >>>>>>> * no updatedb running
> >>>>>>>
> >>>>>>> About the box:
> >>>>>>>
> >>>>>>> * 32GB of RAM
> >>>>>>> * 12 cores with HT @ 2,4 GHz
> >>>>>>> * WB cache is enabled on the controller
> >>>>>>> * 10Gbps network (doesn't help here)
> >>>>>>>
> >>>>>>> The SSD is a 200G Intel DC S3700 and is capable of delivering
> >>>>>>> around 29K iops with random 4k writes (my fio results). As a
> >>>>>>> benchmark tool I used fio with the rbd engine (thanks deutsche
> >>>>>>> telekom guys!).
> >>>>>>>
> >>>>>>> O_DIRECT and O_DSYNC don't seem to be a problem for the SSD:
> >>>>>>>
> >>>>>>> # dd if=/dev/urandom of=rand.file bs=4k count=65536
> >>>>>>> 65536+0 records in
> >>>>>>> 65536+0 records out
> >>>>>>> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
> >>>>>>>
> >>>>>>> # du -sh rand.file
> >>>>>>> 256M rand.file
> >>>>>>>
> >>>>>>> # dd if=rand.file of=/dev/sdo bs=4k count=65536
> >>>>>>> oflag=dsync,direct
> >>>>>>> 65536+0 records in
> >>>>>>> 65536+0 records out
> >>>>>>> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
> >>>>>>>
> >>>>>>> See my ceph.conf:
> >>>>>>>
> >>>>>>> [global]
> >>>>>>> auth cluster required = cephx
> >>>>>>> auth service required = cephx
> >>>>>>> auth client required = cephx
> >>>>>>> fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
> >>>>>>> osd pool default pg num = 4096
> >>>>>>> osd pool default pgp num = 4096
> >>>>>>> osd pool default size = 2
> >>>>>>> osd crush chooseleaf type = 0
> >>>>>>>
> >>>>>>> debug lockdep = 0/0
> >>>>>>> debug context = 0/0
> >>>>>>> debug crush = 0/0
> >>>>>>> debug buffer = 0/0
> >>>>>>> debug timer = 0/0
> >>>>>>> debug journaler = 0/0
> >>>>>>> debug osd = 0/0
> >>>>>>> debug optracker = 0/0
> >>>>>>> debug objclass = 0/0
> >>>>>>> debug filestore = 0/0
> >>>>>>> debug journal = 0/0
> >>>>>>> debug ms = 0/0
> >>>>>>> debug monc = 0/0
> >>>>>>> debug tp = 0/0
> >>>>>>> debug auth = 0/0
> >>>>>>> debug finisher = 0/0
> >>>>>>> debug heartbeatmap = 0/0
> >>>>>>> debug perfcounter = 0/0
> >>>>>>> debug asok = 0/0
> >>>>>>> debug throttle = 0/0
> >>>>>>>
> >>>>>>> [mon]
> >>>>>>> mon osd down out interval = 600
> >>>>>>> mon osd min down reporters = 13
> >>>>>>> [mon.ceph-01]
> >>>>>>> host = ceph-01
> >>>>>>> mon addr = 172.20.20.171
> >>>>>>> [mon.ceph-02]
> >>>>>>> host = ceph-02
> >>>>>>> mon addr = 172.20.20.172
> >>>>>>> [mon.ceph-03]
> >>>>>>> host = ceph-03
> >>>>>>> mon addr = 172.20.20.173
> >>>>>>>
> >>>>>>> debug lockdep = 0/0
> >>>>>>> debug context = 0/0
debug crush = 0/0 > >>>>>>> debug buffer = 0/0 > >>>>>>> debug timer = 0/0 > >>>>>>> debug journaler = 0/0 > >>>>>>> debug osd = 0/0 > >>>>>>> debug optracker = 0/0 > >>>>>>> debug objclass = 0/0 > >>>>>>> debug filestore = 0/0 > >>>>>>> debug journal = 0/0 > >>>>>>> debug ms = 0/0 > >>>>>>> debug monc = 0/0 > >>>>>>> debug tp = 0/0 > >>>>>>> debug auth = 0/0 > >>>>>>> debug finisher = 0/0 > >>>>>>> debug heartbeatmap = 0/0 > >>>>>>> debug perfcounter = 0/0 > >>>>>>> debug asok = 0/0 > >>>>>>> debug throttle = 0/0 > >>>>>>> > >>>>>>> [osd] > >>>>>>> osd mkfs type = xfs > >>>>>>> osd mkfs options xfs = -f -i size=2048 osd mount options xfs = > >>>>>>> rw,noatime,logbsize=256k,delaylog osd journal size = 20480 > >>>>>>> cluster_network = 172.20.20.0/24 public_network = 172.20.20.0/24 > >>>>>>> osd mon heartbeat interval = 30 # Performance tuning filestore > >>>>>>> merge threshold = 40 filestore split multiple = 8 osd op threads > >>>>>>> = 8 # Recovery tuning osd recovery max active = 1 osd max > >>>>>>> backfills = 1 osd recovery op priority = 1 > >>>>>>> > >>>>>>> > >>>>>>> debug lockdep = 0/0 > >>>>>>> debug context = 0/0 > >>>>>>> debug crush = 0/0 > >>>>>>> debug buffer = 0/0 > >>>>>>> debug timer = 0/0 > >>>>>>> debug journaler = 0/0 > >>>>>>> debug osd = 0/0 > >>>>>>> debug optracker = 0/0 > >>>>>>> debug objclass = 0/0 > >>>>>>> debug filestore = 0/0 > >>>>>>> debug journal = 0/0 > >>>>>>> debug ms = 0/0 > >>>>>>> debug monc = 0/0 > >>>>>>> debug tp = 0/0 > >>>>>>> debug auth = 0/0 > >>>>>>> debug finisher = 0/0 > >>>>>>> debug heartbeatmap = 0/0 > >>>>>>> debug perfcounter = 0/0 > >>>>>>> debug asok = 0/0 > >>>>>>> debug throttle = 0/0 > >>>>>>> > >>>>>>> Disabling all debugging made me win 200/300 more IOPS. 
> >>>>>>> > >>>>>>> See my fio template: > >>>>>>> > >>>>>>> [global] > >>>>>>> #logging > >>>>>>> #write_iops_log=write_iops_log > >>>>>>> #write_bw_log=write_bw_log > >>>>>>> #write_lat_log=write_lat_lo > >>>>>>> > >>>>>>> time_based > >>>>>>> runtime=60 > >>>>>>> > >>>>>>> ioengine=rbd > >>>>>>> clientname=admin > >>>>>>> pool=test > >>>>>>> rbdname=fio > >>>>>>> invalidate=0 # mandatory > >>>>>>> #rw=randwrite > >>>>>>> rw=write > >>>>>>> bs=4k > >>>>>>> #bs=32m > >>>>>>> size=5G > >>>>>>> group_reporting > >>>>>>> > >>>>>>> [rbd_iodepth32] > >>>>>>> iodepth=32 > >>>>>>> direct=1 > >>>>>>> > >>>>>>> See my rio output: > >>>>>>> > >>>>>>> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, > >>>>>>> ioengine=rbd, iodepth=32 fio-2.1.11-14-gb74e Starting 1 process > >>>>>>> rbd engine: RBD version: 0.1.8 > >>>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] > >>>>>>> [0/3219/0 iops] [eta 00m:00s] > >>>>>>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug > >>>>>>> 28 00:28:26 2014 > >>>>>>> write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec > >>>>>>> slat (usec): min=42, max=1578, avg=66.50, stdev=16.96 clat > >>>>>>> (msec): min=1, max=28, avg= 9.85, stdev= 1.48 lat (msec): min=1, > >>>>>>> max=28, avg= 9.92, stdev= 1.47 clat percentiles (usec): > >>>>>>> | 1.00th=[ 6368], 5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ > >>>>>>> | 9152], 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], > >>>>>>> | 60.00th=[10048], 70.00th=[10176], 80.00th=[10560], > >>>>>>> | 90.00th=[10944], 95.00th=[11456], 99.00th=[13120], > >>>>>>> | 99.50th=[16768], 99.90th=[25984], 99.95th=[27008], > >>>>>>> | 99.99th=[28032] > >>>>>>> bw (KB /s): min=11864, max=13808, per=100.00%, avg=12864.36, > >>>>>>> stdev=407.35 lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, > >>>>>>> 20=39.24%, 50=0.41% cpu : usr=19.15%, sys=4.69%, ctx=326309, > >>>>>>> majf=0, minf=426088 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, > >>>>>>> 16=33.9%, 32=66.1%, >=64=0.0% submit : 
0=0.0%, 4=100.0%, 8=0.0%, > >>>>>>> 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=99.6%, > >>>>>>> 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0% issued : > >>>>>>> total=r=0/w=192862/d=0, short=r=0/w=0/d=0 latency : target=0, > >>>>>>> window=0, percentile=100.00%, depth=32 > >>>>>>> > >>>>>>> Run status group 0 (all jobs): > >>>>>>> WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, > >>>>>>> maxb=12855KB/s, mint=60010msec, maxt=60010msec > >>>>>>> > >>>>>>> Disk stats (read/write): > >>>>>>> dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%, > >>>>>>> aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, > >>>>>>> aggrutil=0.01% > >>>>>>> sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01% > >>>>>>> > >>>>>>> I tried to tweak several parameters like: > >>>>>>> > >>>>>>> filestore_wbthrottle_xfs_ios_start_flusher = 10000 > >>>>>>> filestore_wbthrottle_xfs_ios_hard_limit = 10000 > >>>>>>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000 > >>>>>>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000 filestore > >>>>>>> queue max ops = 2000 > >>>>>>> > >>>>>>> But didn't any improvement. > >>>>>>> > >>>>>>> Then I tried other things: > >>>>>>> > >>>>>>> * Increasing the io_depth up to 256 or 512 gave me between 50 to > >>>>>>> 100 more IOPS but it's not a realistic workload anymore and not > >>>>>>> that significant. > >>>>>>> * adding another SSD for the journal, still getting 3,2K IOPS > >>>>>>> * I tried with rbd bench and I also got 3K IOPS > >>>>>>> * I ran the test on a client machine and then locally on the > >>>>>>> server, still getting 3,2K IOPS > >>>>>>> * put the journal in memory, still getting 3,2K IOPS > >>>>>>> * with 2 clients running the test in parallel I got a total of > >>>>>>> 3,6K IOPS but I don't seem to be able to go over > >>>>>>> * I tried is to add another OSD to that SSD, so I had 2 OSD and > >>>>>>> 2 journals on 1 SSD, got 4,5K IOPS YAY! 
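To put the figures from this list in proportion, here is a minimal sketch that expresses each result as a share of what the bare SSD delivered. The 29K raw figure and the per-test IOPS come from this message; the names in the code are illustrative assumptions.

```python
# Raw 4k random-write capability of the DC S3700 as reported in this
# message (the author's own fio run); treated here as the reference point.
RAW_SSD_IOPS = 29_000

# Results quoted in the list above; labels are illustrative only.
RESULTS = {
    "1 OSD, 1 client": 3_200,
    "1 OSD, 2 clients": 3_600,
    "2 OSDs on the same SSD": 4_500,
}


def utilisation(iops: float, raw_iops: float = RAW_SSD_IOPS) -> float:
    """Fraction of the raw device IOPS actually achieved through Ceph."""
    return iops / raw_iops


for label, iops in RESULTS.items():
    print(f"{label}: {iops} IOPS = {utilisation(iops):.1%} of the raw device")
```

Even the best case stays well under a fifth of the device's measured capability, and adding a second OSD process on the same SSD helps more than adding a second client.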
> >>>>>>>
> >>>>>>> Given the results of the last test it seems that something is
> >>>>>>> limiting the number of IOPS per OSD process.
> >>>>>>>
> >>>>>>> Running the test on a client or locally didn't show any
> >>>>>>> difference. So it looks to me that there is some contention
> >>>>>>> within Ceph that might cause this.
> >>>>>>>
> >>>>>>> I also ran perf and looked at the output, everything looks
> >>>>>>> decent, but someone might want to have a look at it :).
> >>>>>>>
> >>>>>>> We have been able to reproduce this on 3 distinct platforms with
> >>>>>>> some deviations (because of the hardware) but the behaviour is
> >>>>>>> the same. Any thoughts will be highly appreciated, only getting
> >>>>>>> 3,2k out of a 29K IOPS SSD is a bit frustrating :).
> >>>>>>>
> >>>>>>> Cheers.
> >>>>>>> ----
> >>>>>>> Sébastien Han
> >>>>>>> Cloud Architect
> >>>>>>>
> >>>>>>> "Always give 100%. Unless you're giving blood."
> >>>>>>>
> >>>>>>> Phone: +33 (0)1 49 70 99 72
> >>>>>>> Mail: sebastien.han at enovance.com
> >>>>>>> Address : 11 bis, rue Roquépine - 75008 Paris
> >>>>>>> Web : www.enovance.com - Twitter : @enovance
> >>>>>>>
> >>>>>> _______________________________________________
> >>>>>> ceph-users mailing list
> >>>>>> ceph-users at lists.ceph.com
> >>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>> --
> >>>>> Cédric
-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Fusion Communications
http://www.gol.com/