That's definitely interesting. Is this meant to be released in a Firefly dot release, or will these changes land in Giant?
--
David Moreau Simard

On 2014-08-28, 1:49 PM, "Somnath Roy" <Somnath.Roy at sandisk.com> wrote:

>Yes, Mark, all of my changes are in ceph main now and we are getting
>significant RR performance improvement with that.
>
>Thanks & Regards
>Somnath
>
>-----Original Message-----
>From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf Of
>Mark Nelson
>Sent: Thursday, August 28, 2014 10:43 AM
>To: ceph-users at lists.ceph.com
>Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over
>3,2K IOPS
>
>On 08/28/2014 12:39 PM, Somnath Roy wrote:
>> Hi Sebastian,
>> If you are trying with the latest Ceph master, there are some changes
>> we made that will increase your read performance from SSD by a factor
>> of ~5X if the IOs are hitting the disks. If the reads are served from
>> memory, the improvement is even greater. With an increasing number of
>> clients the single OSD will eventually become CPU bound, in both the
>> reading-from-disk and reading-from-memory scenarios. Some new config
>> options were introduced; here they are:
>>
>> osd_op_num_threads_per_shard
>> osd_op_num_shards
>> throttler_perf_counter
>> osd_enable_op_tracker
>> filestore_fd_cache_size
>> filestore_fd_cache_shards
>>
>> The worker pool for the IO path is now sharded, and the above options
>> control this. osd_op_threads is no longer in the IO path. Also, the
>> filestore FD cache is now sharded.
>> In my setup (64GB RAM, 40-core CPU with HT enabled) the following
>> config file on a single OSD gives the optimum result for 4K random reads.
>>
>> [global]
>>
>> filestore_xattr_use_omap = true
>>
>> debug_lockdep = 0/0
>> debug_context = 0/0
>> debug_crush = 0/0
>> debug_buffer = 0/0
>> debug_timer = 0/0
>> debug_filer = 0/0
>> debug_objecter = 0/0
>> debug_rados = 0/0
>> debug_rbd = 0/0
>> debug_journaler = 0/0
>> debug_objectcacher = 0/0
>> debug_client = 0/0
>> debug_osd = 0/0
>> debug_optracker = 0/0
>> debug_objclass = 0/0
>> debug_filestore = 0/0
>> debug_journal = 0/0
>> debug_ms = 0/0
>> debug_monc = 0/0
>> debug_tp = 0/0
>> debug_auth = 0/0
>> debug_finisher = 0/0
>> debug_heartbeatmap = 0/0
>> debug_perfcounter = 0/0
>> debug_asok = 0/0
>> debug_throttle = 0/0
>> debug_mon = 0/0
>> debug_paxos = 0/0
>> debug_rgw = 0/0
>> osd_op_threads = 5
>> osd_op_num_threads_per_shard = 1
>> osd_op_num_shards = 25
>> #osd_op_num_sharded_pool_threads = 25
>> filestore_op_threads = 4
>>
>> ms_nocrc = true
>> filestore_fd_cache_size = 64
>> filestore_fd_cache_shards = 32
>> cephx sign messages = false
>> cephx require signatures = false
>>
>> ms_dispatch_throttle_bytes = 0
>> throttler_perf_counter = false
>>
>> [osd]
>> osd_client_message_size_cap = 0
>> osd_client_message_cap = 0
>> osd_enable_op_tracker = false
>>
>> What I saw is that the optracker is one of the major bottlenecks, and
>> we are in the process of optimizing it. For now, code to enable/disable
>> the optracker has been introduced. Also, several bottlenecks at the
>> filestore level have been removed.
>> Unfortunately, we have yet to optimize the write path. All of this
>> should help the write path as well, but the write-path improvement will
>> not be visible until all the lock serialization is removed.
>
>This is what I'm waiting for. :) I've been meaning to ask you, Somnath:
>how goes progress?
>
>Mark
>
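A quick way to confirm that a running OSD has actually picked up the sharding options Somnath lists above is to query its admin socket. A minimal sketch, assuming osd.0 and the default socket path under /var/run/ceph:

# ceph daemon osd.0 config get osd_op_num_shards
# ceph daemon osd.0 config get osd_op_num_threads_per_shard
# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'osd_op_num|fd_cache'

If these options come back as unrecognized, the ceph-osd binary most likely predates the sharded work queue changes and the settings will simply be ignored.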
>> Thanks & Regards
>> Somnath
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf
>> Of Sebastien Han
>> Sent: Thursday, August 28, 2014 9:12 AM
>> To: ceph-users
>> Cc: Mark Nelson
>> Subject: [ceph-users] [Single OSD performance on SSD] Can't go over
>> 3,2K IOPS
>>
>> Hey all,
>>
>> It has been a while since the last performance-related thread on the ML :p
>> I've been running some experiments to see how much I can get from an SSD
>> on a Ceph cluster.
>> To achieve that I did something pretty simple:
>>
>> * Debian Wheezy 7.6
>> * kernel from Debian backports, 3.14-0.bpo.2-amd64
>> * 1 cluster, 3 mons (I'd like to keep this realistic since in a real
>>   deployment I'll use 3)
>> * 1 OSD backed by an SSD (journal and OSD data on the same device)
>> * replica count of 1
>> * partitions are perfectly aligned
>> * IO scheduler is set to noop, but deadline showed the same results
>> * no updatedb running
>>
>> About the box:
>>
>> * 32GB of RAM
>> * 12 cores with HT @ 2.4 GHz
>> * WB cache is enabled on the controller
>> * 10Gbps network (doesn't help here)
>>
>> The SSD is a 200GB Intel DC S3700 and is capable of delivering around
>> 29K IOPS with random 4K writes (my fio results). As a benchmark tool I
>> used fio with the rbd engine (thanks Deutsche Telekom guys!).
>>
>> O_DIRECT and D_SYNC don't seem to be a problem for the SSD:
>>
>> # dd if=/dev/urandom of=rand.file bs=4k count=65536
>> 65536+0 records in
>> 65536+0 records out
>> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
>>
>> # du -sh rand.file
>> 256M rand.file
>>
>> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
>> 65536+0 records in
>> 65536+0 records out
>> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
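The dd run above exercises the journal-style access pattern (O_DIRECT plus a sync flag) but only reports bandwidth. fio can run essentially the same test and report IOPS and latency directly; a minimal sketch, reusing /dev/sdo from the dd example (this writes to the raw device, so it destroys any data on it):

# fio --name=ssd-sync-write --filename=/dev/sdo \
      --direct=1 --sync=1 --rw=write --bs=4k \
      --iodepth=1 --numjobs=1 --time_based --runtime=60

Note that fio's sync=1 opens the device with O_SYNC rather than O_DSYNC, which is slightly stricter than dd's oflag=dsync but close enough for a sanity check of whether the drive copes with synchronous journal writes.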
>> See my ceph.conf:
>>
>> [global]
>> auth cluster required = cephx
>> auth service required = cephx
>> auth client required = cephx
>> fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
>> osd pool default pg num = 4096
>> osd pool default pgp num = 4096
>> osd pool default size = 2
>> osd crush chooseleaf type = 0
>>
>> debug lockdep = 0/0
>> debug context = 0/0
>> debug crush = 0/0
>> debug buffer = 0/0
>> debug timer = 0/0
>> debug journaler = 0/0
>> debug osd = 0/0
>> debug optracker = 0/0
>> debug objclass = 0/0
>> debug filestore = 0/0
>> debug journal = 0/0
>> debug ms = 0/0
>> debug monc = 0/0
>> debug tp = 0/0
>> debug auth = 0/0
>> debug finisher = 0/0
>> debug heartbeatmap = 0/0
>> debug perfcounter = 0/0
>> debug asok = 0/0
>> debug throttle = 0/0
>>
>> [mon]
>> mon osd down out interval = 600
>> mon osd min down reporters = 13
>> [mon.ceph-01]
>> host = ceph-01
>> mon addr = 172.20.20.171
>> [mon.ceph-02]
>> host = ceph-02
>> mon addr = 172.20.20.172
>> [mon.ceph-03]
>> host = ceph-03
>> mon addr = 172.20.20.173
>>
>> debug lockdep = 0/0
>> debug context = 0/0
>> debug crush = 0/0
>> debug buffer = 0/0
>> debug timer = 0/0
>> debug journaler = 0/0
>> debug osd = 0/0
>> debug optracker = 0/0
>> debug objclass = 0/0
>> debug filestore = 0/0
>> debug journal = 0/0
>> debug ms = 0/0
>> debug monc = 0/0
>> debug tp = 0/0
>> debug auth = 0/0
>> debug finisher = 0/0
>> debug heartbeatmap = 0/0
>> debug perfcounter = 0/0
>> debug asok = 0/0
>> debug throttle = 0/0
>>
>> [osd]
>> osd mkfs type = xfs
>> osd mkfs options xfs = -f -i size=2048
>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>> osd journal size = 20480
>> cluster_network = 172.20.20.0/24
>> public_network = 172.20.20.0/24
>> osd mon heartbeat interval = 30
>> # Performance tuning
>> filestore merge threshold = 40
>> filestore split multiple = 8
>> osd op threads = 8
>> # Recovery tuning
>> osd recovery max active = 1
>> osd max backfills = 1
>> osd recovery op priority = 1
>>
>> debug lockdep = 0/0
>> debug context = 0/0
>> debug crush = 0/0
>> debug buffer = 0/0
>> debug timer = 0/0
>> debug journaler = 0/0
>> debug osd = 0/0
>> debug optracker = 0/0
>> debug objclass = 0/0
>> debug filestore = 0/0
>> debug journal = 0/0
>> debug ms = 0/0
>> debug monc = 0/0
>> debug tp = 0/0
>> debug auth = 0/0
>> debug finisher = 0/0
>> debug heartbeatmap = 0/0
>> debug perfcounter = 0/0
>> debug asok = 0/0
>> debug throttle = 0/0
>>
>> Disabling all debugging gained me 200 to 300 more IOPS.
>>
>> See my fio template:
>>
>> [global]
>> #logging
>> #write_iops_log=write_iops_log
>> #write_bw_log=write_bw_log
>> #write_lat_log=write_lat_lo
>>
>> time_based
>> runtime=60
>>
>> ioengine=rbd
>> clientname=admin
>> pool=test
>> rbdname=fio
>> invalidate=0 # mandatory
>> #rw=randwrite
>> rw=write
>> bs=4k
>> #bs=32m
>> size=5G
>> group_reporting
>>
>> [rbd_iodepth32]
>> iodepth=32
>> direct=1
>>
>> See my fio output:
>>
>> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>> fio-2.1.11-14-gb74e
>> Starting 1 process
>> rbd engine: RBD version: 0.1.8
>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] [0/3219/0 iops] [eta 00m:00s]
>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 00:28:26 2014
>>   write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec
>>     slat (usec): min=42, max=1578, avg=66.50, stdev=16.96
>>     clat (msec): min=1, max=28, avg= 9.85, stdev= 1.48
>>      lat (msec): min=1, max=28, avg= 9.92, stdev= 1.47
>>     clat percentiles (usec):
>>      |  1.00th=[ 6368],  5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9152],
>>      | 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], 60.00th=[10048],
>>      | 70.00th=[10176], 80.00th=[10560], 90.00th=[10944], 95.00th=[11456],
>>      | 99.00th=[13120], 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
>>      | 99.99th=[28032]
>>     bw (KB /s): min=11864, max=13808, per=100.00%, avg=12864.36, stdev=407.35
>>     lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, 50=0.41%
>>   cpu          : usr=19.15%, sys=4.69%, ctx=326309, majf=0, minf=426088
>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, 32=66.1%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
>>      issued    : total=r=0/w=192862/d=0, short=r=0/w=0/d=0
>>      latency   : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>>   WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, maxb=12855KB/s, mint=60010msec, maxt=60010msec
>>
>> Disk stats (read/write):
>>     dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%, aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, aggrutil=0.01%
>>   sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01%
>>
>> I tried to tweak several parameters like:
>>
>> filestore_wbthrottle_xfs_ios_start_flusher = 10000
>> filestore_wbthrottle_xfs_ios_hard_limit = 10000
>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
>> filestore queue max ops = 2000
>>
>> But I didn't see any improvement.
>>
>> Then I tried other things:
>>
>> * Increasing the io_depth up to 256 or 512 gave me between 50 and 100
>>   more IOPS, but that's not a realistic workload anymore and not that
>>   significant.
>> * Adding another SSD for the journal: still getting 3,2K IOPS.
>> * I tried with rbd bench and I also got 3K IOPS (see the sketch after
>>   this list).
>> * I ran the test on a client machine and then locally on the server:
>>   still getting 3,2K IOPS.
>> * Putting the journal in memory: still getting 3,2K IOPS.
>> * With 2 clients running the test in parallel I got a total of 3,6K IOPS,
>>   but I don't seem to be able to go over that.
>> * I tried adding another OSD to that SSD, so I had 2 OSDs and 2 journals
>>   on 1 SSD: got 4,5K IOPS, YAY!
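For the rbd bench data point in the list above, the built-in RBD benchmark can be pointed at the same pool and image the fio template uses. A rough sketch, assuming the "test" pool and "fio" image from the template and an rbd CLI recent enough to support these options; io-total is given in bytes (256 MB here):

# rbd -p test bench-write fio --io-size 4096 --io-threads 16 --io-total 268435456

If this also plateaus around 3K IOPS while fio against the raw SSD reaches ~29K, that points at a per-OSD bottleneck rather than at the drive itself, consistent with the observation that a second OSD on the same SSD raises the total to 4,5K IOPS.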
>>
>> Given that last result, it seems that something is limiting the number
>> of IOPS per OSD process.
>>
>> Running the test on a client or locally didn't show any difference, so
>> it looks to me like there is some contention within Ceph that might
>> cause this.
>>
>> I also ran perf and looked at the output; everything looks decent, but
>> someone might want to have a look at it :).
>>
>> We have been able to reproduce this on 3 distinct platforms with some
>> deviations (because of the hardware), but the behaviour is the same.
>> Any thoughts will be highly appreciated; only getting 3,2K out of a
>> 29K IOPS SSD is a bit frustrating :).
>>
>> Cheers.
>> --
>> Sébastien Han
>> Cloud Architect
>>
>> "Always give 100%. Unless you're giving blood."
>>
>> Phone: +33 (0)1 49 70 99 72
>> Mail: sebastien.han at enovance.com
>> Address: 11 bis, rue Roquépine - 75008 Paris
>> Web: www.enovance.com - Twitter: @enovance
>
>_______________________________________________
>ceph-users mailing list
>ceph-users at lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com