[Single OSD performance on SSD] Can't go over 3.2K IOPS


 



That's definitely interesting.

Is this meant to be released in a Firefly point release, or will these
changes land in Giant?
-- 
David Moreau Simard


On 2014-08-28, 1:49 PM, "Somnath Roy" <Somnath.Roy at sandisk.com> wrote:

>Yes, Mark, all of my changes are in Ceph master now, and we are seeing a
>significant random-read (RR) performance improvement with them.
>
>Thanks & Regards
>Somnath
>
>-----Original Message-----
>From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf Of
>Mark Nelson
>Sent: Thursday, August 28, 2014 10:43 AM
>To: ceph-users at lists.ceph.com
>Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over
>3.2K IOPS
>
>On 08/28/2014 12:39 PM, Somnath Roy wrote:
>> Hi Sebastien,
>> If you are trying the latest Ceph master, there are some changes we
>>made that should increase your read performance from SSD by a factor of
>>~5X when the IOs are hitting the disks. When serving from memory, the
>>improvement is even greater. The single OSD will eventually become CPU
>>bound with an increasing number of clients, in both the
>>reading-from-disk and the from-memory scenarios. Some new config
>>options were introduced; here they are:
>>
>>          osd_op_num_threads_per_shard
>>          osd_op_num_shards
>>          throttler_perf_counter
>>          osd_enable_op_tracker
>>          filestore_fd_cache_size
>>          filestore_fd_cache_shards
>>
>> The worker pool for the IO path is now sharded, and the above options
>>control this: the total number of worker threads is osd_op_num_shards x
>>osd_op_num_threads_per_shard (25 x 1 = 25 in the config below).
>>osd_op_threads is no longer in the IO path. Also, the filestore FD
>>cache is sharded now.
>> In my setup (64GB RAM, 40-core CPU with HT enabled), the following
>>config file on a single OSD gives the optimum result for 4K RR reads.
>>
>> [global]
>>
>>          filestore_xattr_use_omap = true
>>
>>          debug_lockdep = 0/0
>>          debug_context = 0/0
>>          debug_crush = 0/0
>>          debug_buffer = 0/0
>>          debug_timer = 0/0
>>          debug_filer = 0/0
>>          debug_objecter = 0/0
>>          debug_rados = 0/0
>>          debug_rbd = 0/0
>>          debug_journaler = 0/0
>>          debug_objectcacher = 0/0
>>          debug_client = 0/0
>>          debug_osd = 0/0
>>          debug_optracker = 0/0
>>          debug_objclass = 0/0
>>          debug_filestore = 0/0
>>          debug_journal = 0/0
>>          debug_ms = 0/0
>>          debug_monc = 0/0
>>          debug_tp = 0/0
>>          debug_auth = 0/0
>>          debug_finisher = 0/0
>>          debug_heartbeatmap = 0/0
>>          debug_perfcounter = 0/0
>>          debug_asok = 0/0
>>          debug_throttle = 0/0
>>          debug_mon = 0/0
>>          debug_paxos = 0/0
>>          debug_rgw = 0/0
>>          osd_op_threads = 5
>>          osd_op_num_threads_per_shard = 1
>>          osd_op_num_shards = 25
>>          #osd_op_num_sharded_pool_threads = 25
>>          filestore_op_threads = 4
>>
>>          ms_nocrc = true
>>          filestore_fd_cache_size = 64
>>          filestore_fd_cache_shards = 32
>>          cephx sign messages = false
>>          cephx require signatures = false
>>
>>          ms_dispatch_throttle_bytes = 0
>>          throttler_perf_counter = false
>>
>>
>> [osd]
>>          osd_client_message_size_cap = 0
>>          osd_client_message_cap = 0
>>          osd_enable_op_tracker = false
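>>
>> As a quick sanity check, a minimal sketch of verifying through the OSD
>>admin socket that the new options took effect (the daemon id here is
>>illustrative):
>>
>>          ceph daemon osd.0 config get osd_op_num_shards
>>          ceph daemon osd.0 config get osd_enable_op_tracker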
>>
>>
>> What I saw is that the optracker is one of the major bottlenecks, and
>>we are in the process of optimizing it. For now, code to enable/disable
>>the optracker has been introduced. Also, several bottlenecks at the
>>filestore level have been removed.
>> Unfortunately, we have yet to optimize the write path. All of these
>>changes should help the write path as well, but the write path
>>improvement will not be visible until all the lock serialization is
>>removed.
>
>This is what I'm waiting for. :)  I've been meaning to ask you, Somnath:
>how goes progress?
>
>Mark
>
>>
>> Thanks & Regards
>> Somnath
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf
>> Of Sebastien Han
>> Sent: Thursday, August 28, 2014 9:12 AM
>> To: ceph-users
>> Cc: Mark Nelson
>> Subject: [ceph-users] [Single OSD performance on SSD] Can't go over
>> 3.2K IOPS
>>
>> Hey all,
>>
>> It has been a while since the last performance-related thread on the ML
>>:p I've been running some experiments to see how much I can get from an
>>SSD in a Ceph cluster.
>> To achieve that I did something pretty simple:
>>
>> * Debian wheezy 7.6
>> * kernel from debian 3.14-0.bpo.2-amd64
>> * 1 cluster, 3 mons (I'd like to keep this realistic since in a real
>> deployment I'll use 3)
>> * 1 OSD backed by an SSD (journal and osd data on the same device)
>> * replica count of 1
>> * partitions are perfectly aligned
>> * io scheduler is set to noop, but deadline was showing the same
>> results (see the snippet after this list)
>> * no updatedb running
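>>
>> For reference, a minimal sketch of checking and setting the scheduler
>>(assuming the SSD is /dev/sdo, as in the dd test below):
>>
>>   # show the available schedulers; the active one is in brackets
>>   cat /sys/block/sdo/queue/scheduler
>>   # switch to noop
>>   echo noop > /sys/block/sdo/queue/scheduler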
>>
>> About the box:
>>
>> * 32GB of RAM
>> * 12 cores with HT @ 2.4 GHz
>> * WB cache is enabled on the controller
>> * 10Gbps network (doesn't help here)
>>
>> The SSD is a 200G Intel DC S3700 and is capable of delivering around
>>29K IOPS with random 4K writes (my fio results). As a benchmark tool I
>>used fio with the rbd engine (thanks Deutsche Telekom guys!).
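>>
>> (The rbd engine is a recent addition to upstream fio; as a sketch,
>>fio's ./configure picks it up automatically when the librbd and
>>librados development headers are installed:)
>>
>> git clone git://git.kernel.dk/fio.git && cd fio
>> ./configure && make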
>>
>> O_DIRECT and O_DSYNC don't seem to be a problem for the SSD:
>>
>> # dd if=/dev/urandom of=rand.file bs=4k count=65536
>> 65536+0 records in
>> 65536+0 records out
>> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
>>
>> # du -sh rand.file
>> 256M    rand.file
>>
>> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
>> 65536+0 records in
>> 65536+0 records out
>> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
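>>
>> (98.1 MB/s at a 4K block size works out to roughly 24K synchronous
>>write IOPS at queue depth 1, so the drive itself is clearly not the
>>limit.) A roughly equivalent fio invocation, as a sketch (destructive
>>on /dev/sdo; flag set assumed, adjust for your fio version):
>>
>> fio --name=dsync4k --filename=/dev/sdo --rw=write --bs=4k \
>>     --ioengine=libaio --iodepth=1 --direct=1 --sync=1 \
>>     --time_based --runtime=30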
>>
>> See my ceph.conf:
>>
>> [global]
>>    auth cluster required = cephx
>>    auth service required = cephx
>>    auth client required = cephx
>>    fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
>>    osd pool default pg num = 4096
>>    osd pool default pgp num = 4096
>>    osd pool default size = 2
>>    osd crush chooseleaf type = 0
>>
>>     debug lockdep = 0/0
>>          debug context = 0/0
>>          debug crush = 0/0
>>          debug buffer = 0/0
>>          debug timer = 0/0
>>          debug journaler = 0/0
>>          debug osd = 0/0
>>          debug optracker = 0/0
>>          debug objclass = 0/0
>>          debug filestore = 0/0
>>          debug journal = 0/0
>>          debug ms = 0/0
>>          debug monc = 0/0
>>          debug tp = 0/0
>>          debug auth = 0/0
>>          debug finisher = 0/0
>>          debug heartbeatmap = 0/0
>>          debug perfcounter = 0/0
>>          debug asok = 0/0
>>          debug throttle = 0/0
>>
>> [mon]
>>    mon osd down out interval = 600
>>    mon osd min down reporters = 13
>>    [mon.ceph-01]
>>      host = ceph-01
>>      mon addr = 172.20.20.171
>>    [mon.ceph-02]
>>      host = ceph-02
>>      mon addr = 172.20.20.172
>>    [mon.ceph-03]
>>      host = ceph-03
>>      mon addr = 172.20.20.173
>>
>>          debug lockdep = 0/0
>>          debug context = 0/0
>>          debug crush = 0/0
>>          debug buffer = 0/0
>>          debug timer = 0/0
>>          debug journaler = 0/0
>>          debug osd = 0/0
>>          debug optracker = 0/0
>>          debug objclass = 0/0
>>          debug filestore = 0/0
>>          debug journal = 0/0
>>          debug ms = 0/0
>>          debug monc = 0/0
>>          debug tp = 0/0
>>          debug auth = 0/0
>>          debug finisher = 0/0
>>          debug heartbeatmap = 0/0
>>          debug perfcounter = 0/0
>>          debug asok = 0/0
>>          debug throttle = 0/0
>>
>> [osd]
>>    osd mkfs type = xfs
>>    osd mkfs options xfs = -f -i size=2048
>>    osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>>    osd journal size = 20480
>>    cluster_network = 172.20.20.0/24
>>    public_network = 172.20.20.0/24
>>    osd mon heartbeat interval = 30
>>    # Performance tuning
>>    filestore merge threshold = 40
>>    filestore split multiple = 8
>>    osd op threads = 8
>>    # Recovery tuning
>>    osd recovery max active = 1
>>    osd max backfills = 1
>>    osd recovery op priority = 1
>>
>>
>>          debug lockdep = 0/0
>>          debug context = 0/0
>>          debug crush = 0/0
>>          debug buffer = 0/0
>>          debug timer = 0/0
>>          debug journaler = 0/0
>>          debug osd = 0/0
>>          debug optracker = 0/0
>>          debug objclass = 0/0
>>          debug filestore = 0/0
>>          debug journal = 0/0
>>          debug ms = 0/0
>>          debug monc = 0/0
>>          debug tp = 0/0
>>          debug auth = 0/0
>>          debug finisher = 0/0
>>          debug heartbeatmap = 0/0
>>          debug perfcounter = 0/0
>>          debug asok = 0/0
>>          debug throttle = 0/0
>>
>> Disabling all debugging gained me another 200-300 IOPS.
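>>
>> For reference, a sketch of turning some of the debug subsystems off at
>>runtime instead of via ceph.conf (injectargs is standard; the option
>>list here is illustrative):
>>
>> ceph tell osd.0 injectargs '--debug_osd 0/0 --debug_ms 0/0 --debug_filestore 0/0'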
>>
>> See my fio template:
>>
>> [global]
>> #logging
>> #write_iops_log=write_iops_log
>> #write_bw_log=write_bw_log
>> #write_lat_log=write_lat_lo
>>
>> time_based
>> runtime=60
>>
>> ioengine=rbd
>> clientname=admin
>> pool=test
>> rbdname=fio
>> invalidate=0    # mandatory
>> #rw=randwrite
>> rw=write
>> bs=4k
>> #bs=32m
>> size=5G
>> group_reporting
>>
>> [rbd_iodepth32]
>> iodepth=32
>> direct=1
>>
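>> The rbd engine expects the target image to exist already; a minimal
>>sketch of creating it to match the template above (pool and image names
>>taken from the template, size in MB):
>>
>> rbd create fio --pool test --size 5120
>>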
>> See my fio output:
>>
>> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>> fio-2.1.11-14-gb74e
>> Starting 1 process
>> rbd engine: RBD version: 0.1.8
>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] [0/3219/0 iops] [eta 00m:00s]
>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 00:28:26 2014
>>    write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec
>>      slat (usec): min=42, max=1578, avg=66.50, stdev=16.96
>>      clat (msec): min=1, max=28, avg= 9.85, stdev= 1.48
>>       lat (msec): min=1, max=28, avg= 9.92, stdev= 1.47
>>      clat percentiles (usec):
>>       |  1.00th=[ 6368],  5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9152],
>>       | 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], 60.00th=[10048],
>>       | 70.00th=[10176], 80.00th=[10560], 90.00th=[10944], 95.00th=[11456],
>>       | 99.00th=[13120], 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
>>       | 99.99th=[28032]
>>      bw (KB  /s): min=11864, max=13808, per=100.00%, avg=12864.36, stdev=407.35
>>      lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, 50=0.41%
>>    cpu          : usr=19.15%, sys=4.69%, ctx=326309, majf=0, minf=426088
>>    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, 32=66.1%, >=64=0.0%
>>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>       complete  : 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
>>       issued    : total=r=0/w=192862/d=0, short=r=0/w=0/d=0
>>       latency   : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>>    WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s,
>> maxb=12855KB/s, mint=60010msec, maxt=60010msec
>>
>> Disk stats (read/write):
>>      dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%, aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, aggrutil=0.01%
>>    sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01%
>>
>> I tried to tweak several parameters like:
>>
>> filestore_wbthrottle_xfs_ios_start_flusher = 10000
>> filestore_wbthrottle_xfs_ios_hard_limit = 10000
>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
>> filestore_queue_max_ops = 2000
>>
>> But that didn't bring any improvement.
>>
>> Then I tried other things:
>>
>> * Increasing the iodepth up to 256 or 512 gave me between 50 and 100
>>more IOPS, but that's not a realistic workload anymore and not that
>>significant.
>> * adding another SSD for the journal: still getting 3.2K IOPS
>> * I tried with rbd bench and I also got 3K IOPS (see the sketch after
>>this list)
>> * I ran the test on a client machine and then locally on the server:
>> still getting 3.2K IOPS
>> * put the journal in memory: still getting 3.2K IOPS
>> * with 2 clients running the test in parallel I got a total of 3.6K
>> IOPS, but I don't seem to be able to go over that
>> * I tried adding another OSD to that SSD, so I had 2 OSDs and 2
>>journals on 1 SSD: got 4.5K IOPS, YAY!
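>>
>> A sketch of the rbd bench invocation referenced above (argument set
>>assumed; adjust to your rbd version):
>>
>> rbd bench-write fio --pool test --io-size 4096 --io-threads 32 \
>>     --io-total 1G --io-pattern rand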
>>
>> Given the results of that last test, it seems that something is
>>limiting the number of IOPS per OSD process.
>>
>> Running the test on a client or locally didn't show any difference.
>> So it looks to me like there is some contention within Ceph that might
>>be causing this.
>>
>> I also ran perf and looked at the output; everything looks decent, but
>>someone might want to have a look at it :).
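>>
>> For reference, a minimal sketch of that kind of perf run (the PID
>>lookup is illustrative; this box runs a single ceph-osd):
>>
>> perf record -g -p $(pidof ceph-osd) -- sleep 30
>> perf report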
>>
>> We have been able to reproduce this on 3 distinct platforms with some
>>deviations (because of the hardware), but the behaviour is the same.
>> Any thoughts will be highly appreciated; only getting 3.2K out of a
>>29K IOPS SSD is a bit frustrating :).
>>
>> Cheers.
>> ----
>> Sébastien Han
>> Cloud Architect
>>
>> "Always give 100%. Unless you're giving blood."
>>
>> Phone: +33 (0)1 49 70 99 72
>> Mail: sebastien.han at enovance.com
>> Address: 11 bis, rue Roquépine - 75008 Paris
>> Web: www.enovance.com - Twitter: @enovance
>>
>


