[Single OSD performance on SSD] Can't go over 3, 2K IOPS

mark.nelson@xxxxxxxxxxx (Mark Nelson) · Thu, 28 Aug 2014 12:43:07 -0500

On 08/28/2014 12:39 PM, Somnath Roy wrote:
> Hi Sebastian,
> If you are running the latest Ceph master, we have made some changes that should improve your read performance from SSD by a factor of ~5X when the IOs are hitting the disks. When serving from memory, the improvement is even greater. With an increasing number of clients, a single OSD will eventually become CPU bound in both the read-from-disk and read-from-memory scenarios. Several new config options were introduced:
>
>          osd_op_num_threads_per_shard
>          osd_op_num_shards
>          throttler_perf_counter
>          osd_enable_op_tracker
>          filestore_fd_cache_size
>          filestore_fd_cache_shards
>
> The work pool for the IO path is now sharded, and the options above control it. osd_op_threads are no longer in the IO path. The filestore FD cache is also sharded now.
> In my setup (64GB RAM, 40-core CPU with HT enabled), the following config file on a single OSD gives the best result for 4k random reads.
>
> [global]
>
>          filestore_xattr_use_omap = true
>
>          debug_lockdep = 0/0
>          debug_context = 0/0
>          debug_crush = 0/0
>          debug_buffer = 0/0
>          debug_timer = 0/0
>          debug_filer = 0/0
>          debug_objecter = 0/0
>          debug_rados = 0/0
>          debug_rbd = 0/0
>          debug_journaler = 0/0
>          debug_objectcacher = 0/0
>          debug_client = 0/0
>          debug_osd = 0/0
>          debug_optracker = 0/0
>          debug_objclass = 0/0
>          debug_filestore = 0/0
>          debug_journal = 0/0
>          debug_ms = 0/0
>          debug_monc = 0/0
>          debug_tp = 0/0
>          debug_auth = 0/0
>          debug_finisher = 0/0
>          debug_heartbeatmap = 0/0
>          debug_perfcounter = 0/0
>          debug_asok = 0/0
>          debug_throttle = 0/0
>          debug_mon = 0/0
>          debug_paxos = 0/0
>          debug_rgw = 0/0
>          osd_op_threads = 5
>          osd_op_num_threads_per_shard = 1
>          osd_op_num_shards = 25
>          #osd_op_num_sharded_pool_threads = 25
>          filestore_op_threads = 4
>
>          ms_nocrc = true
>          filestore_fd_cache_size = 64
>          filestore_fd_cache_shards = 32
>          cephx sign messages = false
>          cephx require signatures = false
>
>          ms_dispatch_throttle_bytes = 0
>          throttler_perf_counter = false
>
>
> [osd]
>          osd_client_message_size_cap = 0
>          osd_client_message_cap = 0
>          osd_enable_op_tracker = false
>
>
> From what I saw, the optracker is one of the major bottlenecks, and we are in the process of optimizing it. For now, code to enable/disable the optracker has been introduced. Several bottlenecks at the filestore level have also been removed.
> Unfortunately, we have yet to optimize the write path. All of these changes should help the write path as well, but the improvement will not be visible until all of the lock serialization is removed.

This is what I'm waiting for. :)  I've been meaning to ask you Somnath, 
how goes progress?

Mark
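
For readers unfamiliar with the sharded work pool mentioned in the quoted message, here is a minimal Python sketch of the general idea. It is not Ceph's implementation: the rule that maps an op to a shard by placement-group id is an assumption made for illustration, and the names ShardedWorkQueue, submit, drain and total_worker_threads are hypothetical. The two constants take their values from the osd_op_num_shards and osd_op_num_threads_per_shard settings in the quoted config.

```python
import queue
import threading
from collections import defaultdict

NUM_SHARDS = 25          # osd_op_num_shards in the quoted config
THREADS_PER_SHARD = 1    # osd_op_num_threads_per_shard in the quoted config


def total_worker_threads(num_shards, threads_per_shard):
    """Number of worker threads a sharded pool of this shape would run."""
    return num_shards * threads_per_shard


class ShardedWorkQueue:
    """Toy sharded queue: every shard owns its own queue and worker threads,
    so workers only contend with the other workers of the same shard."""

    def __init__(self, num_shards, threads_per_shard, handler):
        self.handler = handler
        self.shards = [queue.Queue() for _ in range(num_shards)]
        self.workers = []
        for shard in self.shards:
            for _ in range(threads_per_shard):
                worker = threading.Thread(
                    target=self._run, args=(shard,), daemon=True
                )
                worker.start()
                self.workers.append(worker)

    def submit(self, pg_id, op):
        # Assumed placement rule (illustration only): all ops for one
        # placement group go to the same shard.
        self.shards[pg_id % len(self.shards)].put((pg_id, op))

    def _run(self, shard):
        # Each worker loops forever on its own shard's queue.
        while True:
            pg_id, op = shard.get()
            try:
                self.handler(pg_id, op)
            finally:
                shard.task_done()

    def drain(self):
        # Block until every queued op has been handled.
        for shard in self.shards:
            shard.join()
```

With 25 shards and 1 thread per shard, such a pool runs 25 worker threads, and with a single thread per shard the ops of any one placement group are handled in submission order.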

>
> Thanks & Regards
> Somnath
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf Of Sebastien Han
> Sent: Thursday, August 28, 2014 9:12 AM
> To: ceph-users
> Cc: Mark Nelson
> Subject: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS
>
> Hey all,
>
> It has been a while since the last performance-related thread on the ML :p I've been running some experiments to see how much I can get from an SSD on a Ceph cluster.
> To achieve that I did something pretty simple:
>
> * Debian wheezy 7.6
> * kernel from debian 3.14-0.bpo.2-amd64
> * 1 cluster, 3 mons (i'd like to keep this realistic since in a real deployment i'll use 3)
> * 1 OSD backed by an SSD (journal and osd data on the same device)
> * replica count of 1
> * partitions are perfectly aligned
> * io scheduler is set to noop, but deadline was showing the same results
> * no updatedb running
>
> About the box:
>
> * 32GB of RAM
> * 12 cores with HT @ 2,4 GHz
> * WB cache is enabled on the controller
> * 10Gbps network (doesn't help here)
>
> The SSD is a 200G Intel DC S3700 and is capable of delivering around 29K IOPS with random 4k writes (my fio results).
> As a benchmark tool I used fio with the rbd engine (thanks Deutsche Telekom guys!).
>
> O_DIRECT and O_DSYNC don't seem to be a problem for the SSD:
>
> # dd if=/dev/urandom of=rand.file bs=4k count=65536
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
>
> # du -sh rand.file
> 256M    rand.file
>
> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
>
> See my ceph.conf:
>
> [global]
>    auth cluster required = cephx
>    auth service required = cephx
>    auth client required = cephx
>    fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
>    osd pool default pg num = 4096
>    osd pool default pgp num = 4096
>    osd pool default size = 2
>    osd crush chooseleaf type = 0
>
>     debug lockdep = 0/0
>          debug context = 0/0
>          debug crush = 0/0
>          debug buffer = 0/0
>          debug timer = 0/0
>          debug journaler = 0/0
>          debug osd = 0/0
>          debug optracker = 0/0
>          debug objclass = 0/0
>          debug filestore = 0/0
>          debug journal = 0/0
>          debug ms = 0/0
>          debug monc = 0/0
>          debug tp = 0/0
>          debug auth = 0/0
>          debug finisher = 0/0
>          debug heartbeatmap = 0/0
>          debug perfcounter = 0/0
>          debug asok = 0/0
>          debug throttle = 0/0
>
> [mon]
>    mon osd down out interval = 600
>    mon osd min down reporters = 13
>      [mon.ceph-01]
>      host = ceph-01
>      mon addr = 172.20.20.171
>        [mon.ceph-02]
>      host = ceph-02
>      mon addr = 172.20.20.172
>        [mon.ceph-03]
>      host = ceph-03
>      mon addr = 172.20.20.173
>
>          debug lockdep = 0/0
>          debug context = 0/0
>          debug crush = 0/0
>          debug buffer = 0/0
>          debug timer = 0/0
>          debug journaler = 0/0
>          debug osd = 0/0
>          debug optracker = 0/0
>          debug objclass = 0/0
>          debug filestore = 0/0
>          debug journal = 0/0
>          debug ms = 0/0
>          debug monc = 0/0
>          debug tp = 0/0
>          debug auth = 0/0
>          debug finisher = 0/0
>          debug heartbeatmap = 0/0
>          debug perfcounter = 0/0
>          debug asok = 0/0
>          debug throttle = 0/0
>
> [osd]
>    osd mkfs type = xfs
>    osd mkfs options xfs = -f -i size=2048
>    osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>    osd journal size = 20480
>    cluster_network = 172.20.20.0/24
>    public_network = 172.20.20.0/24
>    osd mon heartbeat interval = 30
>    # Performance tuning
>    filestore merge threshold = 40
>    filestore split multiple = 8
>    osd op threads = 8
>    # Recovery tuning
>    osd recovery max active = 1
>    osd max backfills = 1
>    osd recovery op priority = 1
>
>
>          debug lockdep = 0/0
>          debug context = 0/0
>          debug crush = 0/0
>          debug buffer = 0/0
>          debug timer = 0/0
>          debug journaler = 0/0
>          debug osd = 0/0
>          debug optracker = 0/0
>          debug objclass = 0/0
>          debug filestore = 0/0
>          debug journal = 0/0
>          debug ms = 0/0
>          debug monc = 0/0
>          debug tp = 0/0
>          debug auth = 0/0
>          debug finisher = 0/0
>          debug heartbeatmap = 0/0
>          debug perfcounter = 0/0
>          debug asok = 0/0
>          debug throttle = 0/0
>
> Disabling all debugging made me win 200/300 more IOPS.
>
> See my fio template:
>
> [global]
> #logging
> #write_iops_log=write_iops_log
> #write_bw_log=write_bw_log
> #write_lat_log=write_lat_lo
>
> time_based
> runtime=60
>
> ioengine=rbd
> clientname=admin
> pool=test
> rbdname=fio
> invalidate=0    # mandatory
> #rw=randwrite
> rw=write
> bs=4k
> #bs=32m
> size=5G
> group_reporting
>
> [rbd_iodepth32]
> iodepth=32
> direct=1
>
> See my fio output:
>
> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
> fio-2.1.11-14-gb74e
> Starting 1 process
> rbd engine: RBD version: 0.1.8
> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] [0/3219/0 iops] [eta 00m:00s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 00:28:26 2014
>    write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec
>      slat (usec): min=42, max=1578, avg=66.50, stdev=16.96
>      clat (msec): min=1, max=28, avg= 9.85, stdev= 1.48
>       lat (msec): min=1, max=28, avg= 9.92, stdev= 1.47
>      clat percentiles (usec):
>       |  1.00th=[ 6368],  5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9152],
>       | 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], 60.00th=[10048],
>       | 70.00th=[10176], 80.00th=[10560], 90.00th=[10944], 95.00th=[11456],
>       | 99.00th=[13120], 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
>       | 99.99th=[28032]
>      bw (KB  /s): min=11864, max=13808, per=100.00%, avg=12864.36, stdev=407.35
>      lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, 50=0.41%
>    cpu          : usr=19.15%, sys=4.69%, ctx=326309, majf=0, minf=426088
>    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, 32=66.1%, >=64=0.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>       complete  : 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
>       issued    : total=r=0/w=192862/d=0, short=r=0/w=0/d=0
>       latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>    WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, maxb=12855KB/s, mint=60010msec, maxt=60010msec
>
> Disk stats (read/write):
>      dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%, aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, aggrutil=0.01%
>    sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01%
>
> I tried to tweak several parameters like:
>
> filestore_wbthrottle_xfs_ios_start_flusher = 10000
> filestore_wbthrottle_xfs_ios_hard_limit = 10000
> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
> filestore queue max ops = 2000
>
> But I didn't see any improvement.
>
> Then I tried other things:
>
> * Increasing the io_depth up to 256 or 512 gave me between 50 and 100 more IOPS, but that is not a realistic workload anymore and the gain is not significant.
> * adding another SSD for the journal, still getting 3,2K IOPS
> * I tried with rbd bench and I also got 3K IOPS
> * I ran the test on a client machine and then locally on the server, still getting 3,2K IOPS
> * put the journal in memory, still getting 3,2K IOPS
> * with 2 clients running the test in parallel I got a total of 3,6K IOPS but I don't seem to be able to go over that
> * I tried adding another OSD to that SSD, so I had 2 OSDs and 2 journals on 1 SSD, got 4,5K IOPS YAY!
>
> Given the result of the last test, it seems that something is limiting the number of IOPS per OSD process.
>
> Running the test on a client or locally didn't show any difference.
> So it looks to me that there is some contention within Ceph that might cause this.
>
> I also ran perf and looked at the output, everything looks decent, but someone might want to have a look at it :).
>
> We have been able to reproduce this on 3 distinct platforms with some deviations (because of the hardware) but the behaviour is the same.
> Any thoughts will be highly appreciated; getting only 3,2K out of a 29K IOPS SSD is a bit frustrating :).
>
> Cheers.
> ----
> Sébastien Han
> Cloud Architect
>
> "Always give 100%. Unless you're giving blood."
>
> Phone: +33 (0)1 49 70 99 72
> Mail: sebastien.han at enovance.com
> Address : 11 bis, rue Roquépine - 75008 Paris Web : www.enovance.com - Twitter : @enovance
>
>
>
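
As a sanity check on the figures quoted above, the following Python sketch redoes the arithmetic. It only uses numbers that appear in the quoted dd and fio output; the function and variable names are made up for illustration. The second function applies Little's law (throughput = outstanding requests / average latency), which assumes the queue stays full at the configured iodepth for the whole run.

```python
BLOCK_SIZE = 4096  # bs=4k in both the dd and the fio runs


def iops_from_throughput(total_bytes, seconds, block_size):
    """IOPS implied by moving total_bytes in `seconds` with fixed-size IOs."""
    return total_bytes / seconds / block_size


def iops_from_queue_depth(iodepth, avg_latency_s):
    """Little's law: sustained IOPS with `iodepth` requests always in flight."""
    return iodepth / avg_latency_s


# dd with oflag=dsync,direct on the raw SSD: 268435456 bytes in 2.73628 s
dd_iops = iops_from_throughput(268435456, 2.73628, BLOCK_SIZE)

# fio through rbd: iodepth=32, average total latency 9.92 ms
fio_predicted_iops = iops_from_queue_depth(32, 9.92e-3)

# fio reported iops=3213 at 4 KB per write; compare with bw=12855KB/s
fio_bandwidth_kb_s = 3213 * 4

print(f"dd on raw device : {dd_iops:.0f} IOPS")
print(f"fio, Little's law: {fio_predicted_iops:.0f} IOPS")
print(f"fio bandwidth    : {fio_bandwidth_kb_s} KB/s")
```

The raw device sustains roughly 24K synchronous 4k writes per second even at queue depth 1, while the roughly 10 ms per-op latency seen through the OSD at queue depth 32 works out to about 3.2K IOPS, which matches the reported figure. That is consistent with the bottleneck being per-op latency inside the OSD rather than the SSD.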