Thanks a lot for the answers, even if we drifted from the main subject a little bit.
Thanks Somnath for sharing this; when can we expect any code that might improve _write_ performance?
@Mark, thanks for trying this :) Unfortunately, using nobarrier and another dedicated SSD for the journal (plus your ceph setting) didn't bring much; I can now reach 3.5K IOPS.
By any chance, would it be possible for you to test with a single SSD-backed OSD?

On 28 Aug 2014, at 18:11, Sebastien Han <sebastien.han at enovance.com> wrote:

> Hey all,
> 
> It has been a while since the last performance-related thread on the ML :p
> I've been running some experiments to see how much I can get from an SSD in a Ceph cluster.
> To achieve that I did something pretty simple:
> 
> * Debian Wheezy 7.6
> * kernel 3.14-0.bpo.2-amd64 from Debian backports
> * 1 cluster, 3 mons (I'd like to keep this realistic, since in a real deployment I'll use 3)
> * 1 OSD backed by an SSD (journal and OSD data on the same device)
> * replica count of 1
> * partitions are perfectly aligned
> * io scheduler is set to noop, but deadline was showing the same results
> * no updatedb running
> 
> About the box:
> 
> * 32GB of RAM
> * 12 cores with HT @ 2.4 GHz
> * WB cache is enabled on the controller
> * 10Gbps network (doesn't help here)
> 
> The SSD is a 200G Intel DC S3700 and is capable of delivering around 29K IOPS with random 4k writes (my fio results).
> As a benchmark tool I used fio with the rbd engine (thanks Deutsche Telekom guys!).
> 
> O_DIRECT and O_DSYNC don't seem to be a problem for the SSD:
> 
> # dd if=/dev/urandom of=rand.file bs=4k count=65536
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
> 
> # du -sh rand.file
> 256M rand.file
> 
> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
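> For anyone who wants to reproduce that raw-device number, an fio job roughly like this should do it. A sketch only: /dev/sdo is just the device name from the dd test above, yours will differ, and the run overwrites whatever is on the device.
> 
> # fio --name=raw-4k-randwrite --filename=/dev/sdo \
>       --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
>       --iodepth=32 --runtime=60 --time_based --group_reporting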
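> For completeness, the pool and image referenced by the fio template further down can be created along these lines (a sketch: the names, the 5G size, and the pg count are taken from the template and ceph.conf that follow):
> 
> # ceph osd pool create test 4096 4096
> # rbd create fio --pool test --size 5120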
> See my ceph.conf:
> 
> [global]
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
> osd pool default pg num = 4096
> osd pool default pgp num = 4096
> osd pool default size = 2
> osd crush chooseleaf type = 0
> 
> debug lockdep = 0/0
> debug context = 0/0
> debug crush = 0/0
> debug buffer = 0/0
> debug timer = 0/0
> debug journaler = 0/0
> debug osd = 0/0
> debug optracker = 0/0
> debug objclass = 0/0
> debug filestore = 0/0
> debug journal = 0/0
> debug ms = 0/0
> debug monc = 0/0
> debug tp = 0/0
> debug auth = 0/0
> debug finisher = 0/0
> debug heartbeatmap = 0/0
> debug perfcounter = 0/0
> debug asok = 0/0
> debug throttle = 0/0
> 
> [mon]
> mon osd down out interval = 600
> mon osd min down reporters = 13
> [mon.ceph-01]
> host = ceph-01
> mon addr = 172.20.20.171
> [mon.ceph-02]
> host = ceph-02
> mon addr = 172.20.20.172
> [mon.ceph-03]
> host = ceph-03
> mon addr = 172.20.20.173
> 
> debug lockdep = 0/0
> debug context = 0/0
> debug crush = 0/0
> debug buffer = 0/0
> debug timer = 0/0
> debug journaler = 0/0
> debug osd = 0/0
> debug optracker = 0/0
> debug objclass = 0/0
> debug filestore = 0/0
> debug journal = 0/0
> debug ms = 0/0
> debug monc = 0/0
> debug tp = 0/0
> debug auth = 0/0
> debug finisher = 0/0
> debug heartbeatmap = 0/0
> debug perfcounter = 0/0
> debug asok = 0/0
> debug throttle = 0/0
> 
> [osd]
> osd mkfs type = xfs
> osd mkfs options xfs = -f -i size=2048
> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
> osd journal size = 20480
> cluster_network = 172.20.20.0/24
> public_network = 172.20.20.0/24
> osd mon heartbeat interval = 30
> # Performance tuning
> filestore merge threshold = 40
> filestore split multiple = 8
> osd op threads = 8
> # Recovery tuning
> osd recovery max active = 1
> osd max backfills = 1
> osd recovery op priority = 1
> 
> debug lockdep = 0/0
> debug context = 0/0
> debug crush = 0/0
> debug buffer = 0/0
> debug timer = 0/0
> debug journaler = 0/0
> debug osd = 0/0
> debug optracker = 0/0
> debug objclass = 0/0
> debug filestore = 0/0
> debug journal = 0/0
> debug ms = 0/0
> debug monc = 0/0
> debug tp = 0/0
> debug auth = 0/0
> debug finisher = 0/0
> debug heartbeatmap = 0/0
> debug perfcounter = 0/0
> debug asok = 0/0
> debug throttle = 0/0
> 
> Disabling all debugging gained me 200 to 300 more IOPS.
> 
> See my fio template:
> 
> [global]
> #logging
> #write_iops_log=write_iops_log
> #write_bw_log=write_bw_log
> #write_lat_log=write_lat_lo
> 
> time_based
> runtime=60
> 
> ioengine=rbd
> clientname=admin
> pool=test
> rbdname=fio
> invalidate=0 # mandatory
> #rw=randwrite
> rw=write
> bs=4k
> #bs=32m
> size=5G
> group_reporting
> 
> [rbd_iodepth32]
> iodepth=32
> direct=1
> 
> See my fio output:
> 
> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
> fio-2.1.11-14-gb74e
> Starting 1 process
> rbd engine: RBD version: 0.1.8
> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] [0/3219/0 iops] [eta 00m:00s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 00:28:26 2014
>   write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec
>     slat (usec): min=42, max=1578, avg=66.50, stdev=16.96
>     clat (msec): min=1, max=28, avg= 9.85, stdev= 1.48
>      lat (msec): min=1, max=28, avg= 9.92, stdev= 1.47
>     clat percentiles (usec):
>      |  1.00th=[ 6368],  5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9152],
>      | 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], 60.00th=[10048],
>      | 70.00th=[10176], 80.00th=[10560], 90.00th=[10944], 95.00th=[11456],
>      | 99.00th=[13120], 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
>      | 99.99th=[28032]
>     bw (KB /s): min=11864, max=13808, per=100.00%, avg=12864.36, stdev=407.35
>     lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, 50=0.41%
>   cpu : usr=19.15%, sys=4.69%, ctx=326309, majf=0, minf=426088
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, 32=66.1%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
>      issued : total=r=0/w=192862/d=0, short=r=0/w=0/d=0
>      latency : target=0, window=0, percentile=100.00%, depth=32
> 
> Run status group 0 (all jobs):
>   WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, maxb=12855KB/s, mint=60010msec, maxt=60010msec
> 
> Disk stats (read/write):
>     dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%, aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, aggrutil=0.01%
>   sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01%
> 
> I tried to tweak several parameters like:
> 
> filestore_wbthrottle_xfs_ios_start_flusher = 10000
> filestore_wbthrottle_xfs_ios_hard_limit = 10000
> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
> filestore queue max ops = 2000
> 
> But that didn't bring any improvement.
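> In case anyone wants to replay those without editing ceph.conf and restarting, the values can be injected into a running OSD, for example (a sketch: osd.0 is an example id, and some filestore options may only take effect after a restart):
> 
> # ceph tell osd.0 injectargs '--filestore_queue_max_ops 2000'
> # ceph tell osd.0 injectargs '--filestore_wbthrottle_xfs_ios_start_flusher 10000'
> # ceph tell osd.0 injectargs '--filestore_wbthrottle_xfs_ios_hard_limit 10000'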
> Then I tried other things:
> 
> * Increasing the io_depth up to 256 or 512 gave me between 50 and 100 more IOPS, but that is not a realistic workload anymore and not that significant.
> * Adding another SSD for the journal: still getting 3.2K IOPS.
> * I tried with rbd bench and I also got 3K IOPS.
> * I ran the test on a client machine and then locally on the server: still getting 3.2K IOPS.
> * Putting the journal in memory: still getting 3.2K IOPS.
> * With 2 clients running the test in parallel I got a total of 3.6K IOPS, but I don't seem to be able to go beyond that.
> * I tried to add another OSD on that SSD, so I had 2 OSDs and 2 journals on 1 SSD: got 4.5K IOPS, YAY!
> 
> Given the results of the last test, it seems that something is limiting the number of IOPS per OSD process.
> 
> Running the test on a client or locally didn't show any difference.
> So it looks to me like there is some contention within Ceph that might cause this.
> 
> I also ran perf and looked at the output; everything looks decent, but someone might want to have a look at it :).
> 
> We have been able to reproduce this on 3 distinct platforms with some deviations (because of the hardware), but the behaviour is the same.
> Any thoughts will be highly appreciated; only getting 3.2K out of a 29K IOPS SSD is a bit frustrating :).
> 
> Cheers.
> ----
> Sébastien Han
> Cloud Architect
> 
> "Always give 100%. Unless you're giving blood."
> 
> Phone: +33 (0)1 49 70 99 72
> Mail: sebastien.han at enovance.com
> Address: 11 bis, rue Roquépine - 75008 Paris
> Web: www.enovance.com - Twitter: @enovance

Cheers.
----
Sébastien Han
Cloud Architect

"Always give 100%. Unless you're giving blood."

Phone: +33 (0)1 49 70 99 72
Mail: sebastien.han at enovance.com
Address: 11 bis, rue Roquépine - 75008 Paris
Web: www.enovance.com - Twitter: @enovance
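PS: for anyone who wants to capture a perf profile of the OSD like the one mentioned above, something along these lines should work (a sketch, assuming a single ceph-osd process is running on the box):

# perf record -g -p $(pidof ceph-osd) -- sleep 30
# perf report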