Can you run the fio test again, but with a queue depth of 32? This will probably show what your cluster is capable of. Adding more nodes with SSDs will probably help it scale, but only at higher IO depths. At low queue depths you are probably already at the limit, as per my earlier email.

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of mad Engineer
Sent: 09 March 2015 17:23
To: Nick Fisk
Cc: ceph-users
Subject: Re: Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel

Thank you, Nick, for explaining the problem with 4k writes. The queue depth used in this setup is 256, the maximum supported. Can you clarify that adding more nodes will not increase IOPS? In general, how do we increase the IOPS of a Ceph cluster?

Thanks for your help

On Sat, Mar 7, 2015 at 5:57 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:

You are hitting serial latency limits. For a 4 KB sync write to happen, it has to:

1. Travel across the network from the client to the primary OSD
2. Be processed by Ceph
3. Get written to the primary OSD
4. The ack travels back across the network to the client

At 4 KB these four steps take up a very high percentage of the total processing time compared to the actual write to the SSD. Apart from faster (more GHz) CPUs, which will improve step 2, there's not much that can be done. Future Ceph releases may improve step 2 as well, but I wouldn't imagine it will change dramatically. Replication level >1 will also see the IOPS drop, as you are introducing yet more Ceph processing and network delays, unless a future Ceph feature is implemented where the ack is returned to the client once the data has hit the first OSD.

Still, 1000 IOPS is not that bad. You mention it needs to achieve 8000 IOPS to replace your existing SAN; at what queue depth is this required? You are getting way above that at a queue depth of only 16. I doubt most Ethernet-based enterprise SANs would be able to provide 8000 IOPS at a queue depth of 1, as network delays alone would limit you to around that figure. A network delay of 0.1 ms will limit you to 10,000 IOPS, 0.2 ms to 5,000 IOPS, and so on.

If you really do need pure SSD performance for a certain client, you will need to move the SSD local to it using some sort of caching software running on the client, although this can bring its own challenges.

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of mad Engineer
> Sent: 07 March 2015 10:55
> To: Somnath Roy
> Cc: ceph-users
> Subject: Re: Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel
>
> Update:
>
> Hardware:
> Upgraded RAID controller to LSI MegaRAID 9341 - 12 Gbps
> 3 x Samsung 840 EVO - was showing 45K IOPS for a fio test with 7 threads and 4k block size in JBOD mode
> CPU - 16 cores @ 2.27 GHz
> RAM - 24 GB
> NIC - 10 Gbit with under 1 ms latency; iperf shows 9.18 Gbps between host and client
>
> Software:
> Ubuntu 14.04 with stock kernel 3.13
> Upgraded from firefly to giant [ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)]
> Changed the file system to btrfs and the I/O scheduler to noop.
>
> Ceph Setup:
> Replication set to 1, using 2 SSD OSDs and 1 SSD for the journal. All are Samsung 840 EVO in JBOD mode on a single server.
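To make the queue-depth-32 re-test suggested at the top of this thread concrete, a fio invocation along the lines below would keep roughly 32 writes in flight against /dev/rbd0 (the device used in the dd test further down). This is only a sketch: the job name and runtime are placeholders, and an asynchronous engine such as libaio with direct I/O is assumed, since the sync engine used in the earlier fio run always operates at an effective queue depth of 1.

    fio --name=qd32-test --filename=/dev/rbd0 --rw=write --bs=4k \
        --ioengine=libaio --direct=1 --iodepth=32 --numjobs=1 \
        --runtime=60 --time_based --group_reporting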
> Configuration:
>
> [global]
> fsid = 979f32fc-6f31-43b0-832f-29fcc4c5a648
> mon_initial_members = ceph1
> mon_host = 10.99.10.118
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> osd_pool_default_size = 1
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 250
> osd_pool_default_pgp_num = 250
> debug_lockdep = 0/0
> debug_context = 0/0
> debug_crush = 0/0
> debug_buffer = 0/0
> debug_timer = 0/0
> debug_filer = 0/0
> debug_objecter = 0/0
> debug_rados = 0/0
> debug_rbd = 0/0
> debug_journaler = 0/0
> debug_objectcatcher = 0/0
> debug_client = 0/0
> debug_osd = 0/0
> debug_optracker = 0/0
> debug_objclass = 0/0
> debug_filestore = 0/0
> debug_journal = 0/0
> debug_ms = 0/0
> debug_monc = 0/0
> debug_tp = 0/0
> debug_auth = 0/0
> debug_finisher = 0/0
> debug_heartbeatmap = 0/0
> debug_perfcounter = 0/0
> debug_asok = 0/0
> debug_throttle = 0/0
> debug_mon = 0/0
> debug_paxos = 0/0
> debug_rgw = 0/0
>
> [client]
> rbd_cache = true
>
> Client:
> Ubuntu 14.04 with 16 cores @ 2.53 GHz and 24 GB RAM
>
> Results:
>
> rados bench -p rbd -b 4096 -t 16 10 write
> Maintaining 16 concurrent writes of 4096 bytes for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_ubuntucompute_3931
>  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat     avg lat
>    0       0         0         0         0         0         -           0
>    1      16      6370      6354   24.8124   24.8203   0.00221  0.00251512
>    2      16     11618     11602   22.6536      20.5  0.001025  0.00275493
>    3      16     16889     16873   21.9637   20.5898  0.001288  0.00281797
>    4      16     17310     17294    16.884   1.64453  0.054066  0.00365805
>    5      16     17695     17679    13.808   1.50391  0.001451  0.00444409
>    6      16     18127     18111   11.7868    1.6875  0.001463  0.00527521
>    7      16     21647     21631   12.0669     13.75  0.001601   0.0051773
>    8      16     28056     28040   13.6872   25.0352  0.005268  0.00456353
>    9      16     28947     28931    12.553   3.48047   0.06647  0.00494762
>   10      16     29346     29330   11.4536   1.55859  0.001341  0.00542312
> Total time run:         10.077931
> Total writes made:      29347
> Write size:             4096
> Bandwidth (MB/sec):     11.375
> Stddev Bandwidth:       10.5124
> Max bandwidth (MB/sec): 25.0352
> Min bandwidth (MB/sec): 0
> Average Latency:        0.00548729
> Stddev Latency:         0.0169545
> Max latency:            0.249019
> Min latency:            0.000748
>
> ceph -s
>     cluster 979f32fc-6f31-43b0-832f-29fcc4c5a648
>      health HEALTH_OK
>      monmap e1: 1 mons at {ceph1=10.99.10.118:6789/0}, election epoch 1, quorum 0 ceph1
>      osdmap e30: 2 osds: 2 up, 2 in
>       pgmap v255: 250 pgs, 1 pools, 92136 kB data, 23035 objects
>             77068 kB used, 929 GB / 931 GB avail
>                  250 active+clean
>   client io 11347 kB/s wr, 2836 op/s
>
> iostat
> device:     tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
> sda        6.00        0.00      112.00         0       448
> sdb     3985.50        0.00    21048.00         0     84192
> sdd      414.50        0.00    14083.00         0     56332
> sdc      415.00        0.00    10944.00         0     43776
>
> where
> sdb - journal
> sdc, sdd - OSDs
>
> dd output:
> dd if=/dev/zero of=/dev/rbd0 bs=4k count=25000 oflag=direct
> 25000+0 records in
> 25000+0 records out
> 102400000 bytes (102 MB) copied, 23.0863 s, 4.4 MB/s
>
> Here performance has increased from 1 MB/s to 4.4 MB/s, but that is not what I was expecting.
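As a rough cross-check against the serial-latency explanation earlier in the thread: with oflag=direct and bs=4k, dd issues one 4 KB write at a time, so this run is bounded by per-write round-trip latency rather than by SSD throughput. Approximately:

    4.4 MB/s / 4 KB per write ~ 1100 writes/s, i.e. about 0.9 ms per round trip

which is in line with the roughly 1-2 ms per-write latencies in the fio results that follow.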
> fio with 4k writes and 2 threads:
>
> journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.1.3
> Starting 2 processes
> Jobs: 2 (f=2): [WW] [1.4% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:27m:25s]
> journal-test: (groupid=0, jobs=2): err= 0: pid=4077: Sat Mar 7 02:50:45 2015
>   write: io=292936KB, bw=3946.1KB/s, iops=986, runt= 74236msec
>     clat (usec): min=645, max=16855K, avg=2023.56, stdev=88071.07
>      lat (usec): min=645, max=16855K, avg=2023.97, stdev=88071.07
>     clat percentiles (usec):
>      |  1.00th=[  884],  5.00th=[ 1192], 10.00th=[ 1304], 20.00th=[ 1448],
>      | 30.00th=[ 1512], 40.00th=[ 1560], 50.00th=[ 1592], 60.00th=[ 1624],
>      | 70.00th=[ 1656], 80.00th=[ 1704], 90.00th=[ 1752], 95.00th=[ 1816],
>      | 99.00th=[ 1928], 99.50th=[ 1992], 99.90th=[ 2160], 99.95th=[ 2288],
>      | 99.99th=[39168]
>     bw (KB /s): min=   54, max= 3568, per=64.10%, avg=2529.43, stdev=315.56
>     lat (usec) : 750=0.07%, 1000=2.53%
>     lat (msec) : 2=96.96%, 4=0.43%, 50=0.01%, >=2000=0.01%
>   cpu          : usr=0.51%, sys=2.04%, ctx=73550, majf=0, minf=93
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=73234/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
>   WRITE: io=292936KB, aggrb=3946KB/s, minb=3946KB/s, maxb=3946KB/s, mint=74236msec, maxt=74236msec
>
> Disk stats (read/write):
>   rbd0: ios=186/73232, merge=0/0, ticks=120/109676, in_queue=143448, util=100.00%
>
> How can I improve the performance of 4k writes? Will adding more nodes improve this?
>
> Thanks for any help

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com