Can you run the fio test again, but with a queue depth of 32? This will probably show what your cluster is capable of. Adding more nodes with SSDs will probably help it scale, but only at higher IO depths. At low queue depths you are probably already at the limit, as per my earlier email.

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of mad Engineer
Sent: 09 March 2015 17:23
To: Nick Fisk
Cc: ceph-users
Subject: Re: Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel

Thank you, Nick, for explaining the problem with 4k writes. The queue depth used in this setup is 256, the maximum supported. Can you clarify that adding more nodes will not increase IOPS? In general, how do we increase the IOPS of a Ceph cluster?

Thanks for your help

On Sat, Mar 7, 2015 at 5:57 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:

You are hitting serial latency limits. For a 4 KB sync write to happen, it has to:

1. Travel across the network from the client to the primary OSD
2. Be processed by Ceph
3. Get written to the primary OSD
4. The ack travels back across the network to the client

At 4 KB these four steps take up a very high percentage of the total processing time compared to the actual write to the SSD. Apart from faster (more GHz) CPUs, which will improve step 2, there's not much that can be done. Future Ceph releases may improve step 2 as well, but I wouldn't imagine it will change dramatically. Replication level >1 will also see the IOPS drop, as you are introducing yet more Ceph processing and network delays, unless a future Ceph feature is implemented where the ack is returned to the client once the data has hit the first OSD.

Still, 1000 IOPS is not that bad. You mention it needs to achieve 8000 IOPS to replace your existing SAN; at what queue depth is this required? You are getting way above that at a queue depth of only 16. I doubt most Ethernet-based enterprise SANs would be able to provide 8000 IOPS at a queue depth of 1, as network delays alone would limit you to around that figure. A network delay of 0.1 ms will limit you to 10,000 IOPS, 0.2 ms to 5,000 IOPS, and so on.

If you really do need pure SSD performance for a certain client, you will need to move the SSD local to it using some sort of caching software running on the client, although this can bring its own challenges.

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of mad Engineer
> Sent: 07 March 2015 10:55
> To: Somnath Roy
> Cc: ceph-users
> Subject: Re: Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel
>
> Update:
>
> Hardware:
> Upgraded RAID controller to LSI MegaRAID 9341 - 12 Gbps
> 3 x Samsung 840 EVO - was showing 45K IOPS for a fio test with 7 threads and 4k block size in JBOD mode
> CPU - 16 cores @ 2.27 GHz
> RAM - 24 GB
> NIC - 10 Gbit with under 1 ms latency; iperf shows 9.18 Gbps between host and client
>
> Software:
> Ubuntu 14.04 with stock kernel 3.13
> Upgraded from firefly to giant [ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)]
> Changed the file system to btrfs and the I/O scheduler to noop.
>
> Ceph Setup:
> Replication set to 1, using 2 SSD OSDs and 1 SSD for the journal. All are Samsung 840 EVO in JBOD mode on a single server.
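To make the queue-depth-32 re-test suggested at the top of this thread concrete, a fio invocation along the lines below would keep roughly 32 writes in flight against /dev/rbd0 (the device used in the dd test further down). This is only a sketch: the job name and runtime are placeholders, and an asynchronous engine such as libaio with direct I/O is assumed, since the sync engine used in the earlier fio run always operates at an effective queue depth of 1.

    fio --name=qd32-test --filename=/dev/rbd0 --rw=write --bs=4k \
        --ioengine=libaio --direct=1 --iodepth=32 --numjobs=1 \
        --runtime=60 --time_based --group_reporting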
> Configuration:
>
> [global]
> fsid = 979f32fc-6f31-43b0-832f-29fcc4c5a648
> mon_initial_members = ceph1
> mon_host = 10.99.10.118
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> osd_pool_default_size = 1
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 250
> osd_pool_default_pgp_num = 250
> debug_lockdep = 0/0
> debug_context = 0/0
> debug_crush = 0/0
> debug_buffer = 0/0
> debug_timer = 0/0
> debug_filer = 0/0
> debug_objecter = 0/0
> debug_rados = 0/0
> debug_rbd = 0/0
> debug_journaler = 0/0
> debug_objectcatcher = 0/0
> debug_client = 0/0
> debug_osd = 0/0
> debug_optracker = 0/0
> debug_objclass = 0/0
> debug_filestore = 0/0
> debug_journal = 0/0
> debug_ms = 0/0
> debug_monc = 0/0
> debug_tp = 0/0
> debug_auth = 0/0
> debug_finisher = 0/0
> debug_heartbeatmap = 0/0
> debug_perfcounter = 0/0
> debug_asok = 0/0
> debug_throttle = 0/0
> debug_mon = 0/0
> debug_paxos = 0/0
> debug_rgw = 0/0
>
> [client]
> rbd_cache = true
>
> Client:
> Ubuntu 14.04 with 16 cores @ 2.53 GHz and 24 GB RAM
>
> Results:
>
> rados bench -p rbd -b 4096 -t 16 10 write
> Maintaining 16 concurrent writes of 4096 bytes for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_ubuntucompute_3931
>  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat     avg lat
>    0       0         0         0         0         0         -           0
>    1      16      6370      6354   24.8124   24.8203   0.00221  0.00251512
>    2      16     11618     11602   22.6536      20.5  0.001025  0.00275493
>    3      16     16889     16873   21.9637   20.5898  0.001288  0.00281797
>    4      16     17310     17294    16.884   1.64453  0.054066  0.00365805
>    5      16     17695     17679    13.808   1.50391  0.001451  0.00444409
>    6      16     18127     18111   11.7868    1.6875  0.001463  0.00527521
>    7      16     21647     21631   12.0669     13.75  0.001601   0.0051773
>    8      16     28056     28040   13.6872   25.0352  0.005268  0.00456353
>    9      16     28947     28931    12.553   3.48047   0.06647  0.00494762
>   10      16     29346     29330   11.4536   1.55859  0.001341  0.00542312
> Total time run:         10.077931
> Total writes made:      29347
> Write size:             4096
> Bandwidth (MB/sec):     11.375
> Stddev Bandwidth:       10.5124
> Max bandwidth (MB/sec): 25.0352
> Min bandwidth (MB/sec): 0
> Average Latency:        0.00548729
> Stddev Latency:         0.0169545
> Max latency:            0.249019
> Min latency:            0.000748
>
> ceph -s
>     cluster 979f32fc-6f31-43b0-832f-29fcc4c5a648
>      health HEALTH_OK
>      monmap e1: 1 mons at {ceph1=10.99.10.118:6789/0}, election epoch 1, quorum 0 ceph1
>      osdmap e30: 2 osds: 2 up, 2 in
>       pgmap v255: 250 pgs, 1 pools, 92136 kB data, 23035 objects
>             77068 kB used, 929 GB / 931 GB avail
>                  250 active+clean
>   client io 11347 kB/s wr, 2836 op/s
>
> iostat
> device:     tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
> sda        6.00        0.00      112.00         0       448
> sdb     3985.50        0.00    21048.00         0     84192
> sdd      414.50        0.00    14083.00         0     56332
> sdc      415.00        0.00    10944.00         0     43776
>
> where
> sdb - journal
> sdc, sdd - OSDs
>
> dd output:
> dd if=/dev/zero of=/dev/rbd0 bs=4k count=25000 oflag=direct
> 25000+0 records in
> 25000+0 records out
> 102400000 bytes (102 MB) copied, 23.0863 s, 4.4 MB/s
>
> Here performance has increased from 1 MB/s to 4.4 MB/s, but that is not what I was expecting.
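As a rough cross-check against the serial-latency explanation earlier in the thread: with oflag=direct and bs=4k, dd issues one 4 KB write at a time, so this run is bounded by per-write round-trip latency rather than by SSD throughput. Approximately:

    4.4 MB/s / 4 KB per write ~ 1100 writes/s, i.e. about 0.9 ms per round trip

which is in line with the roughly 1-2 ms per-write latencies in the fio results that follow.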
> fio with 4k writes and 2 threads:
>
> journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.1.3
> Starting 2 processes
> Jobs: 2 (f=2): [WW] [1.4% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:27m:25s]
> journal-test: (groupid=0, jobs=2): err= 0: pid=4077: Sat Mar 7 02:50:45 2015
>   write: io=292936KB, bw=3946.1KB/s, iops=986, runt= 74236msec
>     clat (usec): min=645, max=16855K, avg=2023.56, stdev=88071.07
>      lat (usec): min=645, max=16855K, avg=2023.97, stdev=88071.07
>     clat percentiles (usec):
>      |  1.00th=[  884],  5.00th=[ 1192], 10.00th=[ 1304], 20.00th=[ 1448],
>      | 30.00th=[ 1512], 40.00th=[ 1560], 50.00th=[ 1592], 60.00th=[ 1624],
>      | 70.00th=[ 1656], 80.00th=[ 1704], 90.00th=[ 1752], 95.00th=[ 1816],
>      | 99.00th=[ 1928], 99.50th=[ 1992], 99.90th=[ 2160], 99.95th=[ 2288],
>      | 99.99th=[39168]
>     bw (KB /s): min=   54, max= 3568, per=64.10%, avg=2529.43, stdev=315.56
>     lat (usec) : 750=0.07%, 1000=2.53%
>     lat (msec) : 2=96.96%, 4=0.43%, 50=0.01%, >=2000=0.01%
>   cpu          : usr=0.51%, sys=2.04%, ctx=73550, majf=0, minf=93
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=73234/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
>   WRITE: io=292936KB, aggrb=3946KB/s, minb=3946KB/s, maxb=3946KB/s, mint=74236msec, maxt=74236msec
>
> Disk stats (read/write):
>   rbd0: ios=186/73232, merge=0/0, ticks=120/109676, in_queue=143448, util=100.00%
>
> How can I improve the performance of 4k writes? Will adding more nodes improve this?
>
> Thanks for any help

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com