Re: Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel


 



Update:

Hardware:
Upgraded RAID controller to LSI MegaRAID 9341 (12 Gb/s)
3x Samsung 840 EVO - showed 45K IOPS in an fio test with 7 threads and 4k block size in JBOD mode
CPU: 16 cores @ 2.27 GHz
RAM: 24 GB
NIC: 10 Gbit/s with under 1 ms latency; iperf shows 9.18 Gbps between host and client

Software
Ubuntu 14.04 with stock kernel 3.13
Upgraded from firefly to giant [ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)]
Changed the file system to btrfs and the I/O scheduler to noop.

Ceph Setup
Replication set to 1, using 2 SSD OSDs and 1 SSD for the journal. All are Samsung 840 EVO in JBOD mode on a single server.

Configuration:
[global]
fsid = 979f32fc-6f31-43b0-832f-29fcc4c5a648
mon_initial_members = ceph1
mon_host = 10.99.10.118
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 1
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 250
osd_pool_default_pgp_num = 250
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

[client]
rbd_cache = true

Client
Ubuntu 14.04 with 16 cores @ 2.53 GHz and 24 GB RAM

Results

rados bench -p rbd -b 4096 -t 16 10 write
 Maintaining 16 concurrent writes of 4096 bytes for up to 10 seconds or 0 objects
 Object prefix: benchmark_data_ubuntucompute_3931
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16      6370      6354   24.8124   24.8203   0.00221  0.00251512
     2      16     11618     11602   22.6536      20.5  0.001025  0.00275493
     3      16     16889     16873   21.9637   20.5898  0.001288  0.00281797
     4      16     17310     17294    16.884   1.64453  0.054066  0.00365805
     5      16     17695     17679    13.808   1.50391  0.001451  0.00444409
     6      16     18127     18111   11.7868    1.6875  0.001463  0.00527521
     7      16     21647     21631   12.0669     13.75  0.001601  0.0051773
     8      16     28056     28040   13.6872   25.0352  0.005268  0.00456353
     9      16     28947     28931    12.553   3.48047   0.06647  0.00494762
    10      16     29346     29330   11.4536   1.55859  0.001341  0.00542312
 Total time run:         10.077931
Total writes made:      29347
Write size:             4096
Bandwidth (MB/sec):     11.375

Stddev Bandwidth:       10.5124
Max bandwidth (MB/sec): 25.0352
Min bandwidth (MB/sec): 0
Average Latency:        0.00548729
Stddev Latency:         0.0169545
Max latency:            0.249019
Min latency:            0.000748
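For 4k writes, bandwidth and IOPS are two views of the same number. A quick sanity check on the run above (all values taken from the reported output) shows both ways of computing IOPS agree:

```python
# Sanity-check the 4k rados bench numbers: at a fixed block size,
# IOPS = bandwidth / block size, and should match writes / runtime.
BLOCK_KB = 4  # 4096-byte writes

bandwidth_mb_s = 11.375   # reported Bandwidth (MB/sec)
total_writes = 29347      # reported Total writes made
runtime_s = 10.077931     # reported Total time run

iops_from_bw = bandwidth_mb_s * 1024 / BLOCK_KB
iops_from_count = total_writes / runtime_s

print(f"IOPS from bandwidth: {iops_from_bw:.0f}")    # ~2912
print(f"IOPS from op count:  {iops_from_count:.0f}")  # ~2912
```

So this run sustained roughly 2900 write IOPS, consistent with the ~2836 op/s that `ceph -s` reports below.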

ceph -s
    cluster 979f32fc-6f31-43b0-832f-29fcc4c5a648
     health HEALTH_OK
     monmap e1: 1 mons at {ceph1=10.99.10.118:6789/0}, election epoch 1, quorum 0 ceph1
     osdmap e30: 2 osds: 2 up, 2 in
      pgmap v255: 250 pgs, 1 pools, 92136 kB data, 23035 objects
            77068 kB used, 929 GB / 931 GB avail
                 250 active+clean
  client io 11347 kB/s wr, 2836 op/s

iostat
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               6.00         0.00       112.00          0        448
sdb            3985.50         0.00     21048.00          0      84192
sdd             414.50         0.00     14083.00          0      56332
sdc             415.00         0.00     10944.00          0      43776

where

sdb - journal
sdc,sdd - OSD

dd output
dd if=/dev/zero of=/dev/rbd0 bs=4k count=25000 oflag=direct
25000+0 records in
25000+0 records out
102400000 bytes (102 MB) copied, 23.0863 s, 4.4 MB/s

Performance has increased from 1 MB/s to 4.4 MB/s, but it is still not what I was expecting.
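With oflag=direct, dd issues one 4k write at a time (effective queue depth 1), so the throughput above is bounded by per-operation round-trip latency rather than device bandwidth. Working backwards from the dd output:

```python
# dd with oflag=direct is a queue-depth-1 workload: each 4k write must
# complete before the next is issued, so throughput = block_size / latency.
bytes_written = 102_400_000   # from the dd output
elapsed_s = 23.0863
ops = 25_000

throughput_mb_s = bytes_written / elapsed_s / 1e6   # ~4.4 MB/s, as dd reports
per_op_latency_ms = elapsed_s / ops * 1000          # ~0.92 ms per 4k write

print(f"{throughput_mb_s:.1f} MB/s, {per_op_latency_ms:.2f} ms per op")
```

At ~0.92 ms per synchronous 4k write, ~4.4 MB/s is the ceiling for this test regardless of how fast the SSDs are; only lower per-op latency or more parallelism can raise it.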

fio with 4k writes and 2 threads
journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 2 processes
Jobs: 2 (f=2): [WW] [1.4% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:27m:25s]
journal-test: (groupid=0, jobs=2): err= 0: pid=4077: Sat Mar  7 02:50:45 2015
  write: io=292936KB, bw=3946.1KB/s, iops=986, runt= 74236msec
    clat (usec): min=645, max=16855K, avg=2023.56, stdev=88071.07
     lat (usec): min=645, max=16855K, avg=2023.97, stdev=88071.07
    clat percentiles (usec):
     |  1.00th=[  884],  5.00th=[ 1192], 10.00th=[ 1304], 20.00th=[ 1448],
     | 30.00th=[ 1512], 40.00th=[ 1560], 50.00th=[ 1592], 60.00th=[ 1624],
     | 70.00th=[ 1656], 80.00th=[ 1704], 90.00th=[ 1752], 95.00th=[ 1816],
     | 99.00th=[ 1928], 99.50th=[ 1992], 99.90th=[ 2160], 99.95th=[ 2288],
     | 99.99th=[39168]
    bw (KB  /s): min=   54, max= 3568, per=64.10%, avg=2529.43, stdev=315.56
    lat (usec) : 750=0.07%, 1000=2.53%
    lat (msec) : 2=96.96%, 4=0.43%, 50=0.01%, >=2000=0.01%
  cpu          : usr=0.51%, sys=2.04%, ctx=73550, majf=0, minf=93
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=73234/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=292936KB, aggrb=3946KB/s, minb=3946KB/s, maxb=3946KB/s, mint=74236msec, maxt=74236msec

Disk stats (read/write):
  rbd0: ios=186/73232, merge=0/0, ticks=120/109676, in_queue=143448, util=100.00%

How can I improve 4k write performance? Will adding more nodes improve this?

Thanks for any help

On Sun, Mar 1, 2015 at 3:07 AM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:

Sorry, I saw you have already tried ‘rados bench’. So, some points here.

 

1. If you are considering a write workload, I think with a total of 2 copies and a 4K workload, you should be able to get ~4K IOPS (assuming it is hitting the disk, not memstore).

 

2. You have 9 OSDs, and if you created only one pool with only 450 PGs, you should try increasing that and see whether it brings any improvement.
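The PG sizing rule of thumb commonly cited for this era of Ceph can be sketched as follows (a sketch, not an official formula; the per-OSD target of 100 is the usual heuristic):

```python
# Rule-of-thumb PG count for a pool: target ~100 PGs per OSD,
# divided by the replica count, rounded up to the next power of two.
def suggested_pg_count(num_osds: int, pool_size: int, per_osd: int = 100) -> int:
    raw = num_osds * per_osd / pool_size
    power = 1
    while power < raw:
        power *= 2
    return power

# The 9-OSD, size-2 cluster from the original post:
print(suggested_pg_count(9, 2))  # 9 * 100 / 2 = 450 -> rounded up to 512
```

By this heuristic the 450-PG pool would round up to 512, so the pool is only slightly undersized; the bigger wins are likely elsewhere.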

 

3. Also, you ran rados bench with a very low queue depth; try increasing it, maybe to 32/64.
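Why queue depth matters follows from Little's law: sustained IOPS ≈ queue depth / average latency, assuming latency holds roughly constant as depth increases (it eventually won't, once something saturates):

```python
# Little's law: concurrency = throughput x latency, so for a fixed
# per-op latency, achievable IOPS scales with queue depth until the
# device or OSD saturates.
def expected_iops(queue_depth: int, avg_latency_s: float) -> float:
    return queue_depth / avg_latency_s

# At the ~5.5 ms average latency from the 4k rados bench run above:
print(expected_iops(16, 0.00548729))  # ~2916, matching the observed run
print(expected_iops(64, 0.00548729))  # ~11660 if latency held constant
```

This is why -t 16 caps the run near ~2900 IOPS; a deeper queue is needed to see what the cluster can actually sustain.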

 

4. If you are running firefly, other optimizations won’t work here. But you can add the following to your ceph.conf file, and it should give you some boost.

 

debug_lockdep = 0/0

debug_context = 0/0

debug_crush = 0/0

debug_buffer = 0/0

debug_timer = 0/0

debug_filer = 0/0

debug_objecter = 0/0

debug_rados = 0/0

debug_rbd = 0/0

debug_journaler = 0/0

debug_objectcacher = 0/0

debug_client = 0/0

debug_osd = 0/0

debug_optracker = 0/0

debug_objclass = 0/0

debug_filestore = 0/0

debug_journal = 0/0

debug_ms = 0/0

debug_monc = 0/0

debug_tp = 0/0

debug_auth = 0/0

debug_finisher = 0/0

debug_heartbeatmap = 0/0

debug_perfcounter = 0/0

debug_asok = 0/0

debug_throttle = 0/0

debug_mon = 0/0

debug_paxos = 0/0

debug_rgw = 0/0

 

5. Give us the ceph -s output and the iostat output while IO is going on.

 

Thanks & Regards

Somnath

 

 

 

From: Somnath Roy
Sent: Saturday, February 28, 2015 12:59 PM
To: 'mad Engineer'; Alexandre DERUMIER
Cc: ceph-users
Subject: RE: Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel

 

I would say check with a rados tool like ceph_smalliobench/rados bench first to see how much performance these tools report. This will help you isolate any upstream issues.

Also, check with ‘iostat -xk 1’ for resource utilization. I hope you are running with a powerful enough CPU complex, since you say the network is not a bottleneck.

 

Thanks & Regards

Somnath

 

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of mad Engineer
Sent: Saturday, February 28, 2015 12:29 PM
To: Alexandre DERUMIER
Cc: ceph-users
Subject: Re: Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel

 

Reinstalled the ceph packages, and now with the memstore backend [osd objectstore = memstore] it's giving 400 KB/s. No idea where the problem is.

 

On Sun, Mar 1, 2015 at 12:30 AM, mad Engineer <themadengin33r@xxxxxxxxx> wrote:

Tried changing the scheduler from deadline to noop, upgraded to Giant with the btrfs filesystem, and downgraded the kernel from 3.16-3 to 3.16; not much difference.

 

dd if=/dev/zero of=hi bs=4k count=25000 oflag=direct

25000+0 records in

25000+0 records out

102400000 bytes (102 MB) copied, 94.691 s, 1.1 MB/s

 

Earlier, on a VMware setup, I was getting ~850 KB/s, and now even on a physical server with SSD drives it's just over 1 MB/s. I suspect some serious configuration issue.

 

Tried iperf between the 3 servers; all show 9 Gbps. Tried ICMP with different packet sizes; no fragmentation.

 

I also noticed that of the 9 OSDs, 5 are 850 EVO and 4 are 840 EVO. I don't believe this would cause this much drop in performance.

 

Thanks for any help

 

 

On Sat, Feb 28, 2015 at 6:49 PM, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:

As optimisation,

try to set ioscheduler to noop,

and also enable rbd_cache=true (it really helps with sequential writes).

But your results seem quite low; 926 kB/s with 4k is only ~200 IO/s.

Check that you don't have any big network latencies or MTU fragmentation problems.

Maybe also try benchmarking with fio, with more parallel jobs.
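A minimal fio job along those lines might look like this (a sketch; the device path, job count, and queue depth are illustrative and must match your setup):

```ini
; illustrative 4k random-write test against an RBD-backed block device;
; adjust filename, numjobs, and iodepth for your environment
[global]
ioengine=libaio
direct=1
bs=4k
time_based=1
runtime=60
group_reporting

[rbd-randwrite]
filename=/dev/rbd0
rw=randwrite
numjobs=4
iodepth=32
```

With numjobs=4 and iodepth=32 this keeps 128 IOs in flight, in contrast to the queue-depth-1 dd and fio sync-engine tests reported above.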




----- Original Message -----
From: "mad Engineer" <themadengin33r@xxxxxxxxx>
To: "Philippe Schwarz" <phil@xxxxxxxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Saturday, 28 February 2015 13:06:59
Subject: Re: Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel

Thanks for the reply, Philippe. We were using these disks in our NAS;
now it looks like I am in big trouble :-(

On Sat, Feb 28, 2015 at 5:02 PM, Philippe Schwarz <phil@xxxxxxxxxxxxxx> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 28/02/2015 12:19, mad Engineer wrote:
>> Hello All,
>>
>> I am trying ceph-firefly 0.80.8
>> (69eaad7f8308f21573c604f121956e64679a52a7) with 9 OSDs, all Samsung
>> SSD 850 EVO, on 3 servers with 24 GB RAM and 16 cores @ 2.27 GHz,
>> Ubuntu 14.04 LTS with the 3.16-3 kernel. All are connected to 10G
>> ports with maximum MTU. There are no extra disks for journaling and
>> no separate network for replication and data transfer. All 3 nodes
>> also host a monitor process. The operating system runs on a SATA
>> disk.
>>
>> When doing a sequential benchmark using "dd" on RBD, mounted on the
>> client as ext4, it takes 110 s to write 100 MB of data at an average
>> speed of 926 kB/s.
>>
>> time dd if=/dev/zero of=hello bs=4k count=25000 oflag=direct
>> 25000+0 records in
>> 25000+0 records out
>> 102400000 bytes (102 MB) copied, 110.582 s, 926 kB/s
>>
>> real    1m50.585s
>> user    0m0.106s
>> sys     0m2.233s
>>
>> Doing this directly on the SSD mount point shows:
>>
>> time dd if=/dev/zero of=hello bs=4k count=25000 oflag=direct
>> 25000+0 records in
>> 25000+0 records out
>> 102400000 bytes (102 MB) copied, 1.38567 s, 73.9 MB/s
>>
>> OSDs are on XFS with these extra mount arguments:
>>
>> rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M
>>
>> ceph.conf
>>
>> [global]
>> fsid = 7d889081-7826-439c-9fe5-d4e57480d9be
>> mon_initial_members = ceph1, ceph2, ceph3
>> mon_host = 10.99.10.118,10.99.10.119,10.99.10.120
>> auth_cluster_required = cephx
>> auth_service_required = cephx
>> auth_client_required = cephx
>> filestore_xattr_use_omap = true
>> osd_pool_default_size = 2
>> osd_pool_default_min_size = 2
>> osd_pool_default_pg_num = 450
>> osd_pool_default_pgp_num = 450
>> max_open_files = 131072
>>
>> [osd]
>> osd_mkfs_type = xfs
>> osd_op_threads = 8
>> osd_disk_threads = 4
>> osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
>>
>>
>> On our traditional storage with full SAS disks, the same "dd"
>> completes in 16 s with an average write speed of 6 MB/s.
>>
>> Rados bench:
>>
>> rados bench -p rbd 10 write
>> Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
>> Object prefix: benchmark_data_ceph1_2977
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>     0       0         0         0         0         0         -         0
>>     1      16        94        78   311.821       312  0.041228  0.140132
>>     2      16       192       176   351.866       392  0.106294  0.175055
>>     3      16       275       259   345.216       332  0.076795  0.166036
>>     4      16       302       286   285.912       108  0.043888  0.196419
>>     5      16       395       379    303.11       372  0.126033  0.207488
>>     6      16       501       485   323.242       424  0.125972  0.194559
>>     7      16       621       605   345.621       480  0.194155  0.183123
>>     8      16       730       714   356.903       436  0.086678  0.176099
>>     9      16       814       798   354.572       336  0.081567  0.174786
>>    10      16       832       816   326.313        72  0.037431  0.182355
>>    11      16       833       817   297.013         4  0.533326  0.182784
>> Total time run:         11.489068
>> Total writes made:      833
>> Write size:             4194304
>> Bandwidth (MB/sec):     290.015
>>
>> Stddev Bandwidth:       175.723
>> Max bandwidth (MB/sec): 480
>> Min bandwidth (MB/sec): 0
>> Average Latency:        0.220582
>> Stddev Latency:         0.343697
>> Max latency:            2.85104
>> Min latency:            0.035381
>>
>> Our ultimate aim is to replace the existing SAN with ceph, but for
>> that it should meet a minimum of 8000 IOPS. Can anyone help with
>> this? The OSDs are SSDs, the CPU has a good clock speed, and the
>> backend network is good, but we are still not able to extract the
>> full capability of the SSD disks.
>>
>>
>>
>> Thanks,
>
> Hi, I'm new to ceph, so don't consider my words as holy truth.
>
> It seems that Samsung 840s (so I assume 850s too) are crappy for ceph:
>
> MTBF :
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-November/044258.html
> Bandwidth:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-December/045247.html
>
> And according to a confirmed user of Ceph/Proxmox, Samsung SSDs should
> be avoided if possible in ceph storage.
>
> Apart from that, it seems there was a limitation in ceph in using the
> complete bandwidth available in SSDs; but I think with less than
> 1 MB/s you haven't hit this limit.
>
> I remind you that I'm not a ceph guru (far from it, indeed), so feel
> free to disagree; I'm on the way to improving my knowledge.
>
> Best regards.
>
>
>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
>
> iEYEARECAAYFAlTxp0UACgkQlhqCFkbqHRb5+wCgrXCM3VsnVE6PCbbpOmQXCXbr
> 8u0An2BUgZWismSK0PxbwVDOD5+/UWik
> =0o0v
> -----END PGP SIGNATURE-----

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 

 






