ceph osd pool application enable XXX rbd

-----Original Message-----
From: Steven Vacaroaia [mailto:stef97@xxxxxxxxx]
Sent: Wednesday, 24 January 2018 19:47
To: David Turner
Cc: ceph-users
Subject: Re: Luminous - bad performance

Hi,

I have bonded the public NICs and added 2 more monitors (running on 2 of the 3 OSD hosts). This seems to improve things, but I still have high latency.

Also, performance of the SSD pool is worse than the HDD pool, which is very confusing. The SSD pool is using one Toshiba PX05SMB040Y per server (for a total of 3 OSDs), while the HDD pool is using 2 Seagate ST600MM0006 disks per server (for a total of 6 OSDs).

Note I have also disabled C-states in the BIOS and added "intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll" to GRUB.

Any hints/suggestions will be greatly appreciated.

[root@osd04 ~]# ceph status
  cluster:
    id:     37161a51-a159-4895-a7fd-3b0d857f1b66
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            application not enabled on 2 pool(s)
            mon osd02 is low on available space

  services:
    mon:         3 daemons, quorum osd01,osd02,mon01
    mgr:         mon01(active)
    osd:         9 osds: 9 up, 9 in
                 flags noscrub,nodeep-scrub
    tcmu-runner: 6 daemons active

  data:
    pools:   2 pools, 228 pgs
    objects: 50384 objects, 196 GB
    usage:   402 GB used, 3504 GB / 3906 GB avail
    pgs:     228 active+clean

  io:
    client:   46061 kB/s rd, 852 B/s wr, 15 op/s rd, 0 op/s wr

[root@osd04 ~]# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
 -9       4.50000 root ssds
-10       1.50000     host osd01-ssd
  6   hdd 1.50000         osd.6          up  1.00000 1.00000
-11       1.50000     host osd02-ssd
  7   hdd 1.50000         osd.7          up  1.00000 1.00000
-12       1.50000     host osd04-ssd
  8   hdd 1.50000         osd.8          up  1.00000 1.00000
 -1       2.72574 root default
 -3       1.09058     host osd01
  0   hdd 0.54529         osd.0          up  1.00000 1.00000
  4   hdd 0.54529         osd.4          up  1.00000 1.00000
 -5       1.09058     host osd02
  1   hdd 0.54529         osd.1          up  1.00000 1.00000
  3   hdd 0.54529         osd.3          up  1.00000 1.00000
 -7       0.54459     host osd04
  2   hdd 0.27229         osd.2          up  1.00000 1.00000
  5   hdd 0.27229         osd.5          up  1.00000 1.00000

rados bench -p ssdpool 300 -t 32 write --no-cleanup && rados bench -p ssdpool 300 -t 32 seq

Total time run:         302.058832
Total writes made:      4100
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     54.2941
Stddev Bandwidth:       70.3355
Max bandwidth (MB/sec): 252
Min bandwidth (MB/sec): 0
Average IOPS:           13
Stddev IOPS:            17
Max IOPS:               63
Min IOPS:               0
Average Latency(s):     2.35655
Stddev Latency(s):      4.4346
Max latency(s):         29.7027
Min latency(s):         0.045166

rados bench -p rbd 300 -t 32 write --no-cleanup && rados bench -p rbd 300 -t 32 seq

Total time run:         301.428571
Total writes made:      8753
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     116.154
Stddev Bandwidth:       71.5763
Max bandwidth (MB/sec): 320
Min bandwidth (MB/sec): 0
Average IOPS:           29
Stddev IOPS:            17
Max IOPS:               80
Min IOPS:               0
Average Latency(s):     1.10189
Stddev Latency(s):      1.80203
Max latency(s):         15.0715
Min latency(s):         0.0210309

[root@osd04 ~]# ethtool -k gth0
Features for gth0:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: on [fixed]
        tx-checksum-sctp: on
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: on [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]

On 22 January 2018 at 12:09, Steven Vacaroaia <stef97@xxxxxxxxx> wrote:

Hi David,

I noticed that the public interface of the server I am running the test from is heavily used, so I will bond that one too. I doubt, though, that this explains the poor performance.

Thanks for your advice
Steven

On 22 January 2018 at 12:02, David Turner <drakonstein@xxxxxxxxx> wrote:

I'm not speaking to anything other than your configuration: "I am using 2 x 10 GB bonded (BONDING_OPTS="mode=4 miimon=100 xmit_hash_policy=1 lacp_rate=1") for cluster and 1 x 1GB for public."

It might not be a bad idea for you to forgo the public network on the 1Gb interfaces and either put everything on one network or use VLANs on the 10Gb connections. I lean more towards that in particular because your public network doesn't have a bond on it. Just as a note, communication between the OSDs and the MONs is all done on the public network. If that interface goes down, then the OSDs are likely to be marked down/out in your cluster.

I'm a fan of VLANs, but if you don't have the equipment or expertise to go that route, then just using the same subnet for public and private is a decent way to go.
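As a rough sketch of the "everything on one network" option described above (purely illustrative, not the poster's actual config): it assumes the bonded 10Gb interfaces are addressed in 10.10.30.0/24, the subnet already used as the public network in the ceph.conf quoted later in this thread.

# Sketch: carry all Ceph traffic on the subnet of the bonded 10Gb link.
[global]
public_network = 10.10.30.0/24
# Either omit cluster_network entirely, or point it at the same subnet:
cluster_network = 10.10.30.0/24

With both networks on the same subnet (or cluster_network left unset), OSD replication, heartbeats, and client/MON traffic all share the bonded link, removing the unbonded 1Gb interface as a single point of failure.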
On Mon, Jan 22, 2018 at 11:37 AM Steven Vacaroaia <stef97@xxxxxxxxx> wrote:

I did test with rados bench .. here are the results:

rados bench -p ssdpool 300 -t 12 write --no-cleanup && rados bench -p ssdpool 300 -t 12 seq

Total time run:         300.322608
Total writes made:      10632
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     141.608
Stddev Bandwidth:       74.1065
Max bandwidth (MB/sec): 264
Min bandwidth (MB/sec): 0
Average IOPS:           35
Stddev IOPS:            18
Max IOPS:               66
Min IOPS:               0
Average Latency(s):     0.33887
Stddev Latency(s):      0.701947
Max latency(s):         9.80161
Min latency(s):         0.015171

Total time run:       300.829945
Total reads made:     10070
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   133.896
Average IOPS:         33
Stddev IOPS:          14
Max IOPS:             68
Min IOPS:             3
Average Latency(s):   0.35791
Max latency(s):       4.68213
Min latency(s):       0.0107572

rados bench -p scbench256 300 -t 12 write --no-cleanup && rados bench -p scbench256 300 -t 12 seq

Total time run:         300.747004
Total writes made:      10239
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     136.181
Stddev Bandwidth:       75.5
Max bandwidth (MB/sec): 272
Min bandwidth (MB/sec): 0
Average IOPS:           34
Stddev IOPS:            18
Max IOPS:               68
Min IOPS:               0
Average Latency(s):     0.352339
Stddev Latency(s):      0.72211
Max latency(s):         9.62304
Min latency(s):         0.00936316

hints = 1
Total time run:       300.610761
Total reads made:     7628
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   101.5
Average IOPS:         25
Stddev IOPS:          11
Max IOPS:             61
Min IOPS:             0
Average Latency(s):   0.472321
Max latency(s):       15.636
Min latency(s):       0.0188098

On 22 January 2018 at 11:34, Steven Vacaroaia <stef97@xxxxxxxxx> wrote:

Sorry .. sent the message too soon. Here is more info.

Vendor Id   : SEAGATE
Product Id  : ST600MM0006
State       : Online
Disk Type   : SAS,Hard Disk Device
Capacity    : 558.375 GB
Power State : Active

(SSD is in slot 0)

megacli -LDGetProp -Cache -LALL -a0

Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone, Direct, No Write Cache if bad BBU
Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive, Direct, No Write Cache if bad BBU
Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive, Direct, No Write Cache if bad BBU
Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive, Direct, No Write Cache if bad BBU
Adapter 0-VD 4(target id: 4): Cache Policy:WriteBack, ReadAdaptive, Direct, No Write Cache if bad BBU
Adapter 0-VD 5(target id: 5): Cache Policy:WriteBack, ReadAdaptive, Direct, No Write Cache if bad BBU

[root@osd01 ~]# megacli -LDGetProp -DskCache -LALL -a0

Adapter 0-VD 0(target id: 0): Disk Write Cache : Disabled
Adapter 0-VD 1(target id: 1): Disk Write Cache : Disk's Default
Adapter 0-VD 2(target id: 2): Disk Write Cache : Disk's Default
Adapter 0-VD 3(target id: 3): Disk Write Cache : Disk's Default
Adapter 0-VD 4(target id: 4): Disk Write Cache : Disk's Default
Adapter 0-VD 5(target id: 5): Disk Write Cache : Disk's Default

CPU: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
OS:  CentOS 7, kernel 3.10.0-693.11.6.el7.x86_64

sysctl -p

net.ipv4.tcp_sack = 0
net.core.netdev_budget = 600
net.ipv4.tcp_window_scaling = 1
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_syncookies = 0
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 20000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
vm.min_free_kbytes = 262144
vm.swappiness = 0
vm.vfs_cache_pressure = 100
fs.suid_dumpable = 0
kernel.core_uses_pid = 1
kernel.msgmax = 65536
kernel.msgmnb = 65536
kernel.randomize_va_space = 1
kernel.sysrq = 0
kernel.pid_max = 4194304
fs.file-max = 100000

ceph.conf

public_network = 10.10.30.0/24
cluster_network = 192.168.0.0/24
osd_op_num_threads_per_shard = 2
osd_op_num_shards = 25
osd_pool_default_size = 2
osd_pool_default_min_size = 1  # Allow writing 1 copy in a degraded state
osd_pool_default_pg_num = 256
osd_pool_default_pgp_num = 256
osd_crush_chooseleaf_type = 1
osd_scrub_load_threshold = 0.01
osd_scrub_min_interval = 137438953472
osd_scrub_max_interval = 137438953472
osd_deep_scrub_interval = 137438953472
osd_max_scrubs = 16
osd_op_threads = 8
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

[mon]
mon_allow_pool_delete = true

[osd]
osd_heartbeat_grace = 20
osd_heartbeat_interval = 5
bluestore_block_db_size = 16106127360
bluestore_block_wal_size = 1073741824

[osd.6]
host = osd01
osd_journal = /dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.1d58775a-5019-42ea-8149-a126f51a2501
crush_location = root=ssds host=osd01-ssd

[osd.7]
host = osd02
osd_journal = /dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.683dc52d-5d69-4ff0-b5d9-b17056a55681
crush_location = root=ssds host=osd02-ssd

[osd.8]
host = osd04
osd_journal = /dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.bd7c0088-b724-441e-9b88-9457305c541d
crush_location = root=ssds host=osd04-ssd

On 22 January 2018 at 11:29, Steven Vacaroaia <stef97@xxxxxxxxx> wrote:

Hi David,

Yes, I meant no separate partitions for WAL and DB.

I am using 2 x 10 GB bonded (BONDING_OPTS="mode=4 miimon=100 xmit_hash_policy=1 lacp_rate=1") for cluster and 1 x 1GB for public.

Disks are:

Vendor Id   : TOSHIBA
Product Id  : PX05SMB040Y
State       : Online
Disk Type   : SAS,Solid State Device
Capacity    : 372.0 GB

On 22 January 2018 at 11:24, David Turner <drakonstein@xxxxxxxxx> wrote:

Disk models, other hardware information including CPU, network config? You say you're using Luminous, but then say journal on same device. I'm assuming you mean that you just have the bluestore OSD configured without a separate WAL or DB partition? Any more specifics you can give will be helpful.
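One way to confirm that layout from the cluster side is to query the OSD metadata. This is a rough sketch rather than the poster's procedure: the exact metadata key names vary between Ceph releases, and osd.6 is simply one of the SSD-backed OSDs from the ceph.conf above.

# Show the object store type and whether the bluestore DB/WAL share the data device
ceph osd metadata 6 | egrep 'osd_objectstore|bluefs|rotational'

On a Luminous bluestore OSD this should report osd_objectstore as "bluestore" and indicate whether a separate, non-rotational DB device is in use.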
On Mon, Jan 22, 2018 at 11:20 AM Steven Vacaroaia <stef97@xxxxxxxxx> wrote:

Hi,

I'd appreciate it if you could provide some guidance / suggestions regarding performance issues on a test cluster (3 x DELL R620, 1 enterprise SSD, 3 x 600 GB enterprise HDD, 8 cores, 64 GB RAM).

I created 2 pools (replication factor 2), one with only SSDs and the other with only HDDs (journal on the same disk for both). The performance is quite similar, although I was expecting it to be at least 5 times better. No issues noticed using atop.

What should I check / tune?

Many thanks
Steven

HDD-based pool (journal on the same disk):

ceph osd pool get scbench256 all

size: 2
min_size: 1
crash_replay_interval: 0
pg_num: 256
pgp_num: 256
crush_rule: replicated_rule
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
fast_read: 0

rbd bench --io-type write image1 --pool=scbench256

bench  type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
  SEC       OPS   OPS/SEC     BYTES/SEC
    1     46816  46836.46  191842139.78
    2     90658  45339.11  185709011.80
    3    133671  44540.80  182439126.08
    4    177341  44340.36  181618100.14
    5    217300  43464.04  178028704.54
    6    259595  42555.85  174308767.05
elapsed: 6  ops: 262144  ops/sec: 42694.50  bytes/sec: 174876688.23

fio /home/cephuser/write_256.fio

write-4M: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 1.12.0
Jobs: 1 (f=1): [r(1)] [100.0% done] [66284KB/0KB/0KB /s] [16.6K/0/0 iops] [eta 00m:00s]

fio /home/cephuser/write_256.fio

write-4M: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 1.12.0
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/14464KB/0KB /s] [0/3616/0 iops] [eta 00m:00s]

SSD-based pool:

ceph osd pool get ssdpool all

size: 2
min_size: 1
crash_replay_interval: 0
pg_num: 128
pgp_num: 128
crush_rule: ssdpool
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
fast_read: 0

rbd -p ssdpool create --size 52100 image2
rbd bench --io-type write image2 --pool=ssdpool

bench  type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
  SEC       OPS   OPS/SEC     BYTES/SEC
    1     42412  41867.57  171489557.93
    2     78343  39180.86  160484805.88
    3    118082  39076.48  160057256.16
    4    155164  38683.98  158449572.38
    5    192825  38307.59  156907885.84
    6    230701  37716.95  154488608.16
elapsed: 7  ops: 262144  ops/sec: 36862.89  bytes/sec: 150990387.29

[root@osd01 ~]# fio /home/cephuser/write_256.fio

write-4M: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 1.12.0
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/20224KB/0KB /s] [0/5056/0 iops] [eta 00m:00s]

fio /home/cephuser/write_256.fio

write-4M: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 1.12.0
Jobs: 1 (f=1): [r(1)] [100.0% done] [76096KB/0KB/0KB /s] [19.3K/0/0 iops] [eta 00m:00s]

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
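For readers trying to reproduce these numbers: the contents of /home/cephuser/write_256.fio are not shown in the thread, but the fio output above implies a job using the rbd engine with 4k blocks and iodepth 32. A hypothetical reconstruction of the write job follows; the pool, image, and client names are assumptions, not the poster's actual file.

# Hypothetical fio job consistent with the write run shown above (not the original file)
[write-4M]
ioengine=rbd
# assumes the client.admin keyring is readable by fio
clientname=admin
# assumed pool/image; swap in ssdpool/image2 for the SSD-pool run
pool=scbench256
rbdname=image1
rw=write
bs=4k
iodepth=32

Run it with "fio write_256.fio" from a host that has the Ceph client configuration and keyring installed.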