Hi,
setting the pool application helped - the performance is not skewed anymore (i.e. the SSD pool is faster than the HDD pool).
However, latency when using more threads is still very high.
I am getting 9.91 Gbits/sec when testing with iperf.
Not sure what else I should check.
As always, your help will be greatly appreciated.
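For reference, the iperf number above came from a plain node-to-node test; a minimal sketch of that kind of run, assuming iperf3 and a placeholder target IP on the public network:

# on the receiving node
iperf3 -s
# on the sending node, 30-second single-stream test (add -P 4 to try parallel streams)
iperf3 -c 10.10.30.1 -t 30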
using 1 thread (-t 1)
Total writes made: 1627
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 108.387
Stddev Bandwidth: 9.75056
Max bandwidth (MB/sec): 128
Min bandwidth (MB/sec): 92
Average IOPS: 27
Stddev IOPS: 2
Max IOPS: 32
Min IOPS: 23
Average Latency(s): 0.0369025
Stddev Latency(s): 0.0161718
Max latency(s): 0.258894
Min latency(s): 0.0133281
using 32 threads (-t 32)
Total writes made: 2348
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 143.244
Stddev Bandwidth: 124.265
Max bandwidth (MB/sec): 420
Min bandwidth (MB/sec): 0
Average IOPS: 35
Stddev IOPS: 31
Max IOPS: 105
Min IOPS: 0
Average Latency(s): 0.892837
Stddev Latency(s): 1.97054
Max latency(s): 14.0602
Min latency(s): 0.0250363
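To see whether specific OSDs are behind that 14 s tail, a couple of things worth trying (a sketch; osd.2 is just an example id, and the daemon command has to run on the host that carries that OSD):

# per-OSD commit/apply latency as the cluster sees it
ceph osd perf
# slowest recent ops on one OSD, via its admin socket
ceph daemon osd.2 dump_historic_ops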
On 24 January 2018 at 15:03, Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx> wrote:
ceph osd pool application enable XXX rbd
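You can confirm it stuck with (pool name is a placeholder):

ceph osd pool application get ssdpool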
-----Original Message-----
From: Steven Vacaroaia [mailto:stef97@xxxxxxxxx]
Sent: Wednesday, 24 January 2018 19:47
To: David Turner
Cc: ceph-users
Subject: Re: [ceph-users] Luminous - bad performance
Hi,
I have bonded the public NICs and added 2 more monitors (running on 2 of the 3 OSD hosts). This seems to improve things, but I still have high latency. Also, performance of the SSD pool is worse than the HDD pool, which is very confusing.
The SSD pool is using one Toshiba PX05SMB040Y per server (for a total of 3 OSDs), while the HDD pool is using 2 Seagate ST600MM0006 disks per server (for a total of 6 OSDs).
Note: I have also disabled C-states in the BIOS and added
"intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll" to GRUB.
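To double-check those settings actually took effect after reboot, something like (a sketch; the sysfs path assumes the intel_idle driver):

cat /proc/cmdline
cat /sys/module/intel_idle/parameters/max_cstate
cpupower idle-info | head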
Any hints/suggestions will be greatly appreciated
[root@osd04 ~]# ceph status
cluster:
id: 37161a51-a159-4895-a7fd-3b0d857f1b66
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
application not enabled on 2 pool(s)
mon osd02 is low on available space
services:
mon: 3 daemons, quorum osd01,osd02,mon01
mgr: mon01(active)
osd: 9 osds: 9 up, 9 in
flags noscrub,nodeep-scrub
tcmu-runner: 6 daemons active
data:
pools: 2 pools, 228 pgs
objects: 50384 objects, 196 GB
usage: 402 GB used, 3504 GB / 3906 GB avail
pgs: 228 active+clean
io:
client: 46061 kB/s rd, 852 B/s wr, 15 op/s rd, 0 op/s wr
[root@osd04 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-9 4.50000 root ssds
-10 1.50000 host osd01-ssd
6 hdd 1.50000 osd.6 up 1.00000 1.00000
-11 1.50000 host osd02-ssd
7 hdd 1.50000 osd.7 up 1.00000 1.00000
-12 1.50000 host osd04-ssd
8 hdd 1.50000 osd.8 up 1.00000 1.00000
-1 2.72574 root default
-3 1.09058 host osd01
0 hdd 0.54529 osd.0 up 1.00000 1.00000
4 hdd 0.54529 osd.4 up 1.00000 1.00000
-5 1.09058 host osd02
1 hdd 0.54529 osd.1 up 1.00000 1.00000
3 hdd 0.54529 osd.3 up 1.00000 1.00000
-7 0.54459 host osd04
2 hdd 0.27229 osd.2 up 1.00000 1.00000
5 hdd 0.27229 osd.5 up 1.00000 1.00000
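Note that osd.6-8 under the ssds root still report device class hdd, so any CRUSH rule that selects by device class would not see them as SSDs. If your ssdpool rule relies on the class, a sketch of reclassifying one of them (Luminous syntax):

ceph osd crush rm-device-class osd.6
ceph osd crush set-device-class ssd osd.6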
rados bench -p ssdpool 300 -t 32 write --no-cleanup && rados bench -p ssdpool 300 -t 32 seq
Total time run: 302.058832
Total writes made: 4100
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 54.2941
Stddev Bandwidth: 70.3355
Max bandwidth (MB/sec): 252
Min bandwidth (MB/sec): 0
Average IOPS: 13
Stddev IOPS: 17
Max IOPS: 63
Min IOPS: 0
Average Latency(s): 2.35655
Stddev Latency(s): 4.4346
Max latency(s): 29.7027
Min latency(s): 0.045166
rados bench -p rbd 300 -t 32 write --no-cleanup && rados bench -p rbd 300 -t 32 seq
Total time run: 301.428571
Total writes made: 8753
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 116.154
Stddev Bandwidth: 71.5763
Max bandwidth (MB/sec): 320
Min bandwidth (MB/sec): 0
Average IOPS: 29
Stddev IOPS: 17
Max IOPS: 80
Min IOPS: 0
Average Latency(s): 1.10189
Stddev Latency(s): 1.80203
Max latency(s): 15.0715
Min latency(s): 0.0210309
[root@osd04 ~]# ethtool -k gth0
Features for gth0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: on [fixed]
tx-checksum-sctp: on
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: on [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
On 22 January 2018 at 12:09, Steven Vacaroaia <stef97@xxxxxxxxx> wrote:
Hi David,
I noticed the public interface of the server I am running the test from is heavily used, so I will bond that one too.
I doubt, though, that this explains the poor performance.
Thanks for your advice
Steven
On 22 January 2018 at 12:02, David Turner <drakonstein@xxxxxxxxx>
wrote:
I'm not speaking to anything other than your configuration.
"I am using 2 x 10 GB bonded ( BONDING_OPTS="mode=4 miimon=100
xmit_hash_policy=1 lacp_rate=1") for cluster and 1 x 1GB for public"
It might not be a bad idea for you to forgo the public network
on the 1Gb interfaces and either put everything on one network or use
VLANs on the 10Gb connections. I lean more towards that in particular
because your public network doesn't have a bond on it. Just as a note,
communication between the OSDs and the MONs is all done on the public
network. If that interface goes down, then the OSDs are likely to be
marked down/out from your cluster. I'm a fan of VLANs, but if you don't
have the equipment or expertise to go that route, then just using the
same subnet for public and private is a decent way to go.
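If you go the VLAN route on CentOS 7 network-scripts, a sketch of what a tagged subinterface on your existing bond could look like (the VLAN ID and address are hypothetical):

# /etc/sysconfig/network-scripts/ifcfg-bond0.100
DEVICE=bond0.100
VLAN=yes
ONBOOT=yes
BOOTPROTO=none
IPADDR=10.10.30.11
PREFIX=24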
On Mon, Jan 22, 2018 at 11:37 AM Steven Vacaroaia
<stef97@xxxxxxxxx> wrote:
I did test with rados bench ... here are the results:
rados bench -p ssdpool 300 -t 12 write --no-cleanup && rados bench -p ssdpool 300 -t 12 seq
Total time run: 300.322608
Total writes made: 10632
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 141.608
Stddev Bandwidth: 74.1065
Max bandwidth (MB/sec): 264
Min bandwidth (MB/sec): 0
Average IOPS: 35
Stddev IOPS: 18
Max IOPS: 66
Min IOPS: 0
Average Latency(s): 0.33887
Stddev Latency(s): 0.701947
Max latency(s): 9.80161
Min latency(s): 0.015171
Total time run: 300.829945
Total reads made: 10070
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 133.896
Average IOPS: 33
Stddev IOPS: 14
Max IOPS: 68
Min IOPS: 3
Average Latency(s): 0.35791
Max latency(s): 4.68213
Min latency(s): 0.0107572
rados bench -p scbench256 300 -t 12 write --no-cleanup && rados bench -p scbench256 300 -t 12 seq
Total time run: 300.747004
Total writes made: 10239
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 136.181
Stddev Bandwidth: 75.5
Max bandwidth (MB/sec): 272
Min bandwidth (MB/sec): 0
Average IOPS: 34
Stddev IOPS: 18
Max IOPS: 68
Min IOPS: 0
Average Latency(s): 0.352339
Stddev Latency(s): 0.72211
Max latency(s): 9.62304
Min latency(s): 0.00936316
hints = 1
Total time run: 300.610761
Total reads made: 7628
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 101.5
Average IOPS: 25
Stddev IOPS: 11
Max IOPS: 61
Min IOPS: 0
Average Latency(s): 0.472321
Max latency(s): 15.636
Min latency(s): 0.0188098
On 22 January 2018 at 11:34, Steven Vacaroaia
<stef97@xxxxxxxxx> wrote:
sorry ... sent the message too soon
Here is more info:
Vendor Id : SEAGATE
Product Id : ST600MM0006
State : Online
Disk Type : SAS,Hard Disk Device
Capacity : 558.375 GB
Power State : Active
(SSD is in slot 0)
megacli -LDGetProp -Cache -LALL -a0
Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone, Direct, No Write Cache if bad BBU
Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive, Direct, No Write Cache if bad BBU
Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive, Direct, No Write Cache if bad BBU
Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive, Direct, No Write Cache if bad BBU
Adapter 0-VD 4(target id: 4): Cache Policy:WriteBack, ReadAdaptive, Direct, No Write Cache if bad BBU
Adapter 0-VD 5(target id: 5): Cache Policy:WriteBack, ReadAdaptive, Direct, No Write Cache if bad BBU
[root@osd01 ~]# megacli -LDGetProp -DskCache -LALL -a0
Adapter 0-VD 0(target id: 0): Disk Write Cache : Disabled
Adapter 0-VD 1(target id: 1): Disk Write Cache : Disk's Default
Adapter 0-VD 2(target id: 2): Disk Write Cache : Disk's Default
Adapter 0-VD 3(target id: 3): Disk Write Cache : Disk's Default
Adapter 0-VD 4(target id: 4): Disk Write Cache : Disk's Default
Adapter 0-VD 5(target id: 5): Disk Write Cache : Disk's Default
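Worth noting: VD 0 (the SSD) is the only one set to WriteThrough with its disk cache disabled, while the HDD VDs get WriteBack; that alone can make the SSD pool look slower. If the BBU is healthy and the SSD has power-loss protection, a sketch of aligning VD 0 with the others (standard MegaCli syntax; worth trying on one host first):

megacli -LDSetProp WB -L0 -a0
megacli -LDSetProp -EnDskCache -L0 -a0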
CPU: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
CentOS 7, kernel 3.10.0-693.11.6.el7.x86_64
sysctl -p
net.ipv4.tcp_sack = 0
net.core.netdev_budget = 600
net.ipv4.tcp_window_scaling = 1
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_syncookies = 0
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 20000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
vm.min_free_kbytes = 262144
vm.swappiness = 0
vm.vfs_cache_pressure = 100
fs.suid_dumpable = 0
kernel.core_uses_pid = 1
kernel.msgmax = 65536
kernel.msgmnb = 65536
kernel.randomize_va_space = 1
kernel.sysrq = 0
kernel.pid_max = 4194304
fs.file-max = 100000
ceph.conf
public_network = 10.10.30.0/24
cluster_network = 192.168.0.0/24
osd_op_num_threads_per_shard = 2
osd_op_num_shards = 25
osd_pool_default_size = 2
osd_pool_default_min_size = 1 # Allow writing 1 copy in a degraded state
osd_pool_default_pg_num = 256
osd_pool_default_pgp_num = 256
osd_crush_chooseleaf_type = 1
osd_scrub_load_threshold = 0.01
osd_scrub_min_interval = 137438953472
osd_scrub_max_interval = 137438953472
osd_deep_scrub_interval = 137438953472
osd_max_scrubs = 16
osd_op_threads = 8
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
[mon]
mon_allow_pool_delete = true
[osd]
osd_heartbeat_grace = 20
osd_heartbeat_interval = 5
bluestore_block_db_size = 16106127360
bluestore_block_wal_size = 1073741824
[osd.6]
host = osd01
osd_journal = /dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.1d58775a-5019-42ea-8149-a126f51a2501
crush_location = root=ssds host=osd01-ssd
[osd.7]
host = osd02
osd_journal = /dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.683dc52d-5d69-4ff0-b5d9-b17056a55681
crush_location = root=ssds host=osd02-ssd
[osd.8]
host = osd04
osd_journal = /dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.bd7c0088-b724-441e-9b88-9457305c541d
crush_location = root=ssds host=osd04-ssd
On 22 January 2018 at 11:29, Steven Vacaroaia
<stef97@xxxxxxxxx> wrote:
Hi David,
Yes, I meant no separate partitions for WAL and DB.
I am using 2 x 10 Gb NICs bonded (BONDING_OPTS="mode=4 miimon=100 xmit_hash_policy=1 lacp_rate=1") for cluster and 1 x 1 Gb for public.
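The negotiated LACP state of the bond can be checked with (the interface name is an assumption):

cat /proc/net/bonding/bond0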
Disks are:
Vendor Id : TOSHIBA
Product Id : PX05SMB040Y
State : Online
Disk Type : SAS,Solid State Device
Capacity : 372.0 GB
On 22 January 2018 at 11:24, David Turner
<drakonstein@xxxxxxxxx> wrote:
Disk models, other hardware information
including CPU, network config? You say you're using Luminous, but then
say journal on same device. I'm assuming you mean that you just have
the bluestore OSD configured without a separate WAL or DB partition?
Any more specifics you can give will be helpful.
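A quick way to confirm the layout is to look at the OSD's data directory; a sketch (the OSD id is an example):

ls -l /var/lib/ceph/osd/ceph-0/ | grep block
# colocated bluestore shows only a 'block' symlink;
# separate devices would add 'block.db' and/or 'block.wal'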
On Mon, Jan 22, 2018 at 11:20 AM Steven
Vacaroaia <stef97@xxxxxxxxx> wrote:
Hi,
I'll appreciate it if you can provide some guidance / suggestions regarding performance issues on a test cluster (3 x DELL R620, 1 Enterprise SSD, 3 x 600 GB Enterprise HDD, 8 cores, 64 GB RAM).
I created 2 pools (replication factor 2), one with only SSDs and the other with only HDDs (journal on same disk for both).
The performance is quite similar, although I was expecting it to be at least 5 times better.
No issues noticed using atop.
What should I check / tune?
Many thanks
Steven
HDD based pool (journal on the same disk)
ceph osd pool get scbench256 all
size: 2
min_size: 1
crash_replay_interval: 0
pg_num: 256
pgp_num: 256
crush_rule: replicated_rule
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
fast_read: 0
rbd bench --io-type write image1 --pool=scbench256
bench type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 46816 46836.46 191842139.78
2 90658 45339.11 185709011.80
3 133671 44540.80 182439126.08
4 177341 44340.36 181618100.14
5 217300 43464.04 178028704.54
6 259595 42555.85 174308767.05
elapsed: 6 ops: 262144 ops/sec: 42694.50 bytes/sec: 174876688.23
fio /home/cephuser/write_256.fio
write-4M: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 1.12.0
Jobs: 1 (f=1): [r(1)] [100.0% done] [66284KB/0KB/0KB /s] [16.6K/0/0 iops] [eta 00m:00s]
fio /home/cephuser/write_256.fio
write-4M: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 1.12.0
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/14464KB/0KB /s] [0/3616/0 iops] [eta 00m:00s]
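The job file itself isn't shown; a sketch consistent with the fio output above (rbd engine, 4k blocks, iodepth 32; the client and image names are assumptions):

[write-4M]
ioengine=rbd
clientname=admin
pool=scbench256
rbdname=image1
rw=write
bs=4k
iodepth=32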
SSD based pool
ceph osd pool get ssdpool all
size: 2
min_size: 1
crash_replay_interval: 0
pg_num: 128
pgp_num: 128
crush_rule: ssdpool
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
fast_read: 0
rbd -p ssdpool create --size 52100 image2
rbd bench --io-type write image2 --pool=ssdpool
bench type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 42412 41867.57 171489557.93
2 78343 39180.86 160484805.88
3 118082 39076.48 160057256.16
4 155164 38683.98 158449572.38
5 192825 38307.59 156907885.84
6 230701 37716.95 154488608.16
elapsed: 7 ops: 262144 ops/sec: 36862.89 bytes/sec: 150990387.29
[root@osd01 ~]# fio /home/cephuser/write_256.fio
write-4M: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 1.12.0
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/20224KB/0KB /s] [0/5056/0 iops] [eta 00m:00s]
fio /home/cephuser/write_256.fio
write-4M: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 1.12.0
Jobs: 1 (f=1): [r(1)] [100.0% done] [76096KB/0KB/0KB /s] [19.3K/0/0 iops] [eta 00m:00s]
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com