Re: Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

Maged Mokhtar <mmokhtar@xxxxxxxxxxx> · Sat, 07 Jan 2017 09:53:48 +0200

The numbers are very low. I would first benchmark the system without the VM client, using an rbd 4k random-write test such as:

rbd bench-write image01 --pool=rbd --io-threads=32 --io-size 4096 \
    --io-pattern rand --rbd_cache=false

-------- Original message --------
From: kevin parrikar <kevin.parker092@xxxxxxxxx> 
Date: 07/01/2017  05:48  (GMT+02:00) 
To: Christian Balzer <chibi@xxxxxxx> 
Cc: ceph-users@xxxxxxxxxxxxxx 
Subject: Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release 

I really need some help here :(

I replaced all the 7.2K RPM SAS disks with new Samsung 840 EVO 512GB SSDs, with no separate journal disk. Both OSD nodes now have 2 SSDs each, with a replica count of 2.
The cluster has 4 OSD processes in total, all on SSD.

But throughput has gone down from 1.4 MB/s to 1.3 MB/s for 4k writes, and for 4M writes it has gone down from 140 MB/s to 126 MB/s.
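
SSD datasheets usually quote small-block performance in IOPS, so it can help to convert the 4k figures above. The sketch below assumes "MB/s" means 10^6 bytes per second and that every write is exactly 4096 bytes; the function name is illustrative only.

```python
def throughput_to_iops(mb_per_s: float, io_size_bytes: int = 4096) -> float:
    """Convert a throughput in MB/s (10^6 bytes/s, an assumption) to I/Os per second."""
    return mb_per_s * 1_000_000 / io_size_bytes

# 4k write throughput reported above, before and after the move to SSD OSDs.
for label, mb_s in (("HDD OSDs + SSD journal", 1.4), ("SSD-only OSDs", 1.3)):
    print(f"{label}: {mb_s} MB/s is about {throughput_to_iops(mb_s):.0f} IOPS")
```

Under those assumptions both setups deliver only a few hundred 4k write IOPS from the VM.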

Now atop no longer shows the OSD devices as 100% busy.

However, I can see both ceph-osd processes in atop with 53% and 47% disk utilization.

  PID    RDDSK     WRDSK    WCANCL    DSK    CMD        1/2
20771       0K    648.8M        0K    53%    ceph-osd
19547       0K    576.7M        0K    47%    ceph-osd

OSD disks(ssd) utilization from atop

DSK |  sdc | busy  6%  | read  0  | write  517  | KiB/r   0  | KiB/w  293 | MBr/s 0.00  | MBw/s 148.18  | avq   9.44  | avio 0.12 ms  |

DSK |  sdd | busy   5% | read   0 | write   336 | KiB/r   0  | KiB/w   292 | MBr/s 0.00 | MBw/s  96.12  | avq     7.62  | avio 0.15 ms  |

Queue depth of the OSD disks:
 cat /sys/block/sdd/device/queue_depth
256

atop inside virtual machine:[4 CPU/3Gb RAM]
DSK |   vdc  | busy     96%  | read     0  | write  256  | KiB/r   0  | KiB/w  512  | MBr/s   0.00  | MBw/s 128.00  | avq    7.96  | avio 3.77 ms  |
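
The atop lines above are internally consistent if each sample covers one second: busy% is then roughly the number of writes multiplied by avio. A minimal sketch of that cross-check follows. The 1-second interval is an assumption (the post does not state it), and the helper name is made up for illustration.

```python
def busy_percent(ios: int, avio_ms: float, interval_s: float = 1.0) -> float:
    """Approximate device busy time as (number of I/Os x average service time) / interval."""
    return 100.0 * ios * avio_ms / (interval_s * 1000.0)

# (writes per sample, avio in ms, busy% reported by atop), copied from the output above.
samples = {
    "sdc (OSD SSD)": (517, 0.12, 6),
    "sdd (OSD SSD)": (336, 0.15, 5),
    "vdc (guest)": (256, 3.77, 96),
}

for name, (writes, avio_ms, reported) in samples.items():
    estimate = busy_percent(writes, avio_ms)
    print(f"{name}: estimated {estimate:.1f}% busy, atop reports {reported}%")
```

If the assumption holds, the guest disk is busy nearly all the time because each 512 KiB request takes about 3.8 ms, while the OSD SSDs complete their writes in 0.12 to 0.15 ms and sit mostly idle.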

Both Guest and Host are using deadline I/O scheduler

Virtual Machine Configuration:

 </disk>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <auth username='compute'>
        <secret type='ceph' uuid='a5d0dd94-57c4-ae55-ffe0-7e3732a24455'/>
      </auth>
      <source protocol='rbd' name='volumes/volume-449da0e7-6223-457c-b2c6-b5e112099212'>
        <host name='172.16.1.8' port='6789'/>
        <host name='172.16.1.11' port='6789'/>
        <host name='172.16.1.12' port='6789'/>
      </source>
      <target dev='vdb' bus='virtio'/>
      <serial>449da0e7-6223-457c-b2c6-b5e112099212</serial>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </disk>

ceph.conf

 cat /etc/ceph/ceph.conf

[global]
fsid = c4e1a523-9017-492e-9c30-8350eba1bd51
mon_initial_members = node-16 node-30 node-31
mon_host = 172.16.1.11 172.16.1.12 172.16.1.8
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
log_to_syslog_level = info
log_to_syslog = True
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 64
public_network = 172.16.1.0/24
log_to_syslog_facility = LOG_LOCAL0
osd_journal_size = 2048
auth_supported = cephx
osd_pool_default_pgp_num = 64
osd_mkfs_type = xfs
cluster_network = 172.16.1.0/24
osd_recovery_max_active = 1
osd_max_backfills = 1

[client]
rbd_cache_writethrough_until_flush = True
rbd_cache = True

[client.radosgw.gateway]
rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_frontends = fastcgi socket_port=9000 socket_host=127.0.0.1
rgw_socket_path = /tmp/radosgw.sock
rgw_keystone_revocation_interval = 1000000

Any guidance on where to look for issues would be appreciated.

Regards,
Kevin

On Fri, Jan 6, 2017 at 4:42 PM, kevin parrikar <kevin.parker092@xxxxxxxxx> wrote:
Thanks, Christian, for your valuable comments; each comment is a new learning for me.
Please see inline.

On Fri, Jan 6, 2017 at 9:32 AM, Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

On Fri, 6 Jan 2017 08:40:36 +0530 kevin parrikar wrote:

> Hello All,
>
> I have set up a Ceph cluster based on the 0.94.6 release on 2 servers, each with
> an 80GB Intel S3510 and 2x 3TB 7.2K SATA disks, 16 CPUs and 24G RAM,
> connected to a 10G switch, with a replica count of 2 [I will add 3 more
> servers to the cluster], and 3 separate monitor nodes which are VMs.
>

I'd go to the latest Hammer; this version has a lethal cache-tier bug if
you should decide to try that.
80GB Intel DC S3510s are a) slow and b) have only 0.3 DWPD.
You're going to wear those out quickly and, if they are not replaced in time, lose
data.
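
To put the 0.3 DWPD figure in numbers, here is a rough sketch of the daily write budget of an 80GB journal SSD. It assumes DWPD means full-drive writes per day over the warranty period, and it uses the ~104 MB/s journal write rate reported in the original post quoted later in this message. The function names are illustrative.

```python
def daily_write_budget_gb(capacity_gb: float, dwpd: float) -> float:
    """Data that may be written per day without exceeding the rated endurance."""
    return capacity_gb * dwpd

def minutes_until_budget_spent(budget_gb: float, write_mb_per_s: float) -> float:
    """Minutes of sustained writing at write_mb_per_s that use up one day's budget."""
    return budget_gb * 1000 / write_mb_per_s / 60

budget = daily_write_budget_gb(capacity_gb=80, dwpd=0.3)  # 80GB DC S3510 at 0.3 DWPD
print(f"Daily budget: {budget:.0f} GB")
print(f"Spent after {minutes_until_budget_spent(budget, 104):.1f} minutes at 104 MB/s")
```

That works out to 24 GB per day, which a journal absorbing full-speed sequential writes would use up in under four minutes.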

2 HDDs give you a theoretical speed of something like 300MB/s sustained;
when used as OSDs I'd expect the usual 50-60MB/s per OSD due to
seeks, journal (file system) and leveldb overheads.
Which perfectly matches your results.
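
The match can be checked with back-of-the-envelope arithmetic. The sketch below assumes 4 HDD OSDs in total (2 nodes with 2 HDDs each), that every client byte is written once per replica to the HDDs, and that writes spread evenly across the OSDs. The measured figure comes from the dd run in the original post (1000 x 4 MiB in about 39 seconds).

```python
MIB = 1024 * 1024

def expected_client_mb_s(num_osds: int, per_osd_mb_s: float, replicas: int) -> float:
    """Aggregate OSD write rate divided by the number of copies written per client byte."""
    return num_osds * per_osd_mb_s / replicas

def measured_mb_s(block_bytes: int, count: int, seconds: float) -> float:
    """Throughput of a dd-style run in MB/s (10^6 bytes per second)."""
    return block_bytes * count / seconds / 1_000_000

low = expected_client_mb_s(num_osds=4, per_osd_mb_s=50, replicas=2)
high = expected_client_mb_s(num_osds=4, per_osd_mb_s=60, replicas=2)
observed = measured_mb_s(block_bytes=4 * MIB, count=1000, seconds=39)
print(f"expected {low:.0f}-{high:.0f} MB/s, observed about {observed:.0f} MB/s")
```

The observed rate falls inside the range that the 50-60MB/s per-OSD estimate predicts for this layout.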

Hmm, that makes sense; it's hitting the peak write speed of the 7.2K RPM OSDs. I was under the assumption that flushing from the SSD journal to the OSD would happen slowly at a later time, and hence I could use slower and cheaper disks for the OSDs. But in practice it looks like the many articles on the internet that talk about a faster journal and slower OSDs don't seem to be correct.

Will adding more OSD disks per node improve the overall performance?

I can add 4 more disks to each node, but all are 7.2K RPM disks. I am expecting some kind of parallel writes across these disks to magically improve performance :D

This is my second experiment with Ceph; last time I gave up and purchased another costly solution from a vendor. But this time I am determined to fix all issues and bring up a solid cluster.
Last time the cluster was giving a throughput of around 900kbps for 1G writes from a virtual machine, and now things have improved: it's giving 1.4 Mbps, but that is still far slower than the target of 24Mbps.

Expecting to make some progress with the help of experts here :)

> rbd_cache is enabled in the configuration, XFS filesystem, LSI 92465-4i RAID
> card with 512MB cache [SSD is in writeback mode with BBU]
>
> Before installing Ceph, I tried to check the max throughput of the Intel 3500 80G
> SSD using a block size of 4M [I read somewhere that Ceph uses 4M objects] and
> it was giving 220 MB/s {dd if=/dev/zero of=/dev/sdb bs=4M count=1000
> oflag=direct}
>

Irrelevant, sustained sequential writes will be limited by what your OSDs
(HDDs) can sustain.

> *Observation:*
> Now the cluster is up and running, and from the VM I am trying to write a 4G
> file to its volume using dd if=/dev/zero of=/dev/sdb bs=4M count=1000
> oflag=direct. It takes around 39 seconds to write.
>
> During this time the SSD journal was showing disk writes of 104M on both
> Ceph servers (dstat sdb) and the compute node a network transfer rate of ~110M
> on its 10G storage interface (dstat -nN eth2).
>

As I said, sounds about right.

> My questions are:
>
>    - Is this the best throughput Ceph can offer, or can anything in my
>    environment be optimised to get more performance? [iperf shows a max
>    throughput of 9.8 Gbits/s]
>

Not your network.
Watch your nodes with atop and you will note that your HDDs are maxed out.
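
A quick calculation shows how far the link is from saturation. This sketch takes the ~110 MB/s that dstat showed on the storage interface and the 9.8 Gbit/s iperf result from the question, and assumes decimal units throughout; the function name is illustrative.

```python
def link_utilization(observed_mb_s: float, link_gbit_s: float) -> float:
    """Fraction of the link's capacity in use, with 1 Gbit/s = 125 MB/s (decimal units)."""
    link_mb_s = link_gbit_s * 1000 / 8
    return observed_mb_s / link_mb_s

util = link_utilization(observed_mb_s=110, link_gbit_s=9.8)
print(f"Link capacity: {9.8 * 1000 / 8:.0f} MB/s, in use: {util:.1%}")
```

At roughly a tenth of the measured link capacity, the network has plenty of headroom.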

>

>

>    - I guess Network/SSD is under utilized and it can handle more writes

>    how can this be improved to send more data over network to ssd?

>

As jiajia wrote, a cache-tier might give you some speed boosts.
But with those SSDs I'd advise against it, both too small and too low
endurance.

>    - The rbd kernel module wasn't loaded on the compute node; I loaded it manually
>    using "modprobe" and later destroyed/re-created the VMs, but this does not give
>    any performance boost. So librbd and RBD are equally fast?
>

Irrelevant and confusing.
Your VMs will use one or the other depending on how they are configured.

>    - The Samsung 840 EVO 512GB shows a throughput of 500 MB/s for 4M writes [dd
>    if=/dev/zero of=/dev/sdb bs=4M count=1000 oflag=direct], and for 4KB writes it was
>    as fast as the Intel S3500 80GB. Does changing my SSD from the Intel
>    S3500 100GB to the Samsung 840 500GB make any performance difference here, just
>    because the Samsung 840 EVO is faster for 4M writes? Can Ceph utilize this extra
>    speed?
>

Those SSDs would be an even worse choice for endurance/reliability
reasons, though their larger size offsets that a bit.
Unless you have a VERY good understanding and data on how much your
cluster is going to write, pick at the very least SSDs with 3+ DWPD
endurance like the DC S3610s.
In very lightly loaded cases a DC S3520 with 1 DWPD may be OK, but again, you
need to know what you're doing here.

Christian

>
> Can somebody help me understand this better?
>
> Regards,
> Kevin

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
