The numbers are very low. I would first benchmark the system without the VM client, using an rbd 4k test such as:
rbd bench-write image01 --pool=rbd --io-threads=32 --io-size 4096
--io-pattern rand --rbd_cache=false
-------- Original message --------
From: kevin parrikar <kevin.parker092@xxxxxxxxx>
Date: 07/01/2017 05:48 (GMT+02:00)
To: Christian Balzer <chibi@xxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release
I really need some help here :(
I replaced all the 7.2k rpm SAS disks with new Samsung 840 EVO 512GB SSDs, with no separate journal disk. Both OSD nodes now have 2 SSDs each, with a replica count of 2.
The total number of OSD processes in the cluster is 4, all on SSD.
But throughput has gone down from 1.4 MB/s to 1.3 MB/s for 4k writes, and for 4M writes it has gone down from 140 MB/s to 126 MB/s.
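As a rough sanity check on these figures, the throughput can be converted into write operations per second. This is only a sketch: it assumes "MB/s" means MiB/s (the post does not say), and the function name `iops_from_throughput` is made up for illustration.

```python
# Convert a reported write throughput into operations per second.
# Assumption: "MB/s" is read as MiB/s (1024 * 1024 bytes); the post does not say.
BYTES_PER_MIB = 1024 * 1024


def iops_from_throughput(mib_per_s: float, io_size_bytes: int) -> float:
    """Return I/O operations per second for a given throughput and I/O size."""
    return mib_per_s * BYTES_PER_MIB / io_size_bytes


if __name__ == "__main__":
    # Figures quoted in the message above.
    for label, mib_per_s, io_size in [
        ("4k writes, before", 1.4, 4096),
        ("4k writes, after", 1.3, 4096),
        ("4M writes, before", 140.0, 4 * BYTES_PER_MIB),
        ("4M writes, after", 126.0, 4 * BYTES_PER_MIB),
    ]:
        print(f"{label}: {iops_from_throughput(mib_per_s, io_size):.1f} IOPS")
```

Under that assumption the 4k results work out to roughly 330 to 360 write operations per second, both before and after the move to SSDs.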
Now atop no longer shows the OSD devices as 100% busy.
However, I can see both ceph-osd processes in atop with 53% and 47% disk utilization.
PID RDDSK WRDSK WCANCL DSK CMD 1/2
20771 0K 648.8M 0K 53% ceph-osd
19547 0K 576.7M 0K 47% ceph-osd
atop inside virtual machine:[4 CPU/3Gb RAM]
DSK | vdc | busy 96% | read 0 | write 256 | KiB/r 0 | KiB/w 512 | MBr/s 0.00 | MBw/s 128.00 | avq 7.96 | avio 3.77 ms |
Both Guest and Host are using deadline I/O scheduler
OSD disks(ssd) utilization from atop
DSK | sdc | busy 6% | read 0 | write 517 | KiB/r 0 | KiB/w 293 | MBr/s 0.00 | MBw/s 148.18 | avq 9.44 | avio 0.12 ms |
DSK | sdd | busy 5% | read 0 | write 336 | KiB/r 0 | KiB/w 292 | MBr/s 0.00 | MBw/s 96.12 | avq 7.62 | avio 0.15 ms |
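The atop DSK lines can be cross-checked by multiplying the write count by the average write size and comparing the result with the reported MBw/s. The sketch below is illustrative only: `parse_atop_dsk` and `implied_mib_per_s` are hypothetical helper names, and it assumes the write count covers a one-second interval, which the vdc line is consistent with.

```python
def parse_atop_dsk(line: str) -> dict:
    """Parse one atop 'DSK | ...' line into a dict of field name -> value string."""
    cells = [c.strip() for c in line.split("|") if c.strip()]
    # cells[0] is the 'DSK' label and cells[1] is the device name.
    fields = {"device": cells[1]}
    for cell in cells[2:]:
        name, value = cell.split(None, 1)
        fields[name] = value
    return fields


def implied_mib_per_s(fields: dict) -> float:
    """Write count times average KiB per write, expressed in MiB.

    Assumption: the 'write' count covers a one-second interval.
    """
    return int(fields["write"]) * float(fields["KiB/w"]) / 1024


if __name__ == "__main__":
    lines = [
        "DSK | vdc | busy 96% | read 0 | write 256 | KiB/r 0 | KiB/w 512 | MBr/s 0.00 | MBw/s 128.00 | avq 7.96 | avio 3.77 ms |",
        "DSK | sdc | busy 6% | read 0 | write 517 | KiB/r 0 | KiB/w 293 | MBr/s 0.00 | MBw/s 148.18 | avq 9.44 | avio 0.12 ms |",
        "DSK | sdd | busy 5% | read 0 | write 336 | KiB/r 0 | KiB/w 292 | MBr/s 0.00 | MBw/s 96.12 | avq 7.62 | avio 0.15 ms |",
    ]
    for line in lines:
        f = parse_atop_dsk(line)
        print(f["device"], f["busy"], f["avio"],
              f"reported {f['MBw/s']} MBw/s, implied {implied_mib_per_s(f):.2f}")
```

Read this way, the guest disk is 96% busy at about 3.77 ms per request, while the OSD SSDs absorb a similar volume at 5-6% busy and 0.12-0.15 ms per request.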
Queue Depth of OSD disks
cat /sys/block/sdd/device//queue_depth
256
Virtual Machine Configuration:
</disk>
<disk type='network' device='disk'>
<driver name='qemu' type='raw' cache='writeback'/>
<auth username='compute'>
<secret type='ceph' uuid='a5d0dd94-57c4-ae55-ffe0-7e3732a24455'/>
</auth>
<source protocol='rbd' name='volumes/volume-449da0e7-6223-457c-b2c6-b5e112099212'>
<host name='172.16.1.8' port='6789'/>
<host name='172.16.1.11' port='6789'/>
<host name='172.16.1.12' port='6789'/>
</source>
<target dev='vdb' bus='virtio'/>
<serial>449da0e7-6223-457c-b2c6-b5e112099212</serial>
<address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>
ceph.conf (cat /etc/ceph/ceph.conf):
[global]
fsid = c4e1a523-9017-492e-9c30-8350eba1bd51
mon_initial_members = node-16 node-30 node-31
mon_host = 172.16.1.11 172.16.1.12 172.16.1.8
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
log_to_syslog_level = info
log_to_syslog = True
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 64
public_network = 172.16.1.0/24
log_to_syslog_facility = LOG_LOCAL0
osd_journal_size = 2048
auth_supported = cephx
osd_pool_default_pgp_num = 64
osd_mkfs_type = xfs
cluster_network = 172.16.1.0/24
osd_recovery_max_active = 1
osd_max_backfills = 1
[client]
rbd_cache_writethrough_until_flush = True
rbd_cache = True
[client.radosgw.gateway]
rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_frontends = fastcgi socket_port=9000 socket_host=127.0.0.1
rgw_socket_path = /tmp/radosgw.sock
rgw_keystone_revocation_interval = 1000000
Any guidance on where to look for issues.
Regards,
Kevin
On Fri, Jan 6, 2017 at 4:42 PM, kevin parrikar <kevin.parker092@xxxxxxxxx> wrote:
Thanks Christian for your valuable comments; each comment is a new learning for me.
Please see inline.
On Fri, Jan 6, 2017 at 9:32 AM, Christian Balzer <chibi@xxxxxxx> wrote:
Hello,
On Fri, 6 Jan 2017 08:40:36 +0530 kevin parrikar wrote:
> Hello All,
>
> I have set up a Ceph cluster based on the 0.94.6 release on 2 servers, each with
> an 80GB Intel S3510 and 2x 3TB 7.2k SATA disks, 16 CPUs, 24G RAM,
> which are connected to a 10G switch with a replica of 2 [I will add 3 more
> servers to the cluster] and 3 separate monitor nodes which are VMs.
>
I'd go to the latest hammer, this version has a lethal cache-tier bug if
you should decide to try that.
80GB Intel DC S3510s are a) slow and b) have only 0.3 DWPD.
You're going to wear those out quickly and, if not replaced in time, lose
data.
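To make the endurance point concrete, here is a back-of-the-envelope sketch of what 0.3 DWPD means for an 80GB drive. It assumes decimal units (1 GB = 1000 MB) and uses the 104M/s journal write rate reported further down in this thread. The function names are invented for the example.

```python
def daily_write_budget_gb(capacity_gb: float, dwpd: float) -> float:
    """Rated writes per day: drive capacity times drive-writes-per-day."""
    return capacity_gb * dwpd


def seconds_to_use_budget(budget_gb: float, write_mb_per_s: float) -> float:
    """Time to write the daily budget at a sustained rate (1 GB = 1000 MB assumed)."""
    return budget_gb * 1000 / write_mb_per_s


if __name__ == "__main__":
    budget = daily_write_budget_gb(80, 0.3)       # 80GB drive, 0.3 DWPD
    seconds = seconds_to_use_budget(budget, 104)  # journal rate seen in dstat
    print(f"daily budget: {budget:.0f} GB")
    print(f"used up after {seconds / 60:.1f} minutes at 104 MB/s")
```

In other words, the rated daily budget is about 24 GB, which a journal writing at the rate seen during the dd test would consume in under four minutes.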
2 HDDs give you a theoretical speed of something like 300MB/s sustained;
when used as OSDs I'd expect the usual 50-60MB/s per OSD due to
seeks, journal (file system) and leveldb overheads.
Which perfectly matches your results.
Hmmmm, that makes sense; it's hitting the 7.2k rpm OSDs' peak write speed. I was under the assumption that the flush from SSD journal to OSD would happen slowly at a later time, and hence I could use slower and cheaper disks for OSDs. But in practice it looks like the many articles on the internet that talk about a faster journal with slower OSDs don't seem to be correct.
Will adding more OSD disks per node improve the overall performance?
I can add 4 more disks to each node, but all are 7.2k rpm disks. I am expecting some kind of parallel writes on these disks to magically improve performance :D
This is my second experiment with Ceph; last time I gave up and purchased another costly solution from a vendor. But this time I am determined to fix all issues and bring up a solid cluster.
Last time the cluster was giving a throughput of around 900kbps for 1G writes from a virtual machine, and now things have improved: it's giving 1.4 Mbps, but that is still far slower than the target of 24Mbps.
Expecting to make some progress with the help of experts here :)
> rbd_cache is enabled in the configuration, XFS filesystem, LSI 92465-4i RAID
> card with 512MB cache [SSD is in writeback mode with BBU]
>
>
> Before installing Ceph, I tried to check the max throughput of the Intel 3500 80G
> SSD using a block size of 4M [I read somewhere that Ceph uses 4M objects] and
> it was giving 220mbps {dd if=/dev/zero of=/dev/sdb bs=4M count=1000
> oflag=direct}
>
Irrelevant, sustained sequential writes will be limited by what your OSDs
(HDDs) can sustain.
> *Observation:*
> Now the cluster is up and running, and from the VM I am trying to write a 4G
> file to its volume using dd if=/dev/zero of=/dev/sdb bs=4M count=1000
> oflag=direct. It takes around 39 seconds to write.
>
> During this time the SSD journal was showing disk writes of 104M on both the
> Ceph servers (dstat sdb) and the compute node a network transfer rate of ~110M
> on its 10G storage interface (dstat -nN eth2).
>
As I said, sounds about right.
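The arithmetic behind "sounds about right" can be sketched as follows. This is a simplified model, not a statement of how Ceph schedules writes: it assumes 4 HDD OSDs (2 servers with 2 disks each, as described above), that every client byte is written once per replica, and that load spreads evenly. The helper names are hypothetical.

```python
def dd_throughput_mib_s(block_mib: float, count: int, seconds: float) -> float:
    """Throughput of a dd run: total MiB written divided by elapsed seconds."""
    return block_mib * count / seconds


def estimated_client_write_mb_s(osd_count: int, per_osd_mb_s: float, replicas: int) -> float:
    """Aggregate OSD write bandwidth shared across replicas.

    Simplifying assumption: each client byte is written `replicas` times
    and the load is spread evenly over all OSDs.
    """
    return osd_count * per_osd_mb_s / replicas


if __name__ == "__main__":
    measured = dd_throughput_mib_s(4, 1000, 39)  # bs=4M count=1000 in 39 s
    low = estimated_client_write_mb_s(4, 50, 2)   # 50 MB/s per OSD
    high = estimated_client_write_mb_s(4, 60, 2)  # 60 MB/s per OSD
    print(f"measured: {measured:.1f} MiB/s")
    print(f"estimated range: {low:.0f}-{high:.0f} MB/s")
```

The dd run works out to a little over 100 MiB/s, which falls inside the 100-120 MB/s range this model gives for 50-60MB/s per OSD.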
>
> my questions are:
>
>
> - Is this the best throughput Ceph can offer, or can anything in my
> environment be optimised to get more performance? [iperf shows a max
> throughput of 9.8Gbits/s]
>
Not your network.
Watch your nodes with atop and you will note that your HDDs are maxed out.
>
>
> - I guess the network/SSD is underutilized and can handle more writes;
> how can this be improved to send more data over the network to the SSD?
>
As jiajia wrote, a cache-tier might give you some speed boosts.
But with those SSDs I'd advise against it, both too small and too low
endurance.
>
>
> - The rbd kernel module wasn't loaded on the compute node; I loaded it manually
> using "modprobe" and later destroyed/re-created the VMs, but this does not give
> any performance boost. So librbd and RBD are equally fast?
>
Irrelevant and confusing.
Your VMs will use one or the other depending on how they are configured.
>
>
> - Samsung EVO 840 512GB shows a throughput of 500Mbps for 4M writes [dd
> if=/dev/zero of=/dev/sdb bs=4M count=1000 oflag=direct] and for 4KB it was
> as fast as the Intel S3500 80GB. Does changing my SSD from the Intel
> S3500 100GB to the Samsung 840 500GB make any performance difference here, just
> because the Samsung 840 EVO is faster for 4M writes? Can Ceph utilize this extra
> speed?
>
Those SSDs would be an even worse choice for endurance/reliability
reasons, though their larger size offsets that a bit.
Unless you have a VERY good understanding and data on how much your
cluster is going to write, pick at the very least SSDs with 3+ DWPD
endurance like the DC S3610s.
In very lightly loaded cases DC S3520 with 1 DWPD may be OK, but again, you
need to know what you're doing here.
Christian
> Can somebody help me understand this better.
>
> Regards,
> Kevin
Christian Balzer Network/Systems Engineer
chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com