On Mon, Nov 20, 2017 at 2:36 PM, Christian Balzer <chibi@xxxxxxx> wrote:
On Mon, 20 Nov 2017 14:02:30 +0200 Rudi Ahlers wrote:
> We're planning on installing 12 virtual machines with some heavy loads.
>
> the SSD drives are INTEL SSDSC2BA400G4
>
Interesting, where did you find those?
Or did you have them lying around?
I've been unable to get DC S3710 SSDs for nearly a year now.
In South Africa, one of our suppliers had some in stock. They're still fairly new, about 2 months old now.
> The SATA drives are ST8000NM0055-1RM112
>
Note that these (while fast) have an internal flash cache, limiting them to
something like 0.2 DWPD.
Probably not an issue with the WAL/DB on the Intels, but something to keep
in mind.
I don't quite understand what you mean here. Could you please explain?
> Please explain your comment, "b) will find a lot of people here who don't
> approve of it."
>
Read the archives.
Converged clusters are complex, and debugging Ceph while tons of other
things are going on on the same machine is even more so.
Ok, so I have 4 physical servers and need to set up a highly redundant cluster. How else would you have done it? There is no budget for a SAN, let alone a highly available one.
> I don't have access to the switches right now, but they're new so whatever
> default config ships from factory would be active. Though iperf shows 10.5
> GBytes / 9.02 Gbits/sec throughput.
>
Didn't think it was the switches, but for completeness' sake and all that.
> What speeds would you expect?
> "Though with your setup I would have expected something faster, but NOT the
> theoretical 600MB/s 4 HDDs will do in sequential writes."
>
What I wrote.
A 7200RPM HDD, even these, cannot sustain writes much over 170MB/s, even in
the most optimal circumstances.
So your cluster can NOT exceed about 600MB/s of sustained writes, given the
effective bandwidth of 4 HDDs.
Smaller writes/reads that can be cached by RAM, DB, onboard caches on the
HDDs of course can and will be faster.
But again, you're missing the point: even if you get 600MB/s writes out of
your cluster, the number of 4k IOPS will be much more relevant to your VMs.
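(For a rough idea of what the VMs will actually see, a small random-write test with fio from inside a guest is more telling than any bandwidth figure. A minimal sketch, assuming fio is installed in the VM and the scratch file path is safe to write to; adjust size and runtime to taste:

fio --name=4krandwrite --filename=/root/fio-test.bin --size=2G \
    --bs=4k --rw=randwrite --ioengine=libaio --direct=1 \
    --iodepth=32 --runtime=60 --time_based --group_reporting

The IOPS and latency numbers it reports are what matter for VM workloads.)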
hdparm shows about 230MB/s:
root@virt2:~# hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 20250 MB in 2.00 seconds = 10134.81 MB/sec
Timing buffered disk reads: 680 MB in 3.00 seconds = 226.50 MB/sec
600MB/s would be super nice, but in reality even 400MB/s would be good enough. Would that not be achievable?
>
>
> On this, "If an OSD has no fast WAL/DB, it will drag the overall speed
> down. Verify and if so fix this and re-test.": how?
>
No idea, I don't do bluestore.
You noticed the lack of a WAL/DB for sda; go and fix it.
If in doubt, by destroying and re-creating.
And if you're looking for a less invasive procedure, check the docs and the ML
archive, but AFAIK there is nothing but re-creation at this time.
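(For what it's worth, the destroy-and-re-create route would look roughly like the following. This is an untested sketch; the OSD id 0 and the devices are placeholders taken from the ceph osd tree and the pveceph command in this thread, so double-check them against ceph-disk list before running anything:

ceph osd out 0
systemctl stop ceph-osd@0
ceph osd purge 0 --yes-i-really-mean-it
ceph-disk zap /dev/sda
pveceph createosd /dev/sda -bluestore 1 -journal_dev /dev/sde

Let the cluster settle back to HEALTH_OK before re-running the benchmark.)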
I'm using Proxmox, which set up a DB device but not a WAL device.
Christian
>
> On Mon, Nov 20, 2017 at 1:44 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> > On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:
> >
> > > Hi,
> > >
> > > Can someone please help me: how do I improve performance on our Ceph
> > > cluster?
> > >
> > > The hardware in use are as follows:
> > > 3x SuperMicro servers with the following configuration
> > > 12Core Dual XEON 2.2Ghz
> > Faster cores are better for Ceph, IMNSHO.
> > Though with main storage on HDDs, this will do.
> >
> > > 128GB RAM
> > Overkill for Ceph but I see something else below...
> >
> > > 2x 400GB Intel DC SSD drives
> > Exact model please.
> >
> > > 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
> > One hopes that's a non-SMR one.
> > Model please.
> >
> > > 1x SuperMicro DOM for Proxmox / Debian OS
> > Ah, Proxmox.
> > I'm not averse to converged, high-density, multi-role clusters myself,
> > but you:
> > a) need to know what you're doing and
> > b) will find a lot of people here who don't approve of it.
> >
> > I've avoided DOMs so far (non-hotswappable SPOF), even though the SM ones
> > look good on paper with regard to endurance and IOPS.
> > The latter is rather important for your monitors.
> >
> > > 4x port 10GbE NIC
> > > Cisco 10GbE switch.
> > >
> > Configuration would be nice for those, LACP?
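(If the plan is LACP, a typical Proxmox/Debian bond stanza in /etc/network/interfaces looks something like the sketch below. The interface names are made up, and the switch side needs a matching port-channel configured:

auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer3+4

layer3+4 hashing helps spread Ceph's many TCP connections across the links.)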
> >
> > >
> > > root@virt2:~# rados bench -p Data 10 write --no-cleanup
> > > hints = 1
> > > Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> > > 4194304 for up to 10 seconds or 0 objects
> >
> > rados bench is a limited tool, and measuring bandwidth is pointless in
> > nearly all use cases.
> > Latency is where it is at, and testing from inside a VM is more relevant
> > than synthetic tests of the storage.
> > But it is a start.
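(Still synthetic, but re-running rados bench with a 4k block size at least gets closer to what the VMs will do; something along these lines, assuming the pool is still called Data:

rados bench -p Data 30 write -b 4096 -t 16 --no-cleanup
rados bench -p Data 30 rand -t 16

Look at the average and max latency lines rather than the MB/s.)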
> >
> > > Object prefix: benchmark_data_virt2_39099
> > >   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
> > >     0       0         0         0         0         0            -           0
> > >     1      16        85        69   275.979       276     0.185576    0.204146
> > >     2      16       171       155   309.966       344    0.0625409    0.193558
> > >     3      16       243       227   302.633       288    0.0547129     0.19835
> > >     4      16       330       314   313.965       348    0.0959492    0.199825
> > >     5      16       413       397   317.565       332     0.124908    0.196191
> > >     6      16       494       478   318.633       324       0.1556    0.197014
> > >     7      15       591       576   329.109       392     0.136305    0.192192
> > >     8      16       670       654   326.965       312    0.0703808    0.190643
> > >     9      16       757       741   329.297       348     0.165211    0.192183
> > >    10      16       828       812   324.764       284    0.0935803    0.194041
> > > Total time run: 10.120215
> > > Total writes made: 829
> > > Write size: 4194304
> > > Object size: 4194304
> > > Bandwidth (MB/sec): 327.661
> > What part of this surprises you?
> >
> > With a replication of 3, you have effectively the bandwidth of your 2 SSDs
> > (for small writes, not the case here) and the bandwidth of your 4 HDDs
> > available.
> > Given overhead, other inefficiencies and the fact that this is not a
> > sequential write from the HDD perspective, 320MB/s isn't all that bad.
> > Though with your setup I would have expected something faster, but NOT the
> > theoretical 600MB/s 4 HDDs will do in sequential writes.
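(Back of the envelope: 3 hosts x 4 HDDs = 12 spindles, divided by 3 replicas leaves the effective bandwidth of 4 HDDs; at roughly 150-170MB/s per drive that is the ~600MB/s theoretical sequential ceiling mentioned above, before Ceph overhead and seek penalties.)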
> >
> > > Stddev Bandwidth: 35.8664
> > > Max bandwidth (MB/sec): 392
> > > Min bandwidth (MB/sec): 276
> > > Average IOPS: 81
> > > Stddev IOPS: 8
> > > Max IOPS: 98
> > > Min IOPS: 69
> > > Average Latency(s): 0.195191
> > > Stddev Latency(s): 0.0830062
> > > Max latency(s): 0.481448
> > > Min latency(s): 0.0414858
> > > root@virt2:~# hdparm -I /dev/sda
> > >
> > >
> > >
> > > root@virt2:~# ceph osd tree
> > > ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
> > > -1 72.78290 root default
> > > -3 29.11316 host virt1
> > > 1 hdd 7.27829 osd.1 up 1.00000 1.00000
> > > 2 hdd 7.27829 osd.2 up 1.00000 1.00000
> > > 3 hdd 7.27829 osd.3 up 1.00000 1.00000
> > > 4 hdd 7.27829 osd.4 up 1.00000 1.00000
> > > -5 21.83487 host virt2
> > > 5 hdd 7.27829 osd.5 up 1.00000 1.00000
> > > 6 hdd 7.27829 osd.6 up 1.00000 1.00000
> > > 7 hdd 7.27829 osd.7 up 1.00000 1.00000
> > > -7 21.83487 host virt3
> > > 8 hdd 7.27829 osd.8 up 1.00000 1.00000
> > > 9 hdd 7.27829 osd.9 up 1.00000 1.00000
> > > 10 hdd 7.27829 osd.10 up 1.00000 1.00000
> > > 0 0 osd.0 down 0 1.00000
> > >
> > >
> > > root@virt2:~# ceph -s
> > > cluster:
> > > id: 278a2e9c-0578-428f-bd5b-3bb348923c27
> > > health: HEALTH_OK
> > >
> > > services:
> > > mon: 3 daemons, quorum virt1,virt2,virt3
> > > mgr: virt1(active)
> > > osd: 11 osds: 10 up, 10 in
> > >
> > > data:
> > > pools: 1 pools, 512 pgs
> > > objects: 6084 objects, 24105 MB
> > > usage: 92822 MB used, 74438 GB / 74529 GB avail
> > > pgs: 512 active+clean
> > >
> > > root@virt2:~# ceph -w
> > > cluster:
> > > id: 278a2e9c-0578-428f-bd5b-3bb348923c27
> > > health: HEALTH_OK
> > >
> > > services:
> > > mon: 3 daemons, quorum virt1,virt2,virt3
> > > mgr: virt1(active)
> > > osd: 11 osds: 10 up, 10 in
> > >
> > > data:
> > > pools: 1 pools, 512 pgs
> > > objects: 6084 objects, 24105 MB
> > > usage: 92822 MB used, 74438 GB / 74529 GB avail
> > > pgs: 512 active+clean
> > >
> > >
> > > 2017-11-20 12:32:08.199450 mon.virt1 [INF] mon.1 10.10.10.82:6789/0
> > >
> > >
> > >
> > > The SSD drives are used as journal drives:
> > >
> > Bluestore has no journals, so don't confuse yourself and the people you're
> > asking for help.
> >
> > > root@virt3:~# ceph-disk list | grep /dev/sde | grep osd
> > > /dev/sdb1 ceph data, active, cluster ceph, osd.8, block /dev/sdb2,
> > > block.db /dev/sde1
> > > root@virt3:~# ceph-disk list | grep /dev/sdf | grep osd
> > > /dev/sdc1 ceph data, active, cluster ceph, osd.9, block /dev/sdc2,
> > > block.db /dev/sdf1
> > > /dev/sdd1 ceph data, active, cluster ceph, osd.10, block /dev/sdd2,
> > > block.db /dev/sdf2
> > >
> > >
> > >
> > > I see now /dev/sda doesn't have a journal, though it should have. Not
> > > sure why.
> > If an OSD has no fast WAL/DB, it will drag the overall speed down.
> >
> > Verify that, and if so, fix it and re-test.
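(A quick way to verify which OSDs actually have their DB on the SSDs, run on each host; these are the standard bluestore paths, adjust as needed:

ceph-disk list | grep osd
ls -l /var/lib/ceph/osd/ceph-*/block*

Any OSD directory without a block.db symlink, or with one pointing at the HDD itself, is the one dragging things down.)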
> >
> > Christian
> >
> > > This is the command I used to create it:
> > >
> > >
> > > pveceph createosd /dev/sda -bluestore 1 -journal_dev /dev/sde
> > >
> > >
> >
> >
> > --
> > Christian Balzer Network/Systems Engineer
> > chibi@xxxxxxx Rakuten Communications
> >
>
>
>
--
Christian Balzer Network/Systems Engineer
chibi@xxxxxxx Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com