On Mon, Nov 20, 2017 at 2:36 PM, Christian Balzer <chibi@xxxxxxx> wrote:
On Mon, 20 Nov 2017 14:02:30 +0200 Rudi Ahlers wrote:
> We're planning on installing 12 virtual machines with some heavy loads.
>
> the SSD drives are INTEL SSDSC2BA400G4
>
Interesting, where did you find those?
Or did you have them lying around?
I've been unable to get DC S3710 SSDs for nearly a year now.
In South Africa, one of our suppliers had some in stock. They're still fairly new, about 2 months old now.
> The SATA drives are ST8000NM0055-1RM112
>
Note that these (while fast) have an internal flash cache, limiting them to
something like 0.2 DWPD.
Probably not an issue with the WAL/DB on the Intels, but something to keep
in mind.
I don't quite understand what you mean here. Could you please explain?
> Please explain your comment, "b) will find a lot of people here who don't
> approve of it."
>
Read the archives.
Converged clusters are complex, and debugging Ceph while tons of other
things are going on on the same machine is even more so.
Ok, so I have 4 physical servers and need to set up a highly redundant cluster. How else would you have done it? There is no budget for a SAN, let alone a highly available one.
> I don't have access to the switches right now, but they're new so whatever
> default config ships from factory would be active. Though iperf shows 10.5
> GBytes / 9.02 Gbits/sec throughput.
>
Didn't think it was the switches, but for completeness' sake and all that.
> What speeds would you expect?
> "Though with your setup I would have expected something faster, but NOT the
> theoretical 600MB/s 4 HDDs will do in sequential writes."
>
What I wrote.
A 7200RPM HDD, even these, cannot sustain writes much over 170MB/s, even in
the most optimal circumstances.
So your cluster can NOT exceed about 600MB/s of sustained writes, given the
effective bandwidth of 4 HDDs.
Smaller writes/reads that can be cached by RAM, DB, onboard caches on the
HDDs of course can and will be faster.
But again, you're missing the point: even if you get 600MB/s writes out of
your cluster, the number of 4k IOPS will be much more relevant to your VMs.
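(For a rough idea of what the VMs will actually see, a small random-write test with fio from inside a guest is more telling than any bandwidth figure. A minimal sketch, assuming fio is installed in the VM and the scratch file path is safe to write to; adjust size and runtime to taste:

fio --name=4krandwrite --filename=/root/fio-test.bin --size=2G \
    --bs=4k --rw=randwrite --ioengine=libaio --direct=1 \
    --iodepth=32 --runtime=60 --time_based --group_reporting

The IOPS and latency numbers it reports are what matter for VM workloads.)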
hdparm shows about 230MB/s:
root@virt2:~# hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 20250 MB in 2.00 seconds = 10134.81 MB/sec
Timing buffered disk reads: 680 MB in 3.00 seconds = 226.50 MB/sec
600MB/s would be super nice, but in reality even 400MB/s would be good enough. Would that not be achievable?
>
>
> On this, "If an OSD has no fast WAL/DB, it will drag the overall speed
> down. Verify and if so fix this and re-test.": how?
>
No idea, I don't do bluestore.
You noticed the lack of a WAL/DB for sda; go and fix it.
If in doubt, by destroying and re-creating.
And if you're looking for a less invasive procedure, check the docs and the ML
archive, but AFAIK there is nothing but re-creation at this time.
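(For what it's worth, the destroy-and-re-create route would look roughly like the following. This is an untested sketch; the OSD id 0 and the devices are placeholders taken from the ceph osd tree and the pveceph command in this thread, so double-check them against ceph-disk list before running anything:

ceph osd out 0
systemctl stop ceph-osd@0
ceph osd purge 0 --yes-i-really-mean-it
ceph-disk zap /dev/sda
pveceph createosd /dev/sda -bluestore 1 -journal_dev /dev/sde

Let the cluster settle back to HEALTH_OK before re-running the benchmark.)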
I'm using Proxmox, which set up a DB device but not a WAL device.
Christian
>
> On Mon, Nov 20, 2017 at 1:44 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> > On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:
> >
> > > Hi,
> > >
> > > Can someone please help me: how do I improve performance on our Ceph
> > > cluster?
> > >
> > > The hardware in use are as follows:
> > > 3x SuperMicro servers with the following configuration
> > > 12Core Dual XEON 2.2Ghz
> > Faster cores are better for Ceph, IMNSHO.
> > Though with main storage on HDDs, this will do.
> >
> > > 128GB RAM
> > Overkill for Ceph but I see something else below...
> >
> > > 2x 400GB Intel DC SSD drives
> > Exact model please.
> >
> > > 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
> > One hopes that's a non-SMR one.
> > Model please.
> >
> > > 1x SuperMicro DOM for Proxmox / Debian OS
> > Ah, Proxmox.
> > I'm not averse to converged, high-density, multi-role clusters myself,
> > but you:
> > a) need to know what you're doing and
> > b) will find a lot of people here who don't approve of it.
> >
> > I've avoided DOMs so far (non-hotswappable SPOF), even though the SM ones
> > look good on paper with regard to endurance and IOPS.
> > The latter is rather important for your monitors.
> >
> > > 4x port 10GbE NIC
> > > Cisco 10GbE switch.
> > >
> > Configuration would be nice for those, LACP?
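(If the plan is LACP, a typical Proxmox/Debian bond stanza in /etc/network/interfaces looks something like the sketch below. The interface names are made up, and the switch side needs a matching port-channel configured:

auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer3+4

layer3+4 hashing helps spread Ceph's many TCP connections across the links.)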
> >
> > >
> > > root@virt2:~# rados bench -p Data 10 write --no-cleanup
> > > hints = 1
> > > Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> > > 4194304 for up to 10 seconds or 0 objects
> >
> > rados bench is a limited tool, and measuring bandwidth is pointless in
> > nearly all use cases.
> > Latency is where it is at, and testing from inside a VM is more relevant
> > than synthetic tests of the storage.
> > But it is a start.
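(Still synthetic, but re-running rados bench with a 4k block size at least gets closer to what the VMs will do; something along these lines, assuming the pool is still called Data:

rados bench -p Data 30 write -b 4096 -t 16 --no-cleanup
rados bench -p Data 30 rand -t 16

Look at the average and max latency lines rather than the MB/s.)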
> >
> > > Object prefix: benchmark_data_virt2_39099
> > >   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
> > >     0       0         0         0         0         0            -           0
> > >     1      16        85        69   275.979       276     0.185576    0.204146
> > >     2      16       171       155   309.966       344    0.0625409    0.193558
> > >     3      16       243       227   302.633       288    0.0547129     0.19835
> > >     4      16       330       314   313.965       348    0.0959492    0.199825
> > >     5      16       413       397   317.565       332     0.124908    0.196191
> > >     6      16       494       478   318.633       324       0.1556    0.197014
> > >     7      15       591       576   329.109       392     0.136305    0.192192
> > >     8      16       670       654   326.965       312    0.0703808    0.190643
> > >     9      16       757       741   329.297       348     0.165211    0.192183
> > >    10      16       828       812   324.764       284    0.0935803    0.194041
> > > Total time run: 10.120215
> > > Total writes made: 829
> > > Write size: 4194304
> > > Object size: 4194304
> > > Bandwidth (MB/sec): 327.661
> > What part of this surprises you?
> >
> > With a replication of 3, you have effectively the bandwidth of your 2 SSDs
> > (for small writes, not the case here) and the bandwidth of your 4 HDDs
> > available.
> > Given overhead, other inefficiencies and the fact that this is not a
> > sequential write from the HDD perspective, 320MB/s isn't all that bad.
> > Though with your setup I would have expected something faster, but NOT the
> > theoretical 600MB/s 4 HDDs will do in sequential writes.
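(Back of the envelope: 3 hosts x 4 HDDs = 12 spindles, divided by 3 replicas leaves the effective bandwidth of 4 HDDs; at roughly 150-170MB/s per drive that is the ~600MB/s theoretical sequential ceiling mentioned above, before Ceph overhead and seek penalties.)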
> >
> > > Stddev Bandwidth: 35.8664
> > > Max bandwidth (MB/sec): 392
> > > Min bandwidth (MB/sec): 276
> > > Average IOPS: 81
> > > Stddev IOPS: 8
> > > Max IOPS: 98
> > > Min IOPS: 69
> > > Average Latency(s): 0.195191
> > > Stddev Latency(s): 0.0830062
> > > Max latency(s): 0.481448
> > > Min latency(s): 0.0414858
> > > root@virt2:~# hdparm -I /dev/sda
> > >
> > >
> > >
> > > root@virt2:~# ceph osd tree
> > > ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
> > > -1 72.78290 root default
> > > -3 29.11316 host virt1
> > > 1 hdd 7.27829 osd.1 up 1.00000 1.00000
> > > 2 hdd 7.27829 osd.2 up 1.00000 1.00000
> > > 3 hdd 7.27829 osd.3 up 1.00000 1.00000
> > > 4 hdd 7.27829 osd.4 up 1.00000 1.00000
> > > -5 21.83487 host virt2
> > > 5 hdd 7.27829 osd.5 up 1.00000 1.00000
> > > 6 hdd 7.27829 osd.6 up 1.00000 1.00000
> > > 7 hdd 7.27829 osd.7 up 1.00000 1.00000
> > > -7 21.83487 host virt3
> > > 8 hdd 7.27829 osd.8 up 1.00000 1.00000
> > > 9 hdd 7.27829 osd.9 up 1.00000 1.00000
> > > 10 hdd 7.27829 osd.10 up 1.00000 1.00000
> > > 0 0 osd.0 down 0 1.00000
> > >
> > >
> > > root@virt2:~# ceph -s
> > > cluster:
> > > id: 278a2e9c-0578-428f-bd5b-3bb348923c27
> > > health: HEALTH_OK
> > >
> > > services:
> > > mon: 3 daemons, quorum virt1,virt2,virt3
> > > mgr: virt1(active)
> > > osd: 11 osds: 10 up, 10 in
> > >
> > > data:
> > > pools: 1 pools, 512 pgs
> > > objects: 6084 objects, 24105 MB
> > > usage: 92822 MB used, 74438 GB / 74529 GB avail
> > > pgs: 512 active+clean
> > >
> > > root@virt2:~# ceph -w
> > > cluster:
> > > id: 278a2e9c-0578-428f-bd5b-3bb348923c27
> > > health: HEALTH_OK
> > >
> > > services:
> > > mon: 3 daemons, quorum virt1,virt2,virt3
> > > mgr: virt1(active)
> > > osd: 11 osds: 10 up, 10 in
> > >
> > > data:
> > > pools: 1 pools, 512 pgs
> > > objects: 6084 objects, 24105 MB
> > > usage: 92822 MB used, 74438 GB / 74529 GB avail
> > > pgs: 512 active+clean
> > >
> > >
> > > 2017-11-20 12:32:08.199450 mon.virt1 [INF] mon.1 10.10.10.82:6789/0
> > >
> > >
> > >
> > > The SSD drives are used as journal drives:
> > >
> > Bluestore has no journals, so don't confuse yourself and the people you're
> > asking for help.
> >
> > > root@virt3:~# ceph-disk list | grep /dev/sde | grep osd
> > > /dev/sdb1 ceph data, active, cluster ceph, osd.8, block /dev/sdb2,
> > > block.db /dev/sde1
> > > root@virt3:~# ceph-disk list | grep /dev/sdf | grep osd
> > > /dev/sdc1 ceph data, active, cluster ceph, osd.9, block /dev/sdc2,
> > > block.db /dev/sdf1
> > > /dev/sdd1 ceph data, active, cluster ceph, osd.10, block /dev/sdd2,
> > > block.db /dev/sdf2
> > >
> > >
> > >
> > > I see now /dev/sda doesn't have a journal, though it should have. Not
> > > sure why.
> > If an OSD has no fast WAL/DB, it will drag the overall speed down.
> >
> > Verify that, and if so, fix it and re-test.
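(A quick way to verify which OSDs actually have their DB on the SSDs, run on each host; these are the standard bluestore paths, adjust as needed:

ceph-disk list | grep osd
ls -l /var/lib/ceph/osd/ceph-*/block*

Any OSD directory without a block.db symlink, or with one pointing at the HDD itself, is the one dragging things down.)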
> >
> > Christian
> >
> > > This is the command I used to create it:
> > >
> > >
> > > pveceph createosd /dev/sda -bluestore 1 -journal_dev /dev/sde
> > >
> > >
> >
> >
> > --
> > Christian Balzer Network/Systems Engineer
> > chibi@xxxxxxx Rakuten Communications
> >
>
>
>
--
Christian Balzer Network/Systems Engineer
chibi@xxxxxxx Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com