I am not sure why, but I cannot get Jumbo Frames to work properly:
root@virt2:~# ping -M do -s 8972 -c 4 10.10.10.83
PING 10.10.10.83 (10.10.10.83) 8972(9000) bytes of data.
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
Jumbo frames are enabled on the switch and on the NICs:
ens2f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.10.83 netmask 255.255.255.0 broadcast 10.10.10.255
inet6 fe80::ec4:7aff:feea:7b40 prefixlen 64 scopeid 0x20<link>
ether 0c:c4:7a:ea:7b:40 txqueuelen 1000 (Ethernet)
RX packets 166440655 bytes 229547410625 (213.7 GiB)
RX errors 0 dropped 223 overruns 0 frame 0
TX packets 142788790 bytes 188658602086 (175.7 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
root@virt2:~# ifconfig ens2f0
ens2f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.10.82 netmask 255.255.255.0 broadcast 10.10.10.255
inet6 fe80::ec4:7aff:feea:ff2c prefixlen 64 scopeid 0x20<link>
ether 0c:c4:7a:ea:ff:2c txqueuelen 1000 (Ethernet)
RX packets 466774 bytes 385578454 (367.7 MiB)
RX errors 4 dropped 223 overruns 0 frame 3
TX packets 594975 bytes 580053745 (553.1 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
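For what it's worth, -s 8972 is the right payload for a 9000-byte MTU (8972 bytes ICMP payload + 8 bytes ICMP header + 20 bytes IP header = 9000). The "local error: Message too long, mtu=1500" means the kernel's chosen egress path still has an MTU of 1500, so either the packet is being routed out another interface, or it passes over a bridge/bond/VLAN on top of ens2f0 that was left at the default MTU. A quick way to check (standard iproute2 commands; vmbr1 below is only a placeholder for whatever intermediate device turns up):

root@virt2:~# ip route get 10.10.10.83           # which device does the kernel actually send through?
root@virt2:~# ip -o link | awk '{print $2, $5}'  # name and MTU of every interface, incl. bridges/bonds/VLANs
root@virt2:~# ip link set dev vmbr1 mtu 9000     # example only: raise the MTU of an intermediate device

Every layer in the stack (NIC, bond, VLAN, bridge) plus the switch ports on both ends needs MTU >= 9000 for the ping above to succeed.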
On Mon, Nov 20, 2017 at 2:13 PM, Sébastien VIGNERON <sebastien.vigneron@xxxxxxxxx> wrote:
As a jumbo frame test, can you try the following?

ping -M do -s 8972 -c 4 IP_of_other_node_within_cluster_network

If you have « ping: sendto: Message too long », jumbo frames are not activated.

Cordialement / Best regards,
Sébastien VIGNERON
CRIANN,
Ingénieur / Engineer
Technopôle du Madrillet
745, avenue de l'Université
76800 Saint-Etienne du Rouvray - France
tél. +33 2 32 91 42 91
fax. +33 2 32 91 42 92
http://www.criann.fr
mailto:sebastien.vigneron@criann.fr
support: support@xxxxxxxxx

On 20 Nov 2017, at 13:02, Rudi Ahlers <rudiahlers@xxxxxxxxx> wrote:

We're planning on installing 12X Virtual Machines with some heavy loads.

The SSD drives are INTEL SSDSC2BA400G4. The SATA drives are ST8000NM0055-1RM112.

Please explain your comment, "b) will find a lot of people here who don't approve of it."

I don't have access to the switches right now, but they're new so whatever default config ships from factory would be active. Though iperf shows 10.5 GBytes / 9.02 Gbits/sec throughput. What speeds would you expect?

"Though with your setup I would have expected something faster, but NOT the theoretical 600MB/s 4 HDDs will do in sequential writes."

On this, "If an OSD has no fast WAL/DB, it will drag the overall speed down. Verify and if so fix this and re-test.": how?

On Mon, Nov 20, 2017 at 1:44 PM, Christian Balzer <chibi@xxxxxxx> wrote:

On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:
> Hi,
>
> Can someone please help me, how do I improve performance on our Ceph cluster?
>
> The hardware in use are as follows:
> 3x SuperMicro servers with the following configuration
> 12-core dual Xeon 2.2GHz
Faster cores are better for Ceph, IMNSHO.
Though with main storage on HDDs, this will do.
> 128GB RAM
Overkill for Ceph but I see something else below...
> 2x 400GB Intel DC SSD drives
Exact model please.
> 4x 8TB Seagate 7200rpm 6Gbps SATA HDDs
One hopes that's a non SMR one.
Model please.
> 1x SuperMicro DOM for Proxmox / Debian OS
Ah, Proxmox.
I'm personally not averse to converged, high density, multi-role clusters
myself, but you:
a) need to know what you're doing and
b) will find a lot of people here who don't approve of it.
I've avoided DOMs so far (non-hotswapable SPOF), even though the SM ones
look good on paper with regards to endurance and IOPS.
The latter being rather important for your monitors.
> 4x Port 10GbE NIC
> Cisco 10GbE switch.
>
Configuration would be nice for those, LACP?
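If those four 10GbE ports are bonded, the bond mode is worth confirming; a quick check on the Linux side, assuming a kernel bonding device called bond0 (adjust the name to whatever Proxmox created):

root@virt2:~# cat /proc/net/bonding/bond0            # bonding mode, 802.3ad/LACP state, slave link status
root@virt2:~# grep -A3 bond /etc/network/interfaces  # Debian/Proxmox network definition

Keep in mind that with LACP a single TCP stream still only ever uses one 10GbE link, which would be consistent with the ~9 Gbit/s iperf figure mentioned elsewhere in the thread.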
>
> root@virt2:~# rados bench -p Data 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> 4194304 for up to 10 seconds or 0 objects
rados bench is a limited tool, and measuring bandwidth is pointless in nearly all use cases.
Latency is where it is at and testing from inside a VM is more relevant
than synthetic tests of the storage.
But it is a start.

What part of this surprises you?
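On the latency point, a minimal sketch of an in-VM test, assuming fio is installed in the guest and /dev/vdb (placeholder name) is an RBD-backed disk that can safely be overwritten:

fio --name=lat-test --filename=/dev/vdb --rw=randwrite --bs=4k \
    --iodepth=1 --numjobs=1 --direct=1 --ioengine=libaio \
    --runtime=60 --time_based --group_reporting

The completion-latency (clat) numbers in the output are the interesting part; with HDD OSDs and DB/WAL on SSD, a few milliseconds per 4k write would not be unusual.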
> Object prefix: benchmark_data_virt2_39099
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -           0
>     1      16        85        69   275.979       276     0.185576    0.204146
>     2      16       171       155   309.966       344    0.0625409    0.193558
>     3      16       243       227   302.633       288    0.0547129     0.19835
>     4      16       330       314   313.965       348    0.0959492    0.199825
>     5      16       413       397   317.565       332     0.124908    0.196191
>     6      16       494       478   318.633       324       0.1556    0.197014
>     7      15       591       576   329.109       392     0.136305    0.192192
>     8      16       670       654   326.965       312    0.0703808    0.190643
>     9      16       757       741   329.297       348     0.165211    0.192183
>    10      16       828       812   324.764       284    0.0935803    0.194041
> Total time run: 10.120215
> Total writes made: 829
> Write size: 4194304
> Object size: 4194304
> Bandwidth (MB/sec): 327.661
With a replication of 3, you have effectively the bandwidth of your 2 SSDs
(for small writes, not the case here) and the bandwidth of your 4 HDDs
available.
Given overhead, other inefficiencies and the fact that this is not a
sequential write from the HDD perspective, 320MB/s isn't all that bad.
Though with your setup I would have expected something faster, but NOT the
theoretical 600MB/s 4 HDDs will do in sequential writes.
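As a rough, back-of-the-envelope check with assumed per-drive figures (a 7200rpm SATA disk manages very roughly 150-180 MB/s sequential, and much less under the partly random write pattern an OSD produces):

    10 HDD OSDs / 3 replicas     =  ~3.3 drives' worth of client-visible write bandwidth
    3.3 drives x ~150-180 MB/s   =  ~500-600 MB/s as an absolute ceiling

so the measured ~328 MB/s is in the range you would expect once replication traffic, BlueStore metadata and the non-sequential write pattern are accounted for.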
> Stddev Bandwidth: 35.8664
> Max bandwidth (MB/sec): 392
> Min bandwidth (MB/sec): 276
> Average IOPS: 81
> Stddev IOPS: 8
> Max IOPS: 98
> Min IOPS: 69
> Average Latency(s): 0.195191
> Stddev Latency(s): 0.0830062
> Max latency(s): 0.481448
> Min latency(s): 0.0414858
> root@virt2:~# hdparm -I /dev/sda
>
>
>
> root@virt2:~# ceph osd tree
> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
> -1 72.78290 root default
> -3 29.11316 host virt1
> 1 hdd 7.27829 osd.1 up 1.00000 1.00000
> 2 hdd 7.27829 osd.2 up 1.00000 1.00000
> 3 hdd 7.27829 osd.3 up 1.00000 1.00000
> 4 hdd 7.27829 osd.4 up 1.00000 1.00000
> -5 21.83487 host virt2
> 5 hdd 7.27829 osd.5 up 1.00000 1.00000
> 6 hdd 7.27829 osd.6 up 1.00000 1.00000
> 7 hdd 7.27829 osd.7 up 1.00000 1.00000
> -7 21.83487 host virt3
> 8 hdd 7.27829 osd.8 up 1.00000 1.00000
> 9 hdd 7.27829 osd.9 up 1.00000 1.00000
> 10 hdd 7.27829 osd.10 up 1.00000 1.00000
> 0 0 osd.0 down 0 1.00000
>
>
> root@virt2:~# ceph -s
> cluster:
> id: 278a2e9c-0578-428f-bd5b-3bb348923c27
> health: HEALTH_OK
>
> services:
> mon: 3 daemons, quorum virt1,virt2,virt3
> mgr: virt1(active)
> osd: 11 osds: 10 up, 10 in
>
> data:
> pools: 1 pools, 512 pgs
> objects: 6084 objects, 24105 MB
> usage: 92822 MB used, 74438 GB / 74529 GB avail
> pgs: 512 active+clean
>
> root@virt2:~# ceph -w
> cluster:
> id: 278a2e9c-0578-428f-bd5b-3bb348923c27
> health: HEALTH_OK
>
> services:
> mon: 3 daemons, quorum virt1,virt2,virt3
> mgr: virt1(active)
> osd: 11 osds: 10 up, 10 in
>
> data:
> pools: 1 pools, 512 pgs
> objects: 6084 objects, 24105 MB
> usage: 92822 MB used, 74438 GB / 74529 GB avail
> pgs: 512 active+clean
>
>
> 2017-11-20 12:32:08.199450 mon.virt1 [INF] mon.1 10.10.10.82:6789/0
>
>
>
> The SSD drives are used as journal drives:
>
Bluestore has no journals, don't confuse it and the people you're asking for help.
> root@virt3:~# ceph-disk list | grep /dev/sde | grep osd
> /dev/sdb1 ceph data, active, cluster ceph, osd.8, block /dev/sdb2,
> block.db /dev/sde1
> root@virt3:~# ceph-disk list | grep /dev/sdf | grep osd
> /dev/sdc1 ceph data, active, cluster ceph, osd.9, block /dev/sdc2,
> block.db /dev/sdf1
> /dev/sdd1 ceph data, active, cluster ceph, osd.10, block /dev/sdd2,
> block.db /dev/sdf2
>
>
>
> I see now /dev/sda doesn't have a journal, though it should have. Not sure
> why.
If an OSD has no fast WAL/DB, it will drag the overall speed down.
Verify and if so fix this and re-test.
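One concrete way to verify (Luminous / ceph-disk layout; OSD id 0 below is just taken from the tree above): a BlueStore OSD that was given an external DB device has a block.db symlink in its data directory pointing at the SSD partition, and the OSD metadata reports the same.

root@virt2:~# ls -l /var/lib/ceph/osd/ceph-*/block.db    # no symlink for an OSD = no external DB for it
root@virt2:~# ceph osd metadata 0 | grep bluefs_db       # bluefs_db_* entries name the DB partition, if any

If osd.0 really has no SSD-backed DB, the practical fix at this point is to destroy and re-create that OSD with the db device specified, rather than trying to bolt one on afterwards.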
Christian
> This is the command I used to create it:
>
>
> pveceph createosd /dev/sda -bluestore 1 -journal_dev /dev/sde
>
>
--
Christian Balzer Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications
--
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com