Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000

"Marc Roos" <M.Roos@xxxxxxxxxxxxxxxxx> · Wed, 27 May 2020 12:00:46 +0200

Interesting table. I get this on a production 10 Gbit cluster at a 
datacenter (which is obviously not doing that much). 

[@]# iperf3 -c 10.0.0.13 -P 1 -M 9000
Connecting to host 10.0.0.13, port 5201
[  4] local 10.0.0.14 port 52788 connected to 10.0.0.13 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.14 GBytes  9.77 Gbits/sec    0    690 KBytes
[  4]   1.00-2.00   sec  1.15 GBytes  9.90 Gbits/sec    0   1.08 MBytes
[  4]   2.00-3.00   sec  1.15 GBytes  9.88 Gbits/sec    0   1.08 MBytes
[  4]   3.00-4.00   sec  1.15 GBytes  9.88 Gbits/sec    0   1.08 MBytes
[  4]   4.00-5.00   sec  1.15 GBytes  9.88 Gbits/sec    0   1.08 MBytes
[  4]   5.00-6.00   sec  1.15 GBytes  9.90 Gbits/sec    0   1.21 MBytes
[  4]   6.00-7.00   sec  1.15 GBytes  9.89 Gbits/sec    0   1.21 MBytes
[  4]   7.00-8.00   sec  1.15 GBytes  9.88 Gbits/sec    0   1.21 MBytes
[  4]   8.00-9.00   sec  1.15 GBytes  9.89 Gbits/sec    0   1.21 MBytes
[  4]   9.00-10.00  sec  1.15 GBytes  9.89 Gbits/sec    0   1.21 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  11.5 GBytes  9.87 Gbits/sec    0             sender
[  4]   0.00-10.00  sec  11.5 GBytes  9.87 Gbits/sec                  receiver
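
As a rough cross-check of figures like these, the theoretical ceiling for TCP 
payload throughput at a given MTU can be estimated from per-frame overhead. 
The sketch below is only an approximation: it assumes plain IPv4 over untagged 
Ethernet, full-sized frames, and the TCP timestamps option enabled (the usual 
Linux default). The function name is made up for illustration.

```python
# Per-frame overhead on the wire for untagged Ethernet, in bytes:
# preamble + start-of-frame delimiter (8), Ethernet header (14),
# frame check sequence (4), inter-frame gap (12).
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12  # 38 bytes

IPV4_HEADER = 20     # assumption: no IP options
TCP_HEADER = 20
TCP_TIMESTAMPS = 12  # assumption: timestamps option enabled


def max_tcp_goodput_gbps(mtu: int, line_rate_gbps: float = 10.0) -> float:
    """Upper bound on TCP payload throughput when every frame is full-sized."""
    payload = mtu - IPV4_HEADER - TCP_HEADER - TCP_TIMESTAMPS
    on_wire = mtu + ETHERNET_OVERHEAD
    return line_rate_gbps * payload / on_wire


if __name__ == "__main__":
    for mtu in (1500, 5139, 9000):
        print(f"MTU {mtu}: at most {max_tcp_goodput_gbps(mtu):.2f} Gbit/s")
```

On a 10 Gbit link this gives a ceiling of roughly 9.41 Gbit/s at MTU 1500 and 
9.90 Gbit/s at MTU 9000, so a measured 9.87 to 9.90 Gbit/s is already close to 
the limit. Anything well below the ceiling points at the hosts or the switch 
rather than at framing overhead.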

-----Original Message-----
Subject: Re:  Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000

To elaborate on some aspects that have been mentioned already and add 
some others:

*	Test using iperf3. 

*	Don't try to use jumbos on networks where you don't have complete 
control over every host. This usually includes the main ceph network. 
It's just too much grief. You can consider using it for limited-access 
networks (e.g. ceph cluster network, hypervisor migration network, etc) 
where you know every switch & host is tuned correctly. (This works even 
when those nets share a vlan trunk with non-jumbo vlans - just set the 
max value on the trunk itself, and individual values on each vlan.)

*	If you are pinging, make sure it doesn't fragment, otherwise you 
will get misleading results. The payload must leave room for the 20-byte IP 
and 8-byte ICMP headers, e.g. for MTU 9000: ping -M do -s 8972 x.x.x.x

*	Do not assume that 9000 is the best value. It depends on your 
NICs, your switch, kernel/device parameters, etc. Try different values 
(using iperf3). As an example, the results below were obtained with a small, 
cheap MikroTik 10G switch and HPE 10G NICs. They show that in this 
configuration 9000 is worse than 1500, and that 5139 is optimal yet 5140 
is the worst. The same pattern (obviously with different values) was 
apparent when multiple tests were run concurrently. Always test your own 
network in a controlled manner. And of course if you introduce anything 
different later on, test again. With enterprise-grade kit this might not 
be so common, but always test if you fiddle.

MTU  Gbps  (actual data transfer values using iperf3; one particular configuration only)

9600 8.91 (max value)
9000 8.91
8000 8.91
7000 8.91
6000 8.91
5500 8.17
5200 7.71
5150 7.64
5140 7.62
5139 9.81 (optimal)
5138 9.81
5137 9.81
5135 9.81
5130 9.81
5120 9.81
5100 9.81
5000 9.81
4000 9.76
3000 9.68
2000 9.28
1500 9.37 (default)
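
For the don't-fragment ping check mentioned in the list above, the value given 
to -s is the ICMP payload, not the MTU, so the header sizes have to be 
subtracted first. A minimal sketch of that arithmetic follows. It assumes Linux 
iputils ping (for the -M do flag) and headers without options; the function 
names and the 192.0.2.10 address are placeholders for illustration only.

```python
from typing import List

IPV4_HEADER = 20  # bytes, assuming no IP options
IPV6_HEADER = 40  # bytes, fixed header only
ICMP_HEADER = 8   # bytes, echo request header (same size for ICMPv6)


def ping_payload_size(mtu: int, ipv6: bool = False) -> int:
    """Largest ping -s value that still fits in one unfragmented packet."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - ICMP_HEADER


def ping_command(host: str, mtu: int) -> List[str]:
    """Build an iputils ping command that forbids fragmentation (-M do)."""
    size = ping_payload_size(mtu)
    return ["ping", "-M", "do", "-c", "3", "-s", str(size), host]


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder from the documentation address range.
    print(" ".join(ping_command("192.0.2.10", 9000)))
```

For MTU 9000 over IPv4 this yields a payload of 8972 bytes. If that ping fails 
while a 1472-byte one (MTU 1500) succeeds, some hop on the path is probably not 
passing jumbo frames.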

Whether any of this will make a tangible difference for ceph is moot. I 
just spend a little time getting the network stack correct as above, 
then leave it. That way I know I am probably getting some benefit, and 
not doing any harm. If you blindly change things you may well do harm 
that can manifest itself in all sorts of ways outside of Ceph. Getting 
some test results for this using Ceph will be easy; getting MEANINGFUL 
results that way will be hard.

Chris

On 27/05/2020 09:25, Marc Roos wrote:

	I would not call a ceph page a random tuning tip. At least I hope they 
	are not. NVMe-only with 100Gbit is not really a standard setup. I assume 
	that with such a setup you have the luxury of not noticing many 
	optimizations. 

	What I mostly read is that changing to mtu 9000 will allow you to better 
	saturate the 10Gbit adapter, and I expect this to show on a low end busy 
	cluster. Don't you have any test results of such a setup?

	-----Original Message-----
	Subject: Re:  Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000

	Don't optimize stuff without benchmarking *before and after*, and don't 
	apply random tuning tips from the Internet without benchmarking them.

	My experience with Jumbo frames: a 3% performance gain. On an NVMe-only 
	setup with a 100 Gbit/s network.

	Paul

	--
	Paul Emmerich

	Looking for help with your Ceph cluster? Contact us at https://croit.io

	croit GmbH
	Freseniusstr. 31h
	81247 München
	www.croit.io
	Tel: +49 89 1896585 90

	On Tue, May 26, 2020 at 7:02 PM Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx> wrote:

		Look what I have found!!! :)
		https://ceph.com/geen-categorie/ceph-loves-jumbo-frames/ 

		-----Original Message-----
		From: Anthony D'Atri [mailto:anthony.datri@xxxxxxxxx] 
		Sent: Monday 25 May 2020 22:12
		To: Marc Roos
		Cc: kdhall; martin.verges; sstkadu; amudhan83; ceph-users; doustar
		Subject: Re:  Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000

		Quick and easy depends on your network infrastructure. Sometimes it is 
		difficult or impossible to retrofit a live cluster without disruption.

		> On May 25, 2020, at 1:03 AM, Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx> wrote:
		> 
		> I am interested. I am always setting mtu to 9000. To be honest I 
		> cannot imagine there is no optimization, since you have fewer interrupt 
		> requests and you are able to send x times as much data per packet. Every 
		> time something is written about optimizing, the first thing mentioned is 
		> changing the mtu to 9000, because it is a quick and easy win.
		> 
		> -----Original Message-----
		> From: Dave Hall [mailto:kdhall@xxxxxxxxxxxxxx]
		> Sent: Monday 25 May 2020 5:11
		> To: Martin Verges; Suresh Rama
		> Cc: Amudhan P; Khodayar Doustar; ceph-users
		> Subject:  Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000
		> 
		> All,
		> 
		> Regarding Martin's observations about Jumbo Frames....
		> 
		> I have recently been gathering some notes from various internet 
		> sources regarding Linux network performance, and Linux performance in 
		> general, to be applied to a Ceph cluster I manage but also to the rest 
		> of the Linux server farm I'm responsible for.
		> 
		> In short, enabling Jumbo Frames without also tuning a number of other 
		> kernel and NIC attributes will not provide the performance increases 
		> we'd like to see.  I have not yet had a chance to go through the rest 
		> of the testing I'd like to do, but I can confirm (via iperf3) that 
		> only enabling Jumbo Frames didn't make a significant difference.
		> 
		> Some of the other attributes I'm referring to are incoming and 
		> outgoing buffer sizes at the NIC, IP, and TCP levels, interrupt 
		> coalescing, NIC offload functions that should or shouldn't be turned 
		> on, packet queuing disciplines (tc), the best choice of TCP slow-start 
		> algorithms, and other TCP features and attributes.
		> 
		> The most off-beat item I saw was something about adding IPTABLES rules 
		> to bypass CONNTRACK table lookups.
		> 
		> In order to do anything meaningful to assess the effect of all of 
		> these settings I'd like to figure out how to set them all via Ansible 
		> - so more to learn before I can give opinions.
		> 
		> -->  If anybody has added this type of configuration to Ceph Ansible, 
		> I'd be glad for some pointers.
		> 
		> I have started to compile a document containing my notes.  It's rough, 
		> but I'd be glad to share if anybody is interested.
		> 
		> -Dave
		> 
		> Dave Hall
		> Binghamton University
		> 
		>> On 5/24/2020 12:29 PM, Martin Verges wrote:
		>> 
		>> Just save yourself the trouble. You won't have any real benefit from MTU 
		>> 9000. It has some smallish benefit, but it is not worth the effort, 
		>> problems, and loss of reliability for most environments.
		>> Try it yourself and do some benchmarks, especially with your regular 
		>> workload on the cluster (not the maximum peak performance), then drop the 
		>> MTU to default ;).
		>> 
		>> If anyone has other real-world benchmarks showing huge differences 
		>> in regular Ceph clusters, please feel free to post them here.
		>> 
		>> --
		>> Martin Verges
		>> Managing director
		>> 
		>> Mobile: +49 174 9335695
		>> E-Mail: martin.verges@xxxxxxxx
		>> Chat: https://t.me/MartinVerges
		>> 
		>> croit GmbH, Freseniusstr. 31h, 81247 Munich
		>> CEO: Martin Verges - VAT-ID: DE310638492 Com. register: Amtsgericht 
		>> Munich HRB 231263
		>> 
		>> Web: https://croit.io
		>> YouTube: https://goo.gl/PGE1Bx
		>> 
		>> 
		>>> On Sun, 24 May 2020 at 15:54, Suresh Rama <sstkadu@xxxxxxxxx> wrote:
		>>> 
		>>> Ping with 9000 MTU won't get a response, as I said; it should be 8972. Glad 
		>>> it is working, but you should know what happened to avoid this issue later.
		>>> 
		>>>> On Sun, May 24, 2020, 3:04 AM Amudhan P <amudhan83@xxxxxxxxx> wrote:
		>>> 
		>>>> No, ping with MTU size 9000 didn't work.
		>>>> 
		>>>> On Sun, May 24, 2020 at 12:26 PM Khodayar Doustar <doustar@xxxxxxxxxxxx> wrote:
		>>>> 
		>>>>> Does your ping work or not?
		>>>>> 
		>>>>> 
		>>>>> On Sun, May 24, 2020 at 6:53 AM Amudhan P <amudhan83@xxxxxxxxx> wrote:
		>>>>> 
		>>>>>> Yes, I have set it on the switch side as well.
		>>>>>> 
		>>>>>> On Sat 23 May, 2020, 6:47 PM Khodayar Doustar <doustar@xxxxxxxxxxxx> wrote:
		>>>>>> 
		>>>>>>> The problem should be with the network. When you change the MTU it should 
		>>>>>>> be changed all over the network; every single hop on your network should 
		>>>>>>> speak and accept 9000 MTU packets. You can check it on your hosts with the 
		>>>>>>> "ifconfig" command, and there are equivalent commands for other 
		>>>>>>> network/security devices.
		>>>>>>> 
		>>>>>>> If you have just one node which is not correctly configured for MTU 9000, 
		>>>>>>> it won't work.
		>>>>>>> 
		>>>>>>> On Sat, May 23, 2020 at 2:30 PM sinan@xxxxxxxx <sinan@xxxxxxxx> wrote:
		>>>>>>>> Can the servers/nodes ping each other using large packet sizes? I guess 
		>>>>>>>> not.
		>>>>>>>> 
		>>>>>>>> Sinan Polat
		>>>>>>>> 
		>>>>>>>>> On 23 May 2020 at 14:21, Amudhan P <amudhan83@xxxxxxxxx> wrote:
		>>>>>>>>> In OSD logs "heartbeat_check: no reply from OSD"
		>>>>>>>>> 
		>>>>>>>>>> On Sat, May 23, 2020 at 5:44 PM Amudhan P <amudhan83@xxxxxxxxx> wrote:
		>>>>>>>>>> Hi,
		>>>>>>>>>> 
		>>>>>>>>>> I have set the network switch to MTU size 9000, and also my netplan 
		>>>>>>>>>> configuration.
		>>>>>>>>>> 
		>>>>>>>>>> What else needs to be checked?
		>>>>>>>>>> 
		>>>>>>>>>> 
		>>>>>>>>>>> On Sat, May 23, 2020 at 3:39 PM Wido den Hollander <wido@xxxxxxxx> wrote:
		>>>>>>>>>>> 
		>>>>>>>>>>> 
		>>>>>>>>>>>> On 5/23/20 12:02 PM, Amudhan P wrote:
		>>>>>>>>>>>> Hi,
		>>>>>>>>>>>> 
		>>>>>>>>>>>> I am using ceph Nautilus on Ubuntu 18.04, working fine with MTU size 1500 
		>>>>>>>>>>>> (default). Recently I tried to update the MTU size to 9000. 
		>>>>>>>>>>>> After setting jumbo frames, running ceph -s times out.
		>>>>>>>>>>> Ceph can run just fine with an MTU of 9000. But there is probably 
		>>>>>>>>>>> something else wrong on the network which is causing this.
		>>>>>>>>>>> 
		>>>>>>>>>>> Check the Jumbo Frames settings on all the switches as well to make sure 
		>>>>>>>>>>> they forward all the packets.
		>>>>>>>>>>> 
		>>>>>>>>>>> This is definitely not a Ceph issue.
		>>>>>>>>>>> 
		>>>>>>>>>>> Wido
		>>>>>>>>>>> 
		>>>>>>>>>>>> regards
		>>>>>>>>>>>> Amudhan P
		>>>>>>>>>>>> _______________________________________________
		>>>>>>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
		>>>>>>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
		>>>>>>>>>>>> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx