Please do not apply any optimization without benchmarking *before* and *after* in a somewhat realistic scenario. No, iperf is likely not a realistic setup because it will usually be limited by available network bandwidth which is (should) rarely be maxed out on your actual Ceph setup. Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Fri, May 29, 2020 at 2:15 AM Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote: > Hello. > > A few days ago I offered to share the notes I've compiled on network > tuning. Right now it's a Google Doc: > > > https://docs.google.com/document/d/1nB5fzIeSgQF0ti_WN-tXhXAlDh8_f8XF9GhU7J1l00g/edit?usp=sharing > > I've set it up to allow comments and I'd be glad for questions and > feedback. If Google Docs not an acceptable format I'll try to put it up > somewhere as HTML or Wiki. Disclosure: some sections were copied > verbatim from other sources. > > Regarding the current discussion about iperf, the likely bottleneck is > buffering. There is a per-NIC output queue set with 'ip link' and a per > CPU core input queue set with 'sysctl'. Both should be set to some > multiple of the frame size based on calculations related to link speed > and latency. Jumping from 1500 to 9000 could negatively impact > performance because one buffer or the other might be 1500 bytes short of > a low multiple of 9000. > > It would be interesting to see the iperf tests repeated with > corresponding buffer sizing. I will perform this experiment as soon as > I complete some day-job tasks. > > -Dave > > Dave Hall > Binghamton University > kdhall@xxxxxxxxxxxxxx > 607-760-2328 (Cell) > 607-777-4641 (Office) > > On 5/27/2020 6:51 AM, EDH - Manuel Rios wrote: > > Anyone can share their table with other MTU values? > > > > Also interested into Switch CPU load > > > > KR, > > Manuel > > > > -----Mensaje original----- > > De: Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx> > > Enviado el: miércoles, 27 de mayo de 2020 12:01 > > Para: chris.palmer <chris.palmer@xxxxxxxxx>; paul.emmerich < > paul.emmerich@xxxxxxxx> > > CC: amudhan83 <amudhan83@xxxxxxxxx>; anthony.datri < > anthony.datri@xxxxxxxxx>; ceph-users <ceph-users@xxxxxxx>; doustar < > doustar@xxxxxxxxxxxx>; kdhall <kdhall@xxxxxxxxxxxxxx>; sstkadu < > sstkadu@xxxxxxxxx> > > Asunto: Re: [External Email] Re: Ceph Nautius not working > after setting MTU 9000 > > > > > > Interesting table. I have this on a production cluster 10gbit at a > > datacenter (obviously doing not that much). 
> > > > > > [@]# iperf3 -c 10.0.0.13 -P 1 -M 9000 > > Connecting to host 10.0.0.13, port 5201 > > [ 4] local 10.0.0.14 port 52788 connected to 10.0.0.13 port 5201 > > [ ID] Interval Transfer Bandwidth Retr Cwnd > > [ 4] 0.00-1.00 sec 1.14 GBytes 9.77 Gbits/sec 0 690 KBytes > > [ 4] 1.00-2.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.08 MBytes > > [ 4] 2.00-3.00 sec 1.15 GBytes 9.88 Gbits/sec 0 1.08 MBytes > > [ 4] 3.00-4.00 sec 1.15 GBytes 9.88 Gbits/sec 0 1.08 MBytes > > [ 4] 4.00-5.00 sec 1.15 GBytes 9.88 Gbits/sec 0 1.08 MBytes > > [ 4] 5.00-6.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.21 MBytes > > [ 4] 6.00-7.00 sec 1.15 GBytes 9.89 Gbits/sec 0 1.21 MBytes > > [ 4] 7.00-8.00 sec 1.15 GBytes 9.88 Gbits/sec 0 1.21 MBytes > > [ 4] 8.00-9.00 sec 1.15 GBytes 9.89 Gbits/sec 0 1.21 MBytes > > [ 4] 9.00-10.00 sec 1.15 GBytes 9.89 Gbits/sec 0 1.21 MBytes > > - - - - - - - - - - - - - - - - - - - - - - - - - > > [ ID] Interval Transfer Bandwidth Retr > > [ 4] 0.00-10.00 sec 11.5 GBytes 9.87 Gbits/sec 0 > > sender > > [ 4] 0.00-10.00 sec 11.5 GBytes 9.87 Gbits/sec > > receiver > > > > > > -----Original Message----- > > Subject: Re: Re: [External Email] Re: Ceph Nautius not > > working after setting MTU 9000 > > > > To elaborate on some aspects that have been mentioned already and add > > some others:: > > > > > > * Test using iperf3. > > > > * Don't try to use jumbos on networks where you don't have complete > > control over every host. This usually includes the main ceph network. > > It's just too much grief. You can consider using it for limited-access > > networks (e.g. ceph cluster network, hypervisor migration network, etc) > > where you know every switch & host is tuned correctly. (This works even > > when those nets share a vlan trunk with non-jumbo vlans - just set the > > max value on the trunk itself, and individual values on each vlan.) > > > > * If you are pinging make sure it doesn't fragment otherwise you > > will get misleading results: e.g. ping -M do -s 9000 x.x.x.x > > * Do not assume that 9000 is the best value. It depends on your > > NICs, your switch, kernel/device parameters, etc. Try different values > > (using iperf3). As an example the results below are using a small cheap > > Mikrotek 10G switch and HPE 10G NICs. It highlights how in this > > configuration 9000 is worse than 1500, but that 5139 is optimal yet 5140 > > is worst. The same pattern (obviously with different values) was > > apparent when multiple tests were run concurrently. Always test your own > > network in a controlled manner. And of course if you introduce anything > > different later on, test again. With enterprise-grade kit this might not > > be so common, but always test if you fiddle. > > > > > > MTU Gbps (actual data transfer values using iperf3) - one particular > > configuration only > > > > 9600 8.91 (max value) > > 9000 8.91 > > 8000 8.91 > > 7000 8.91 > > 6000 8.91 > > 5500 8.17 > > 5200 7.71 > > 5150 7.64 > > 5140 7.62 > > 5139 9.81 (optimal) > > 5138 9.81 > > 5137 9.81 > > 5135 9.81 > > 5130 9.81 > > 5120 9.81 > > 5100 9.81 > > 5000 9.81 > > 4000 9.76 > > 3000 9.68 > > 2000 9.28 > > 1500 9.37 (default) > > > > > > Whether any of this will make a tangible difference for ceph is moot. I > > just spend a little time getting the network stack correct as above, > > then leave it. That way I know I am probably getting some benefit, and > > not doing any harm. If you blindly change things you may well do harm > > that can manifest itself in all sorts of ways outside of Ceph. 
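For reference, a rough sketch of such a controlled sweep, run as root on a test interface. Everything here is a placeholder assumption, not part of the thread: eth0 is the interface under test, 10.0.0.13 is a peer already running "iperf3 -s", and every device in the path already accepts the largest candidate MTU. The awk column numbers can differ between iperf3 versions, so treat the extraction as approximate:

    for mtu in 1500 4000 5000 5139 5140 9000; do
        ip link set dev eth0 mtu "$mtu"
        # payload = MTU minus 28 bytes (20-byte IP + 8-byte ICMP header); -M do forbids fragmentation
        ping -M do -c 3 -s $((mtu - 28)) 10.0.0.13 >/dev/null || echo "MTU $mtu: path blocks unfragmented packets"
        printf 'MTU %s: ' "$mtu"
        iperf3 -c 10.0.0.13 -t 10 | awk '/receiver/ {print $7, $8}'
    done

Run it several times and concurrently, as described above, before trusting any single number.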
Getting > > some test results for this using Ceph will be easy; getting MEANINGFUL > > results that way will be hard. > > > > > > Chris > > > > > > On 27/05/2020 09:25, Marc Roos wrote: > > > > > > > > > > I would not call a ceph page, a random tuning tip. At least I hope > > they > > are not. NVMe-only with 100Gbit is not really a standard setup. I > > assume > > with such setup you have the luxury to not notice many > > optimizations. > > > > What I mostly read is that changing to mtu 9000 will allow you to > > better > > saturate the 10Gbit adapter, and I expect this to show on a low end > > busy > > cluster. Don't you have any test results of such a setup? > > > > > > > > > > -----Original Message----- > > > > Subject: Re: Re: [External Email] Re: Ceph Nautius not > > > > working after setting MTU 9000 > > > > Don't optimize stuff without benchmarking *before and after*, don't > > > > apply random tuning tipps from the Internet without benchmarking > > them. > > > > My experience with Jumbo frames: 3% performance. On a NVMe-only > > setup > > with 100 Gbit/s network. > > > > Paul > > > > > > -- > > Paul Emmerich > > > > Looking for help with your Ceph cluster? Contact us at > > https://croit.io > > > > croit GmbH > > Freseniusstr. 31h > > 81247 München > > www.croit.io > > Tel: +49 89 1896585 90 > > > > On Tue, May 26, 2020 at 7:02 PM Marc Roos > > <M.Roos@xxxxxxxxxxxxxxxxx> <mailto:M.Roos@xxxxxxxxxxxxxxxxx> > > wrote: > > > > > > > > > > Look what I have found!!! :) > > https://ceph.com/geen-categorie/ceph-loves-jumbo-frames/ > > > > > > > > -----Original Message----- > > From: Anthony D'Atri [mailto:anthony.datri@xxxxxxxxx] > > Sent: maandag 25 mei 2020 22:12 > > To: Marc Roos > > Cc: kdhall; martin.verges; sstkadu; amudhan83; ceph-users; > > doustar > > Subject: Re: Re: [External Email] Re: Ceph > > Nautius not > > > > working after setting MTU 9000 > > > > Quick and easy depends on your network infrastructure. > > Sometimes > > it is > > difficult or impossible to retrofit a live cluster without > > disruption. > > > > > > > On May 25, 2020, at 1:03 AM, Marc Roos > > <M.Roos@xxxxxxxxxxxxxxxxx> <mailto:M.Roos@xxxxxxxxxxxxxxxxx> > > > > wrote: > > > > > > > > > I am interested. I am always setting mtu to 9000. To be > > honest I > > > cannot imagine there is no optimization since you have > less > > interrupt > > > requests, and you are able x times as much data. Every > time > > there > > > > > something written about optimizing the first thing > mention > > is > > changing > > > > > to the mtu 9000. Because it is quick and easy win. > > > > > > > > > > > > > > > -----Original Message----- > > > From: Dave Hall [mailto:kdhall@xxxxxxxxxxxxxx] > > > Sent: maandag 25 mei 2020 5:11 > > > To: Martin Verges; Suresh Rama > > > Cc: Amudhan P; Khodayar Doustar; ceph-users > > > Subject: Re: [External Email] Re: Ceph > Nautius > > not > > > working after setting MTU 9000 > > > > > > All, > > > > > > Regarding Martin's observations about Jumbo Frames.... > > > > > > I have recently been gathering some notes from various > > internet > > > sources regarding Linux network performance, and Linux > > performance in > > > general, to be applied to a Ceph cluster I manage but > also > > to the > > rest > > > > > of the Linux server farm I'm responsible for. > > > > > > In short, enabling Jumbo Frames without also tuning a > number > > of > > other > > > kernel and NIC attributes will not provide the > performance > > increases > > > we'd like to see. 
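A sketch of the kind of kernel and NIC queue/buffer settings being referred to here, including the per-NIC output queue ("ip link") and per-CPU input queue ("sysctl") mentioned earlier in the thread. The interface name and every number below are illustrative placeholders, not recommendations; real values should be derived from your own link speed and latency (bandwidth-delay product) and benchmarked before and after:

    # illustrative placeholder values only - size them from your own
    # bandwidth-delay product and verify with before/after benchmarks
    ip link set dev eth0 txqueuelen 10000              # per-NIC output queue
    sysctl -w net.core.netdev_max_backlog=250000       # per-CPU input queue for received packets
    sysctl -w net.core.rmem_max=67108864               # ceiling for socket receive buffers
    sysctl -w net.core.wmem_max=67108864               # ceiling for socket send buffers
    sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"  # min/default/max TCP receive buffer
    sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"  # min/default/max TCP send buffer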
I have not yet had a chance to go > through > > the > > rest > > > of the testing I'd like to do, but I can confirm (via > > iperf3) > > that > > > only enabling Jumbo Frames didn't make a significant > > difference. > > > > > > Some of the other attributes I'm referring to are > incoming > > and > > > outgoing buffer sizes at the NIC, IP, and TCP levels, > > interrupt > > > coalescing, NIC offload functions that should or > shouldn't > > be > > turned > > > on, packet queuing disciplines (tc), the best choice of > TCP > > slow-start > > > > > algorithms, and other TCP features and attributes. > > > > > > The most off-beat item I saw was something about adding > > IPTABLES > > rules > > > > > to bypass CONNTRACK table lookups. > > > > > > In order to do anything meaningful to assess the effect > of > > all of > > > > > these settings I'd like to figure out how to set them all > > via > > Ansible > > > - so more to learn before I can give opinions. > > > > > > --> If anybody has added this type of configuration to > Ceph > > > > Ansible, > > > I'd be glad for some pointers. > > > > > > I have started to compile a document containing my notes. > > It's > > rough, > > > > > but I'd be glad to share if anybody is interested. > > > > > > -Dave > > > > > > Dave Hall > > > Binghamton University > > > > > >> On 5/24/2020 12:29 PM, Martin Verges wrote: > > >> > > >> Just save yourself the trouble. You won't have any real > > benefit > > from > > > MTU > > >> 9000. It has some smallish, but it is not worth the > effort, > > > > problems, > > > and > > >> loss of reliability for most environments. > > >> Try it yourself and do some benchmarks, especially with > > your > > regular > > >> workload on the cluster (not the maximum peak > performance), > > then > > drop > > > the > > >> MTU to default ;). > > >> > > >> Please if anyone has other real world benchmarks showing > > huge > > > differences > > >> in regular Ceph clusters, please feel free to post it > here. > > >> > > >> -- > > >> Martin Verges > > >> Managing director > > >> > > >> Mobile: +49 174 9335695 > > >> E-Mail: martin.verges@xxxxxxxx > > >> Chat: https://t.me/MartinVerges > > >> > > >> croit GmbH, Freseniusstr. 31h, 81247 Munich > > >> CEO: Martin Verges - VAT-ID: DE310638492 Com. register: > > Amtsgericht > > >> Munich HRB 231263 > > >> > > >> Web: https://croit.io > > >> YouTube: https://goo.gl/PGE1Bx > > >> > > >> > > >>> Am So., 24. Mai 2020 um 15:54 Uhr schrieb Suresh Rama > > >> <sstkadu@xxxxxxxxx> <mailto:sstkadu@xxxxxxxxx> : > > >> > > >>> Ping with 9000 MTU won't get response as I said and it > > should > > be > > > 8972. Glad > > >>> it is working but you should know what happened to > avoid > > this > > issue > > > later. > > >>> > > >>>> On Sun, May 24, 2020, 3:04 AM Amudhan P > > <amudhan83@xxxxxxxxx> <mailto:amudhan83@xxxxxxxxx> > > wrote: > > >>> > > >>>> No, ping with MTU size 9000 didn't work. > > >>>> > > >>>> On Sun, May 24, 2020 at 12:26 PM Khodayar Doustar > > > <doustar@xxxxxxxxxxxx> <mailto:doustar@xxxxxxxxxxxx> > > >>>> wrote: > > >>>> > > >>>>> Does your ping work or not? > > >>>>> > > >>>>> > > >>>>> On Sun, May 24, 2020 at 6:53 AM Amudhan P > > <amudhan83@xxxxxxxxx> <mailto:amudhan83@xxxxxxxxx> > > > wrote: > > >>>>> > > >>>>>> Yes, I have set setting on the switch side also. > > >>>>>> > > >>>>>> On Sat 23 May, 2020, 6:47 PM Khodayar Doustar, > > > <doustar@xxxxxxxxxxxx> <mailto:doustar@xxxxxxxxxxxx> > > >>>>>> wrote: > > >>>>>> > > >>>>>>> Problem should be with network. 
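To make the NIC-level items from the list further up (interrupt coalescing, offload functions, the conntrack bypass) concrete, a hedged sketch of how they are usually inspected and changed. eth0, the example values, and the 6800:7300 port range are placeholder assumptions to adapt to your own hardware and cluster, and whether adaptive coalescing is available depends on the driver:

    ethtool -c eth0                                # show interrupt-coalescing settings
    ethtool -C eth0 adaptive-rx on adaptive-tx on  # e.g. let the driver adapt coalescing
    ethtool -k eth0                                # show offload features (TSO/GSO/GRO/...)
    ethtool -K eth0 tso on gso on gro on           # e.g. toggle offloads, then re-benchmark
    ethtool -g eth0                                # show RX/TX ring sizes
    ethtool -G eth0 rx 4096 tx 4096                # e.g. enlarge rings if the NIC supports it
    # the conntrack bypass mentioned above; 6800:7300 is the usual Ceph OSD port range
    iptables -t raw -A PREROUTING -p tcp --dport 6800:7300 -j NOTRACK
    iptables -t raw -A OUTPUT     -p tcp --sport 6800:7300 -j NOTRACK

As with everything else in this thread: measure before and after each individual change.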
When you change the MTU, it should be changed all over the network: every single hop on your network should speak and accept 9000-MTU packets. You can check it on your hosts with the "ifconfig" command, and there are equivalent commands for other network/security devices.

If you have even one node that is not correctly configured for MTU 9000, it won't work.

On Sat, May 23, 2020 at 2:30 PM sinan@xxxxxxxx <sinan@xxxxxxxx> wrote:
> Can the servers/nodes ping each other using large packet sizes? I guess not.
>
> Sinan Polat
>
>> On 23 May 2020, at 14:21, Amudhan P <amudhan83@xxxxxxxxx> wrote:
>> In the OSD logs: "heartbeat_check: no reply from OSD"
>>
>>> On Sat, May 23, 2020 at 5:44 PM Amudhan P <amudhan83@xxxxxxxxx> wrote:
>>> Hi,
>>>
>>> I have set the network switch with MTU size 9000 and also in my netplan configuration.
>>>
>>> What else needs to be checked?
>>>
>>>> On Sat, May 23, 2020 at 3:39 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>
>>>>> On 5/23/20 12:02 PM, Amudhan P wrote:
>>>>> Hi,
>>>>>
>>>>> I am using Ceph Nautilus on Ubuntu 18.04, working fine with MTU size 1500 (default); recently I tried to update the MTU size to 9000.
>>>>> After setting jumbo frames, running ceph -s is timing out.
>>>>
>>>> Ceph can run just fine with an MTU of 9000. But there is probably something else wrong on the network which is causing this.
>>>>
>>>> Check the Jumbo Frames settings on all the switches as well to make sure they forward all the packets.
>>>>
>>>> This is definitely not a Ceph issue.
>>>>
>>>> Wido
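A minimal way to check the end-to-end jumbo path being discussed here; the interface name and address are placeholders, not values from the thread:

    ip link show eth0 | grep -o 'mtu [0-9]*'   # confirm the configured MTU on every host
    # 8972 = 9000 minus the 20-byte IP and 8-byte ICMP headers; -M do forbids fragmentation
    ping -M do -s 8972 -c 3 10.0.0.13          # repeat between every pair of Ceph hosts

An error such as "message too long" or 100% packet loss points at a host, switch port, or VLAN that is still using the smaller MTU.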
>>>>> regards
>>>>> Amudhan P
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx