Re: Fwd: Re: Issues with Ceph network redundancy using L2 MC-LAG

Frank Schilder <frans@xxxxxx> · Thu, 22 Jul 2021 08:57:18 +0000

Hi, I'm a bit late to the party. I use 6x10G active/active LACP bonds on Dell switches and servers and also observe very bad behaviour when a link is flapping. First, I get "long ping time" warnings and a lot of ops are stuck. It usually takes several minutes until the kernel/switch starts detecting the flapping link. Then I shut down the interface and everything goes back to normal.

Originally, I thought that active/active would deal with that gracefully, but it doesn't. On top of that, because some paths of the bond remain open, it is extremely difficult to pinpoint which host and interface is the culprit. I would much prefer an entire host going off-line over this unclear error state with long periods of service outage. Therefore, in the future I will not use link aggregation any more and will instead go for 25G active/passive, with the intent of being able to do maintenance but not to pseudo-increase bandwidth. With a single link failing, it will be much easier to handle the situation for Ceph and for me as an admin.

Partial traffic loss while heartbeats more or less manage to get through seems much worse than cutting off all traffic to a single host. At least the Ceph warnings will be more targeted and helpful.

With the current set-up what helped a bit is configuring link dampening. On the Dell switches, this can be used to suppress flapping ports for a reasonable amount of time. I use the setting "dampening 15 250 2500 86400" with some success. It helps with failing transceivers/ports. Typically, the failing link is suppressed before users start creating support tickets.
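
To make the dampening numbers easier to reason about, here is a minimal Python sketch of a generic exponential-decay dampening model. It is not the switch's actual implementation. It assumes the four values in "dampening 15 250 2500 86400" are, in order, half-life (seconds), reuse threshold, suppress threshold and maximum suppress time, and it assumes a penalty of 1024 per flap; the flap times are hypothetical, and the maximum-suppress-time cap is not modelled.

```python
import math


def penalty_after(flap_times, now, half_life=15.0, penalty_per_flap=1024.0):
    """Sum the per-flap penalties, each decayed exponentially up to `now`.

    penalty_per_flap=1024 is an assumption, not taken from the thread.
    """
    return sum(
        penalty_per_flap * 0.5 ** ((now - t) / half_life)
        for t in flap_times
        if t <= now
    )


def seconds_until_reuse(penalty, reuse=250.0, half_life=15.0):
    """Time needed for `penalty` to decay down to the reuse threshold."""
    if penalty <= reuse:
        return 0.0
    return half_life * math.log2(penalty / reuse)


# Assumed reading of "dampening 15 250 2500 86400":
# half-life 15 s, reuse 250, suppress 2500, max suppress time 86400 s.
SUPPRESS = 2500.0

flaps = [0.0, 1.0, 2.0]  # hypothetical: one flap per second
penalty_two = penalty_after(flaps[:2], now=1.0)
penalty_three = penalty_after(flaps, now=2.0)

print("suppressed after 2 flaps:", penalty_two >= SUPPRESS)
print("suppressed after 3 flaps:", penalty_three >= SUPPRESS)
print("seconds until reuse:", round(seconds_until_reuse(penalty_three), 1))
```

Under these assumptions a port that flaps once per second crosses the suppress threshold on the third flap, and it stays suppressed until the penalty has decayed below the reuse threshold, provided it does not flap again in the meantime.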

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Andrew Walker-Brown <andrew_jbrown@xxxxxxxxxxx>
Sent: 16 June 2021 10:18:02
To: huxiaoyu@xxxxxxxxxxxx
Cc: Joe Comeau; ceph-users
Subject:  Re: Fwd: Re: Issues with Ceph network redundancy using L2 MC-LAG

Depends on how you configure the switch port. For Dell:

interface ethernet 1/1/20
 no switchport
 channel-group 10 mode active
!

‘Mode active’ sets it up as a dynamic LACP LAG. Otherwise it would be a static LAG (‘mode on’).

With active mode, you then have a transmit hashing policy, usually set globally.

On Linux the bond would be set as ‘bond-mode 802.3ad’ and then ‘bond-xmit-hash-policy layer3+4’ - or whatever hashing policy you want.

Sent from my iPhone

On 16 Jun 2021, at 08:57, huxiaoyu@xxxxxxxxxxxx wrote:

Is it true that MC-LAG and 802.3ad work in active-active mode by default?

What else should I take care of to ensure fault tolerance when one path is bad?

best regards,

samuel

huxiaoyu@xxxxxxxxxxxx

From: Joe Comeau
Date: 2021-06-15 23:44
To: ceph-users@xxxxxxx
Subject:  Fwd: Re: Issues with Ceph network redundancy using L2 MC-LAG

We also run with Dell VLT switches (40 Gb).
Everything is active/active, so there are multiple paths, as Andrew
describes in his config.
Our config allows us to:
  bring down one of the switches for upgrades
  bring down an iSCSI gateway for patching
all the while at least one path is up and servicing.
Thanks, Joe

>>> Andrew Walker-Brown <andrew_jbrown@xxxxxxxxxxx> 6/15/2021 10:26 AM
>>>
With an unstable link/port you could see the issues you describe.  Ping
doesn’t have the packet rate for you to necessarily have a packet in
transit at exactly the same time as the port fails temporarily.  Iperf,
on the other hand, could certainly show the issue: its higher packet rate
makes it more likely to have packets in flight at the time of a link
failure, and the resulting packet loss/retries give poor throughput.
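
The ping versus iperf point can be put into rough numbers. The Python sketch below is only an illustration: the 50 ms outage is a hypothetical value, ping is taken at its default of one packet per second, and iperf is assumed to fill a 10G link with 1500-byte frames, ignoring framing overhead.

```python
def line_rate_pps(link_bps, frame_bytes=1500):
    """Approximate packets per second on a saturated link (overhead ignored)."""
    return link_bps / (frame_bytes * 8)


def packets_in_outage(rate_pps, outage_s):
    """Expected number of packets sent while the link is misbehaving."""
    return rate_pps * outage_s


OUTAGE_S = 0.05  # hypothetical 50 ms link blip

ping_hit = packets_in_outage(1.0, OUTAGE_S)                  # 1 packet per second
iperf_hit = packets_in_outage(line_rate_pps(10e9), OUTAGE_S)  # saturated 10G link

print(f"ping:  {ping_hit:.2f} packets expected during the blip")
print(f"iperf: {iperf_hit:.0f} packets expected during the blip")
```

With these assumptions ping will usually miss the blip entirely, while a saturated iperf stream has tens of thousands of packets exposed to it.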

Depending on what you want to happen, there are a number of tuning
options both on the switches and in Linux.  If you want the LAG to go down
if any link fails, then you should be able to configure this on the switches
and/or in Linux (minimum number of links = 2 if you have 2 links in the
LAG).

You can also tune the link monitoring, i.e. how frequently the links are
checked (e.g. miimon).  Bringing this value down from the default of
100ms may allow you to detect a link failure more quickly.  But you then
run the risk of detecting a transient failure that wouldn’t have
caused any issues, and of the LAG becoming more unstable.
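
Before changing any of these values it helps to see what the bond is currently using. The following Python sketch reads a few bonding parameters from sysfs on Linux. The bond name "bond0" and the helper name are illustrative assumptions; entries that are not present on a given system are reported as None.

```python
from pathlib import Path

# Bonding parameters exposed by the Linux bonding driver under
# /sys/class/net/<bond>/bonding/
PARAMS = ("mode", "miimon", "min_links", "lacp_rate", "xmit_hash_policy")


def read_bond_settings(bond="bond0", sysfs_root="/sys/class/net"):
    """Return the current value of each parameter, or None if it is missing."""
    base = Path(sysfs_root) / bond / "bonding"
    settings = {}
    for name in PARAMS:
        path = base / name
        settings[name] = path.read_text().strip() if path.exists() else None
    return settings


if __name__ == "__main__":
    for key, value in read_bond_settings().items():
        print(f"{key}: {value}")
```

The sysfs_root argument exists only so the function can be pointed at a test directory; on a real host the default applies.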

Flapping/unstable links are the worst kind of situation.  Ideally you’d
pick that up quickly from monitoring/alerts and either fix immediately
or take the link down until you can fix it.

I run 2x10G from my hosts into separate switches (Dell S series – VLT
between switches).  Pulling a single interface has no impact on Ceph,
any packet loss is tiny and we’re not exceeding 10G bandwidth per host.

If you’re running 1G links and the LAG is already busy, a link failure
could be causing slow writes to the host, just down to
congestion...which then starts to impact the wider cluster based on how
Ceph works.
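
As a rough illustration of the congestion argument, the Python sketch below compares offered load with the capacity left after a link failure. The traffic figures are hypothetical, and the model is simplified: it treats the LAG as one pipe, whereas LACP hashing pins each flow to a single member link.

```python
def utilisation_after_failure(offered_gbps, link_gbps, links_total, links_failed=1):
    """Offered load divided by the capacity remaining after link failures."""
    remaining = (links_total - links_failed) * link_gbps
    if remaining <= 0:
        return float("inf")
    return offered_gbps / remaining


# Hypothetical loads: a busy 2x1G LAG versus a lightly loaded 2x10G LAG.
busy_1g = utilisation_after_failure(offered_gbps=1.4, link_gbps=1, links_total=2)
idle_10g = utilisation_after_failure(offered_gbps=6.0, link_gbps=10, links_total=2)

print(f"2x1G carrying 1.4 Gb/s, one link lost: {busy_1g:.0%} of remaining capacity")
print(f"2x10G carrying 6 Gb/s, one link lost:  {idle_10g:.0%} of remaining capacity")
```

Anything above 100% means the surviving link cannot carry the offered traffic, which is where slow writes would start to show up.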

Just caveating the above with - I’m relatively new to Ceph myself....

Sent from Mail for Windows 10

From: huxiaoyu@xxxxxxxxxxxx
Sent: 15 June 2021 17:52
To: Serkan Çoban <cobanserkan@xxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxx>
Subject:  Re: Issues with Ceph network redundancy using L2 MC-LAG

When I pull out the cable, the bond works properly.

Does it mean that the port is somehow flapping? Ping can still work,
but the iperf test yields very low results.

huxiaoyu@xxxxxxxxxxxx

From: Serkan Çoban
Date: 2021-06-15 18:47
To: huxiaoyu@xxxxxxxxxxxx
CC: ceph-users
Subject: Re:  Issues with Ceph network redundancy using L2 MC-LAG

Do you observe the same behaviour when you pull a cable?
A flapping port might cause this kind of behaviour; other than
that, you shouldn't see any network disconnects.
Are you sure about the LACP configuration? What is the output of
'cat /proc/net/bonding/bond0'?
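
When reading that output, the per-slave "MII Status" and "Link Failure Count" fields are the ones that point at a flapping interface. A minimal Python sketch for pulling them out is shown below; the sample text is a hypothetical, heavily abbreviated excerpt and the interface names are made up.

```python
def parse_bond_slaves(text):
    """Extract name, MII status and link failure count for each slave."""
    slaves = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = {"name": value, "mii_status": None, "link_failures": None}
            slaves.append(current)
        elif current is not None:
            # Lines before the first "Slave Interface" describe the bond itself.
            if key == "MII Status":
                current["mii_status"] = value
            elif key == "Link Failure Count":
                current["link_failures"] = int(value)
    return slaves


def suspect_slaves(slaves):
    """Names of slaves that are down or have recorded link failures."""
    return [
        s["name"]
        for s in slaves
        if s["mii_status"] != "up" or (s["link_failures"] or 0) > 0
    ]


# Hypothetical, abbreviated sample of /proc/net/bonding/bond0
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: eth0
MII Status: up
Link Failure Count: 0

Slave Interface: eth1
MII Status: up
Link Failure Count: 37
"""

print(suspect_slaves(parse_bond_slaves(SAMPLE)))
```

On a real host the text would come from reading /proc/net/bonding/bond0; a failure count that keeps rising between two readings is a strong hint that the link is flapping.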

On Tue, Jun 15, 2021 at 7:19 PM huxiaoyu@xxxxxxxxxxxx
<huxiaoyu@xxxxxxxxxxxx> wrote:
>
> Dear Cephers,
>
> I have encountered the following networking issue several times, and I
> wonder whether there is a solution for networking HA.
>
> We build Ceph using an L2 multi-chassis link aggregation group (MC-LAG)
> to provide switch redundancy. On each host, we use 802.3ad (LACP)
> mode for NIC redundancy. However, we have observed several times that
> when a single network port fails, whether it is the cable or the SFP+
> optical module, the Ceph cluster is badly affected by networking,
> although in theory it should be able to tolerate this.
>
> Did I miss something important here? And how can one really achieve
> networking HA in a Ceph cluster?
>
> best regards,
>
> Samuel
>
> huxiaoyu@xxxxxxxxxxxx
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
