Re: Network design issues

Frank Schilder <frans@xxxxxx> · Sun, 21 Feb 2021 08:51:47 +0000

Hi Stefan,

thanks for the additional info. Dell will put me in touch with their deployment team soonish and then I can ask about matching abilities.

It turns out that the problem I observed might have a much more mundane cause. I saw really long periods of high ping times yesterday and finally managed to pin them down to a flapping link. My best guess is that an SFP transceiver has gone bad.

What really surprises me is that the switch does not seem to have any flap detection. It happily takes the port up and down several times per second. Unfortunately, I can't find anything about server-side flap detection on mode=4 (802.3ad) bonds, nor for members of a LAG on the switch. Do you know of anything that does that? I might be searching for the wrong term.

We have quite high redundancy. I can lose up to 3 ports on a server before the aggregated bandwidth might become too small. Therefore, I would be happy to accept the occasional false positive as long as we don't miss the real flaps. Something like "permanently shut down the interface if it goes down and up 3 times within a second" would be perfect, ideally without having to watch the logs.
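
As an illustration only, the rule quoted above could be sketched as a small sliding-window detector. The class name `FlapDetector`, its parameters, and the assumption that something else feeds it timestamped link-state samples are all hypothetical; this is not an existing bonding or switch feature.

```python
from collections import deque


class FlapDetector:
    """Latches once a link completes `max_flaps` down-up cycles within `window_s` seconds."""

    def __init__(self, max_flaps=3, window_s=1.0):
        self.max_flaps = max_flaps
        self.window_s = window_s
        self.last_up = True         # assumption: the link starts in the up state
        self.recoveries = deque()   # timestamps of down->up transitions
        self.tripped = False        # latched; stays True until reset by hand

    def observe(self, timestamp, link_up):
        """Feed one link-state sample; return True if the port should be disabled."""
        if link_up and not self.last_up:
            self.recoveries.append(timestamp)
        self.last_up = link_up
        # Forget down-up cycles that have fallen out of the sliding window.
        while self.recoveries and timestamp - self.recoveries[0] > self.window_s:
            self.recoveries.popleft()
        if len(self.recoveries) >= self.max_flaps:
            self.tripped = True
        return self.tripped
```

On a Linux server the samples might come from polling `/sys/class/net/<iface>/carrier`, and a tripped detector could trigger `ip link set <iface> down`. Polling can miss transitions at several flaps per second, so an event-driven source would probably be more reliable.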

For the future, I plan to go 25G active-passive without preferred port. This config will handle flapping gracefully.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: 15 February 2021 21:24:09
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re:  Re: Network design issues

On 2/15/21 5:38 PM, Frank Schilder wrote:
> Hi Stefan,
>
> I think you gave me the right pointers.
>
> Last summer I was looking up exactly this: how Dell switches hash connections onto members of a LAG. What I found was that the only option was hashing by MAC. I ran a test with iperf using several connections between the same two servers, and from one server to many. The test confirmed what I had found in the documentation: all connections between two servers shared a single 10G member, while one-to-many connections were distributed over multiple members. Back then, I thought this was it and didn't look into it further.
>
> Now, after your hints, I went back to the manual and found that the switches actually do support more advanced hash functions, at least after enabling ECMP, which is disabled by default. I'm not sure whether I was reading the manual for the wrong switch family; I have no idea where I found the "MAC only" statement. I got in touch with Dell support to help me here, as the manual on load balancing is not exactly great.
>
> I can use MACs, IP, port, VLAN ID and a few other packet fields for hashing. I hope this is not limited to layer 3 routing. In particular, including the VLAN ID should help spread client and replication traffic out a bit better. Dell also supports defining salts to avoid polarisation, which I believe is hurting us as well at the moment.
>
> I have one last question. The Dell manual states that one can enable monitoring of load balancing and it will check every 15 seconds for imbalance across the members of a LAG. You wrote "... and with OVS you can balance the load between the LACP links (by default it evaluates every 10 seconds if it should move flows around)." How is this done? The hash function doesn't change, so how can port mappings be re-arranged in a predictable way? The Dell switches will only create log events, nothing more. The Dell manual uses the term "dynamic load balancing", but generating log messages is not really the same. Am I missing something?

When the workload is perfectly static, nothing changes. But that will
hardly ever be the case. Here is the info for OVS on this:

"Every 10 seconds, vswitchd rebalances the bond members (see
bond_rebalance()). To rebalance, vswitchd examines the statistics for
the number of bytes transmitted by each member over approximately the
past minute, with data sent more recently weighted more heavily than
data sent less recently. It considers each of the members in order from
most-loaded to least-loaded. If highly loaded member H is significantly
more heavily loaded than the least-loaded member L, and member H carries
at least two hashes, then vswitchd shifts one of H’s hashes to L.
However, vswitchd will only shift a hash from H to L if it will decrease
the ratio of the load between H and L by at least 0.1.

Currently, “significantly more loaded” means that H must carry at least
1 Mbps more traffic, and that traffic must be at least 3% greater than L’s."

So if it makes sense to move one or more flows to other links, it will
do so.
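
To make the quoted rule concrete, here is a minimal Python sketch of one rebalancing pass. It is not the actual bond_rebalance() code. The function names, the use of loads in Mbps, and the choice to move the hash that gives the most even split are assumptions for illustration; the quote does not say which hash gets picked.

```python
def load_ratio(a, b):
    """Ratio of the larger load to the smaller one (infinite if one side is idle)."""
    hi, lo = max(a, b), min(a, b)
    return float("inf") if lo == 0 else hi / lo


def significantly_more_loaded(h_load, l_load):
    """Quoted thresholds: at least 1 Mbps more, and at least 3% greater."""
    return h_load - l_load >= 1.0 and h_load >= 1.03 * l_load


def rebalance_once(members):
    """One pass over `members`: {member name: {hash id: load in Mbps}}.

    Mutates `members` in place and returns a list of (hash, from, to) moves.
    """
    def total(name):
        return sum(members[name].values())

    moves = []
    # Consider members from most-loaded to least-loaded.
    for h in sorted(members, key=total, reverse=True):
        l = min(members, key=total)  # current least-loaded member
        h_load, l_load = total(h), total(l)
        if h == l or len(members[h]) < 2:
            continue
        if not significantly_more_loaded(h_load, l_load):
            continue
        old_ratio = load_ratio(h_load, l_load)
        # Assumption: pick the hash whose move leaves H and L most even.
        best_hash, best_ratio = None, old_ratio
        for hash_id, x in members[h].items():
            new_ratio = load_ratio(h_load - x, l_load + x)
            if new_ratio < best_ratio:
                best_hash, best_ratio = hash_id, new_ratio
        # Only shift if the ratio improves by at least 0.1.
        if best_hash is not None and old_ratio - best_ratio >= 0.1:
            members[l][best_hash] = members[h].pop(best_hash)
            moves.append((best_hash, h, l))
    return moves
```

The hash function itself never changes in this picture; only the table mapping hash buckets to bond members does, which is why flows can be moved deliberately rather than at random.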

I guess the Dell switches will do something similar.

>
> For us, I think a bit more clever hashing and, maybe, higher priority for the replication VLAN will do. As far as I can see, our cluster is essentially running on 10G internally and anything better than that should do and be easy to achieve.
>
> Thanks for putting me on the right track.

Good to hear, I hope you manage to solve it.

Gr. Stefan