On 2/21/21 9:51 AM, Frank Schilder wrote:
Hi Stefan,
thanks for the additional info. Dell will put me in touch with their deployment team soonish and then I can ask about matching abilities.
It turns out that the problem I observed might have a much more mundane cause. I saw really long periods of slow ping times yesterday and finally managed to pin them down to a flapping link. My best bet is that an SFP transceiver has gone bad.
What really surprises me is that the switch does not seem to have any flapping detection. It happily takes the port up and down several times per second. Unfortunately, I can't find anything about server-side flapping detection for mode=4 bonds, nor for members of a LAG on the switch. Do you know of anything that does that? I might be searching for the wrong term.
Flapping detection would indeed be the thing to search for. Flap (port
down/up) events could be trapped with SNMP. Not sure if you have an
SNMP (trap) infrastructure in place. Otherwise LibreNMS [1] is a nice
tool to set up to gather network-related info. According to a couple of
forum threads you should be able to do flapping detection and alerting
based on that [2,3]. You might also want to drop all those traps in an
IRC (or Matrix [4]) channel.
We have quite high redundancy. I can lose up to 3 ports on a server before the aggregated bandwidth might get too small. Therefore, I would be happy to take the occasional false positive as long as we don't miss the real flaps. Something like "permanently shut down interface if it does a down-up 3 times per second" would be perfect. Ideally without having to watch the logs.
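The rule quoted above ("shut down the interface if it does a down-up 3 times per second") can be sketched as a small sliding-window counter. This is only an illustration of the policy, not something the switch or the bonding driver provides: the class name `FlapDetector`, its parameters `max_flaps` and `window_s`, and the latching behaviour are all assumptions made for the example. Where the link-state samples come from and how the port is actually disabled are deliberately left out.

```python
from collections import deque


class FlapDetector:
    """Sliding-window counter for link down/up cycles (hypothetical helper)."""

    def __init__(self, max_flaps=3, window_s=1.0):
        self.max_flaps = max_flaps  # flaps tolerated inside one window
        self.window_s = window_s    # window length in seconds
        self._downs = deque()       # timestamps of recent up -> down transitions
        self._last_up = None        # last observed state; None before first sample
        self.tripped = False        # latched: stays True once the limit is hit

    def observe(self, link_up, now):
        """Feed one link-state sample; return True once the port should be disabled."""
        # Count a flap only on an up -> down transition.
        if self._last_up is True and not link_up:
            self._downs.append(now)
        self._last_up = link_up
        # Forget transitions that have fallen out of the window.
        while self._downs and now - self._downs[0] > self.window_s:
            self._downs.popleft()
        if len(self._downs) >= self.max_flaps:
            self.tripped = True
        return self.tripped


# Three down events within half a second trip the detector.
detector = FlapDetector(max_flaps=3, window_s=1.0)
samples = [(0.0, True), (0.1, False), (0.2, True),
           (0.3, False), (0.4, True), (0.5, False)]
fast_results = [detector.observe(up, t) for t, up in samples]

# The same pattern spread over several seconds does not.
slow = FlapDetector(max_flaps=3, window_s=1.0)
slow_results = [slow.observe(up, t) for t, up in
                [(0, True), (1, False), (2, True), (3, False), (4, True), (5, False)]]
```

Because the result is latched, the port stays marked for shutdown until someone resets the detector, which matches the "permanently shut down" wording and accepts the occasional false positive. On a Linux server the samples would have to be fed from some link-state source and the reaction wired to whatever disables the bond member; both are site-specific and not shown here.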
Gr. Stefan
[1]: https://www.librenms.org/
[2]: https://community.librenms.org/t/selected-interface-flapping-detection/10658
[3]: https://community.librenms.org/t/alert-port-flapping-up-down-too-much/10380
[4]: https://matrix.org/