Re: How to avoid 'bad port / jabber flood' = ceph killer?

Daniel Poelzleithner <poelzi@xxxxxxxxxx> · Thu, 27 Jan 2022 18:08:14 +0100

On 27/01/2022 16:25, Harry G. Coin wrote:

> 1: What's a better way at 'mid-failure diagnosis time' to know directly
> which cable to pull instead of 'one by one until the offender is found'?
>
> 2: Related, in the same spirit as ceph's 'devicehealth', is there a way
> to profile 'usual and customary' traffic then alert when a 'known
> connection' exceeds their baseline?

We had a similar, though not identical, incident some weeks ago. Due to
brownouts/blackouts, one of our main switches died. We did not have a proper
spare sitting around, so we had to take a model from a different brand.
We created a new configuration backend to configure the switch, and
everything seemed OK, but the Ceph cluster was not happy at all.

Only some OSDs were up, and the cluster in general was behaving very strangely.

After some investigation, I found out that the new switch did not have
jumbo frames enabled, which caused connections to stall as soon as larger
amounts of data were transferred. The handshake usually worked (presumably
because its small packets fit into a standard-size frame), but then the OSD
hung at some more or less arbitrary point.
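
A mismatch like this can be checked for by sending a full-size ping with
fragmentation prohibited to each peer. The following is only a minimal
sketch, assuming IPv4, a Linux iputils `ping` (for the `-M do` option) and
an intended MTU of 9000; the helper names and the host names in the usage
line are made up for illustration and are not part of Ceph.

```python
import subprocess
import sys

IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits into one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def build_ping_cmd(host: str, mtu: int = 9000) -> list[str]:
    """Build an iputils ping command that sends one full-size, unfragmentable packet."""
    return [
        "ping",
        "-M", "do",   # set the DF bit, so oversized packets fail instead of fragmenting
        "-c", "1",    # a single probe
        "-W", "2",    # wait at most 2 seconds for the reply
        "-s", str(icmp_payload_for_mtu(mtu)),
        host,
    ]


def path_supports_mtu(host: str, mtu: int = 9000) -> bool:
    """Return True if a full-size packet reached the host and was answered."""
    result = subprocess.run(
        build_ping_cmd(host, mtu),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Hosts are taken from the command line.
    for host in sys.argv[1:]:
        status = "ok" if path_supports_mtu(host) else "FAILED"
        print(f"{host}: MTU 9000 {status}")
```

Run it from every OSD host against every other one, e.g.
`python3 mtu_check.py osd-host-1 osd-host-2`. A host that answers a normal
ping but fails this check points at a device in the path that drops large
frames.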

Ceph is quite sensitive to network problems in general, and there is
quite some room for improvement.

For example: connection checks, prioritizing known-good OSD hosts,
and fair queuing of client traffic.

kind regards
 poelzi