On 27/01/2022 16:25, Harry G. Coin wrote:
> 1: What's a better way at 'mid-failure diagnosis time' to know directly
> which cable to pull instead of 'one by one until the offender is found'?
> 2: Related, in the same spirit as ceph's 'devicehealth', is there a way
> to profile 'usual and customary' traffic then alert when a 'known
> connection' exceeds their baseline?
We had a similar but different incident some weeks ago. Due to
brownouts/blackouts, one of our main switches died. We did not have a
proper spare sitting around, so we had to use a different brand and model.
We created a new configuration backend to configure the switch and
everything seemed ok, but the ceph cluster was not happy at all.
Only some OSDs were up, and in general it was behaving super strangely.
After some investigation, I found out that the new switch did not have
jumbo frames enabled, which caused connections to stall as soon as more
data got transferred. So the handshake usually worked, but then the OSD
hung at some more or less arbitrary point.
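A check along these lines would probably have pointed at the switch much
sooner. It is only a rough sketch, under some assumptions: Linux with the
iputils ping (-M do sets Don't Fragment), IPv4, and an intended MTU of 9000
on the cluster network. The function names are made up for illustration.
The idea is to send a ping that only fits if every hop passes jumbo frames;
small pings still succeed on a misconfigured switch, which is why the
handshake looked fine.

```python
import subprocess

IP_HEADER = 20    # IPv4 header without options, in bytes
ICMP_HEADER = 8   # ICMP echo header, in bytes


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - ICMP_HEADER


def ping_command(host, mtu=9000):
    """Build an iputils ping call that must not be fragmented on the path."""
    # -M do: set Don't Fragment, -c: count, -W: reply timeout in seconds,
    # -s: ICMP payload size in bytes
    return ["ping", "-M", "do", "-c", "3", "-W", "2",
            "-s", str(icmp_payload_for_mtu(mtu)), host]


def path_passes_mtu(host, mtu=9000):
    """True if full-size, non-fragmentable pings to host are answered."""
    result = subprocess.run(ping_command(host, mtu),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


# Example use (host name is a placeholder, not from our setup):
#   if not path_passes_mtu("osd-host-1"):
#       print("osd-host-1: large frames do not get through, check switch MTU")
```

Running this from each OSD host against every other host on the cluster
network gives a quick pass/fail matrix. A failure at 9000 combined with a
pass at 1500 is a strong hint that something in between drops jumbo frames.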
Ceph is quite sensitive to network problems in general, and there are
quite a few possibilities for improvement:
connection checks, for example, prioritizing known-good OSD hosts,
and fair queuing of client traffic.
kind regards
poelzi