Re: How to avoid 'bad port / jabber flood' = ceph killer?

On 27/01/2022 16:25, Harry G. Coin wrote:

1: What's a better way at 'mid-failure diagnosis time' to know directly which cable to pull instead of 'one by one until the offender is found'?

2: Related, in the same spirit as ceph's 'devicehealth', is there a way to profile 'usual and customary' traffic then alert when a 'known connection' exceeds their baseline?

We had a similar but different accident some weeks ago. Due to brownouts/blackouts, one of our main switches died. We did not have a proper spare sitting around, so we had to take a different brand and model. We created a new configuration backend to configure the switch and everything seemed OK, but the Ceph cluster was not happy at all.

Only some OSDs were up, and the cluster in general was behaving very strangely.

After some investigation, I found out that the new switch did not have jumbo frames enabled, which caused connections to stall as soon as larger amounts of data were transferred. So the handshake usually worked, but then the OSD hung at some more or less arbitrary point.
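A check along these lines might have caught the problem earlier. This is only a minimal sketch, assuming Linux hosts with the iputils `ping` and an intended MTU of 9000; the function names are made up for illustration, and the peer hosts to probe are passed on the command line. It sends one ICMP echo with the "don't fragment" bit set, sized to fill a whole frame. If the small probe gets through and the jumbo-sized one does not, some hop on the path is not passing jumbo frames.

```python
import subprocess
import sys

IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one IPv4 packet of size mtu."""
    payload = mtu - IP_HEADER - ICMP_HEADER
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small for an ICMP echo")
    return payload


def build_ping_cmd(host: str, mtu: int = 9000) -> list[str]:
    """Build an iputils ping command that probes the path with a full-size packet.

    -M do : set the don't-fragment bit, so oversized packets fail instead of
            being silently fragmented
    -c 1  : send a single probe
    -W 2  : wait at most 2 seconds for the reply
    -s N  : ICMP payload size in bytes
    """
    return ["ping", "-M", "do", "-c", "1", "-W", "2",
            "-s", str(icmp_payload_for_mtu(mtu)), host]


def path_carries_mtu(host: str, mtu: int = 9000) -> bool:
    """Return True if a full-size, unfragmented echo to host got a reply."""
    result = subprocess.run(build_ping_cmd(host, mtu),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


if __name__ == "__main__":
    # Hosts are given as arguments, e.g. the cluster-network addresses
    # of the other OSD hosts (assumed to answer ICMP echo).
    for host in sys.argv[1:]:
        small = path_carries_mtu(host, 1500)
        jumbo = path_carries_mtu(host, 9000)
        print(f"{host}: mtu1500={'ok' if small else 'FAIL'} "
              f"mtu9000={'ok' if jumbo else 'FAIL'}")
```

Run from each OSD host against every other OSD host, a result of "mtu1500=ok mtu9000=FAIL" would match the symptom described above: small packets such as the handshake pass, large transfers stall.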

Ceph is quite sensitive to network problems in general, and there are quite a few possibilities for improvement.

For example: connection checks, prioritizing known-good OSD hosts, and fair queuing of client traffic.

kind regards
 poelzi
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


