I would really appreciate advice, because I bet many of you have 'seen
this before', but I can't find a recipe.
There must be a 'better way' to respond to this situation: it starts
with a small, well-working Ceph cluster of 5 servers. With no apparent
change to the workload, it suddenly starts reporting lagging ops on three
or four OSDs and a mon. These come and go. Then a file system is
'degraded' owing to many PGs stuck 'peering', in the 'throttle' stage,
often for hours. In short, the whole composite platform becomes
effectively useless. The dashboard works, and command-line operations on
all the hosts still work. Strangely, the first call to
'dd if=/dev/sd<ceph drive> of=/dev/null bs=4096 count=100' can take 15
seconds, regardless of the drive or host. There is plenty of free memory
and storage, and no hardware-related drive or controller issues appear in
any of the logs.
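For what it's worth, that dd spot check was essentially the loop below.
The device names are placeholders for my layout, and iflag=direct is my
addition here to keep the page cache from flattering repeat runs:

  # Time a cold read from each suspected OSD data drive; on a healthy
  # cluster each of these finishes in well under a second.
  for dev in /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf; do
      echo "== $dev =="
      time dd if="$dev" of=/dev/null bs=4096 count=100 iflag=direct 2>/dev/null
  done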
The problem was resolved almost immediately, and all functions returned
to normal, after detaching a cable linking a wifi access point to the
'front side' commercial-grade 32-port switch that all the hosts are also
connected to. The wifi access point was lightly loaded with clients, with
no immediately obvious new devices or 'wardrivers'.
The culprit appears to be not an abruptly failing, but a slowly failing,
ethernet port and/or cable and/or "IoT" device.
1: What's a better way, at 'mid-failure diagnosis time', to know directly
which cable to pull, instead of pulling them one by one until the offender
is found? (A rough sketch of the kind of sweep I'm imagining follows below.)
2: Related, and in the same spirit as Ceph's 'devicehealth': is there a
way to profile 'usual and customary' traffic and then alert when a 'known
connection' exceeds its baseline? (Again, a sketch of what I mean follows.)
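To make question 1 concrete, this is the kind of sweep I imagine running
from one box while things are falling apart. It assumes passwordless ssh
to each host and that the Ceph public network rides on an interface named
eth0; the host names and interface are placeholders for my setup:

  #!/usr/bin/env bash
  # Mid-failure sweep from one box: per-host NIC error counters plus
  # host-to-host latency, to suggest which cable/port to pull first.
  hosts="ceph1 ceph2 ceph3 ceph4 ceph5"   # placeholder host names
  iface=eth0                              # placeholder public-network NIC

  for h in $hosts; do
      echo "===== $h ====="
      # A slowly dying port usually shows up as climbing error/drop counters.
      ssh "$h" "ethtool -S $iface | grep -Ei 'err|drop|crc' | grep -v ': 0$'"
      ssh "$h" "ip -s link show $iface | tail -n 4"
      # Latency and loss from here; a sick segment tends to show up as slow
      # or lossy pings between otherwise idle hosts.
      ping -c 5 -i 0.2 -q "$h" | tail -n 2
  done

Of course the offending port in my case belonged to the access point, not
a Ceph host, so the really direct answer is probably the managed switch's
own per-port error counters (web UI or SNMP), which a host-side sweep like
this never sees.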
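And for question 2, the shape of what I'm imagining is below: sample the
byte counters the kernel already keeps under /sys/class/net, maintain a
crude running average, and complain when the current rate runs several
times past it. The interface name, interval, factor and warm-up count are
all guesses on my part:

  #!/usr/bin/env bash
  # Crude "usual and customary" traffic alarm for one interface: sample the
  # kernel's byte counters, keep an integer exponential moving average, and
  # warn when the current rate runs well past that baseline.
  iface=eth0        # placeholder: the port you want to watch
  interval=10       # seconds between samples
  factor=4          # warn when rate > factor * baseline
  warmup=30         # samples to collect before trusting the baseline
  avg=0 samples=0

  read_bytes() {
      local rx tx
      rx=$(cat /sys/class/net/"$iface"/statistics/rx_bytes)
      tx=$(cat /sys/class/net/"$iface"/statistics/tx_bytes)
      echo $(( rx + tx ))
  }

  prev=$(read_bytes)
  while sleep "$interval"; do
      cur=$(read_bytes)
      rate=$(( (cur - prev) / interval ))
      prev=$cur
      if [ "$samples" -ge "$warmup" ] && [ "$rate" -gt $(( avg * factor )) ]; then
          echo "$(date '+%F %T') $iface: ${rate} B/s vs baseline ${avg} B/s" >&2
      fi
      avg=$(( avg + (rate - avg) / 8 ))   # EMA with weight 1/8
      samples=$(( samples + 1 ))
  done

In real life I suspect the grown-up answer is something like node_exporter
on the hosts plus SNMP polling of the switch's per-port counters, with
alert rules over those series; the loop above is just the idea in a
screenful of shell.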
Thanks in advance; I bet a good answer will help many people.
Harry Coin