How to avoid 'bad port / jabber flood' = ceph killer?


 



I would really appreciate advice because I bet many of you have 'seen this before' but I can't find a recipe.

There must be a 'better way' to respond to this situation: a small, well-working Ceph cluster of five servers, with no apparent change to the workload, suddenly starts reporting lagging ops on three or four OSDs and a mon. The warnings wave in and out. Then the file system goes 'degraded' owing to many PGs stuck 'peering', in the 'throttle' stage, often for hours. In short, the whole composite platform becomes effectively useless. The dashboard works, and command-line operations on all the hosts still work. Strangely, 'dd if=/dev/sd<ceph drive> of=/dev/null bs=4096 count=100' can take 15 seconds on the first call, regardless of the drive or host. There is lots of free space, both memory and storage, and no hardware-related drive or controller issues in any of the logs.

The problem was resolved almost immediately, and all functions returned to normal, after detaching the cable linking a wifi access point to the 'front side' commercial-grade 32-port switch that all the hosts also connect to. The wifi access point was lightly loaded with clients, with no immediately obvious new devices or 'wardrivers'.

The problem appears to be not an abruptly failing, but a slowly failing ethernet port, cable, and/or "IOT" device.

1: What's a better way, at 'mid-failure diagnosis time', to identify directly which cable to pull, instead of pulling them one by one until the offender is found?
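One crude approach I've considered (a hypothetical sketch, not an existing tool): on stock Linux kernels the per-interface error and drop counters live under /sys/class/net/<iface>/statistics/, so snapshotting them a few seconds apart on each host should point at the interface facing the jabbering port without pulling any cables.

```python
#!/usr/bin/env python3
"""Snapshot per-NIC error/drop counters to spot a jabbering port.

Hypothetical sketch: assumes the standard Linux sysfs layout
/sys/class/net/<iface>/statistics/. Run on each host; the interface
whose rx_errors / rx_dropped climbs fastest is the one to chase.
"""
import os

COUNTERS = ("rx_errors", "rx_dropped", "rx_crc_errors", "rx_packets")

def read_counters(sysfs_root="/sys/class/net"):
    """Return {iface: {counter: value}} for every interface present."""
    stats = {}
    for iface in sorted(os.listdir(sysfs_root)):
        stat_dir = os.path.join(sysfs_root, iface, "statistics")
        if not os.path.isdir(stat_dir):
            continue
        stats[iface] = {}
        for name in COUNTERS:
            path = os.path.join(stat_dir, name)
            if os.path.exists(path):
                with open(path) as f:
                    stats[iface][name] = int(f.read().strip())
    return stats

def deltas(before, after):
    """Per-interface counter increases between two snapshots."""
    return {
        iface: {k: after[iface][k] - before[iface].get(k, 0)
                for k in after[iface]}
        for iface in after if iface in before
    }

if __name__ == "__main__":
    import json, time
    a = read_counters()
    time.sleep(5)       # sampling window; tune to taste
    b = read_counters()
    print(json.dumps(deltas(a, b), indent=2))
```

The same counters are visible via 'ip -s link' and 'ethtool -S', and the managed-switch equivalent (per-port CRC/jabber counters over SNMP) would localize the offender without touching the hosts at all.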

2: Relatedly, in the same spirit as Ceph's 'devicehealth', is there a way to profile 'usual and customary' traffic and then alert when a 'known connection' exceeds its baseline?
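In case it helps frame the question, here is a minimal sketch of the kind of thing I mean; nothing here is an existing Ceph feature. Feed it periodic byte- or packet-rate samples per peer (from switch port counters, a flow exporter, whatever) and it flags any sample exceeding a multiple of its learned baseline.

```python
#!/usr/bin/env python3
"""EWMA per-connection traffic baseline with threshold alerts.

Hypothetical sketch of 'devicehealth for the network': learn a
baseline rate for each known peer, alert when a sample exceeds
threshold * baseline after a warm-up period.
"""

class TrafficBaseline:
    def __init__(self, alpha=0.1, threshold=3.0, warmup=10):
        self.alpha = alpha          # EWMA smoothing factor
        self.threshold = threshold  # alert when rate > threshold * baseline
        self.warmup = warmup        # samples required before alerts are armed
        self.ewma = {}              # peer -> learned baseline rate
        self.samples = {}           # peer -> sample count

    def update(self, peer, rate):
        """Record one rate sample for a peer; return True if anomalous."""
        n = self.samples.get(peer, 0)
        self.samples[peer] = n + 1
        base = self.ewma.get(peer)
        if base is None:
            self.ewma[peer] = float(rate)
            return False
        alert = n >= self.warmup and rate > self.threshold * base
        # Only fold well-behaved samples into the baseline, so a flood
        # does not teach the model that flooding is normal.
        if not alert:
            self.ewma[peer] = (1 - self.alpha) * base + self.alpha * rate
        return alert
```

A steady feed of ~100 pkt/s per peer would learn quietly; a burst of 1000 pkt/s from the access-point port would trip the alert while leaving the baseline intact.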

Thanks in advance, I bet a good answer will help many people.

Harry Coin








_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



