I would really appreciate advice, because I bet many of you have 'seen
this before', but I can't find a recipe.
There must be a 'better way' to respond to this situation: it starts
with a small, well-working Ceph cluster of 5 servers. With no apparent
change to the workload, it suddenly starts reporting lagging ops on three
or four OSDs and a mon. These come and go. Then a file system is
'degraded' owing to many PGs stuck 'peering', in the 'throttle' stage,
often for hours. In short, the whole composite platform becomes
effectively useless. The dashboard works, and command-line operations on
all the hosts still work. Strangely, the first call to
'dd if=/dev/sd<ceph drive> of=/dev/null bs=4096 count=100' can take 15
seconds, regardless of the drive or host. There is plenty of free memory
and storage, and no hardware-related drive or controller issues appear in
any of the logs.
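For what it's worth, that dd spot check was essentially the loop below.
The device names are placeholders for my layout, and iflag=direct is my
addition here to keep the page cache from flattering repeat runs:

  # Time a cold read from each suspected OSD data drive; on a healthy
  # cluster each of these finishes in well under a second.
  for dev in /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf; do
      echo "== $dev =="
      time dd if="$dev" of=/dev/null bs=4096 count=100 iflag=direct 2>/dev/null
  done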
The problem was resolved almost immediately, and all functions returned
to normal, after detaching a cable linking a wifi access point to the
'front side' commercial-grade 32-port switch that all the hosts are also
connected to. The wifi access point was lightly loaded with clients, with
no immediately obvious new devices or 'wardrivers'.
The culprit appears to be not an abruptly failing, but a slowly failing,
ethernet port and/or cable and/or "IoT" device.
1: What's a better way, at 'mid-failure diagnosis time', to know directly
which cable to pull, instead of pulling them one by one until the offender
is found? (A rough sketch of the kind of sweep I'm imagining follows below.)
2: Related, and in the same spirit as Ceph's 'devicehealth': is there a
way to profile 'usual and customary' traffic and then alert when a 'known
connection' exceeds its baseline? (Again, a sketch of what I mean follows.)
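To make question 1 concrete, this is the kind of sweep I imagine running
from one box while things are falling apart. It assumes passwordless ssh
to each host and that the Ceph public network rides on an interface named
eth0; the host names and interface are placeholders for my setup:

  #!/usr/bin/env bash
  # Mid-failure sweep from one box: per-host NIC error counters plus
  # host-to-host latency, to suggest which cable/port to pull first.
  hosts="ceph1 ceph2 ceph3 ceph4 ceph5"   # placeholder host names
  iface=eth0                              # placeholder public-network NIC

  for h in $hosts; do
      echo "===== $h ====="
      # A slowly dying port usually shows up as climbing error/drop counters.
      ssh "$h" "ethtool -S $iface | grep -Ei 'err|drop|crc' | grep -v ': 0$'"
      ssh "$h" "ip -s link show $iface | tail -n 4"
      # Latency and loss from here; a sick segment tends to show up as slow
      # or lossy pings between otherwise idle hosts.
      ping -c 5 -i 0.2 -q "$h" | tail -n 2
  done

Of course the offending port in my case belonged to the access point, not
a Ceph host, so the really direct answer is probably the managed switch's
own per-port error counters (web UI or SNMP), which a host-side sweep like
this never sees.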
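And for question 2, the shape of what I'm imagining is below: sample the
byte counters the kernel already keeps under /sys/class/net, maintain a
crude running average, and complain when the current rate runs several
times past it. The interface name, interval, factor and warm-up count are
all guesses on my part:

  #!/usr/bin/env bash
  # Crude "usual and customary" traffic alarm for one interface: sample the
  # kernel's byte counters, keep an integer exponential moving average, and
  # warn when the current rate runs well past that baseline.
  iface=eth0        # placeholder: the port you want to watch
  interval=10       # seconds between samples
  factor=4          # warn when rate > factor * baseline
  warmup=30         # samples to collect before trusting the baseline
  avg=0 samples=0

  read_bytes() {
      local rx tx
      rx=$(cat /sys/class/net/"$iface"/statistics/rx_bytes)
      tx=$(cat /sys/class/net/"$iface"/statistics/tx_bytes)
      echo $(( rx + tx ))
  }

  prev=$(read_bytes)
  while sleep "$interval"; do
      cur=$(read_bytes)
      rate=$(( (cur - prev) / interval ))
      prev=$cur
      if [ "$samples" -ge "$warmup" ] && [ "$rate" -gt $(( avg * factor )) ]; then
          echo "$(date '+%F %T') $iface: ${rate} B/s vs baseline ${avg} B/s" >&2
      fi
      avg=$(( avg + (rate - avg) / 8 ))   # EMA with weight 1/8
      samples=$(( samples + 1 ))
  done

In real life I suspect the grown-up answer is something like node_exporter
on the hosts plus SNMP polling of the switch's per-port counters, with
alert rules over those series; the loop above is just the idea in a
screenful of shell.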
Thanks in advance; I bet a good answer will help many people.
Harry Coin