All of the other things I'm already watching would surface a link speed failure anyway. In the two cases of network shenanigans I've had that effectively broke ceph, the link speed was reported correctly the whole time, which leads me to distrust link speed as a reliable source of truth. It's also testing a proxy for what you actually care about rather than the thing itself. Put another way, I don't care what speed the link negotiates, I care how fast it actually moves packets.
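To make that concrete, here's the kind of check I mean. This is a rough sketch, not production code; the interface name and the 10-second sample window are assumptions you'd adjust for your own boxes. It compares the negotiated speed the kernel reports with how fast the link actually moved bytes:

    #!/usr/bin/env python
    # Compare what the link *says* it negotiated with what it actually moves.
    import time

    IFACE = "eth0"  # hypothetical interface name, adjust as needed

    def negotiated_mbps(iface):
        # Kernel reports negotiated speed in Mbit/s (same number ethtool shows).
        with open("/sys/class/net/%s/speed" % iface) as f:
            return int(f.read().strip())

    def rx_tx_bytes(iface):
        # Pull cumulative byte counters out of /proc/net/dev.
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    fields = line.split(":", 1)[1].split()
                    return int(fields[0]), int(fields[8])  # rx_bytes, tx_bytes
        raise ValueError("interface %s not found" % iface)

    rx1, tx1 = rx_tx_bytes(IFACE)
    time.sleep(10)
    rx2, tx2 = rx_tx_bytes(IFACE)
    actual_mbps = (rx2 - rx1 + tx2 - tx1) * 8 / 10.0 / 1e6

    print("negotiated: %d Mb/s  observed: %.1f Mb/s"
          % (negotiated_mbps(IFACE), actual_mbps))

A flaky cable or a bad switch port will show up in the observed number (or in the error counters) long before it shows up in the negotiated speed.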
Link usage seems like a much more interesting metric to add, but I would be concerned about generating a lot of false positives. If my network is usually 20% utilized but spikes to 100% for a while because of some legitimate activity, I don't want an alarm for that. Maybe a rule with a very long normalization period, so it only alarms if the link stays pegged for multiple hours, would help. But in that case I would expect other problems to already be evident from whatever pathological state the network is in. Again, I don't care whether my network is fully utilized, I care whether it's utilized to the point that it causes IO wait in VMs, and looking at utilization alone won't tell me that. Still, it could be a good canary for anomaly detection if your normal state leaves a lot of headroom. I'm torn on this one.
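Something along these lines is what I have in mind for the "long normalization period" idea, only alarm when the link has been pegged for hours rather than for a brief legit burst. The threshold, window, and sampling interval here are made-up numbers, and utilization() is a placeholder:

    # Sketch: alarm only on sustained saturation, not short spikes.
    import time
    from collections import deque

    THRESHOLD = 0.80           # "pegged" means >80% utilized
    WINDOW = 2 * 60 * 60       # must stay pegged for 2 hours
    INTERVAL = 60              # sample once a minute

    samples = deque(maxlen=WINDOW // INTERVAL)

    def utilization():
        # Placeholder: in reality compute this from /proc/net/dev byte
        # counters (as in the earlier sketch) or pull it from ganglia/SNMP.
        return 0.0

    while True:
        samples.append(utilization())
        if len(samples) == samples.maxlen and min(samples) > THRESHOLD:
            print("ALERT: link >%d%% utilized for %d+ hours"
                  % (THRESHOLD * 100, WINDOW // 3600))
        time.sleep(INTERVAL)

Requiring every sample in the window to be over the threshold is what keeps a one-off backfill or scrub burst from paging anyone.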
Also, while we're on the subject, if anyone isn't doing any kind of metric collection on their ceph networks, I highly recommend installing ganglia. It's dead simple to get going and creates all sorts of useful system-level graphs that help in locating and identifying trends and problems.
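Beyond the out-of-the-box system graphs, it's also easy to push your own numbers in. Purely as a hypothetical example (the metric name and the hard-coded value are made up, you'd really pull it from something like "ceph osd stat"), shelling out to the gmetric CLI looks roughly like this:

    # Hypothetical: push a custom "OSDs up" count into ganglia via gmetric.
    import subprocess

    osds_up = 24  # made-up value; in practice parse it from ceph itself

    subprocess.check_call([
        "gmetric",
        "--name", "ceph_osds_up",
        "--value", str(osds_up),
        "--type", "uint32",
        "--units", "OSDs",
    ])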
QH
On Mon, Aug 3, 2015 at 9:45 AM, Antonio Messina <antonio.messina@xxxxxx> wrote:
On Mon, Aug 3, 2015 at 5:10 PM, Quentin Hartman
<qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
> The problem with this kind of monitoring is that there are so many possible
> metrics to watch and so many possible ways to watch them. For myself, I'm
> working on implementing a couple of things:
> - Watching error counters on servers
> - Watching error counters on switches
> - Watching performance
I would also check:
- link speed (on both servers and switches)
- link usage (over 80% issue a warning)
.a.
--
antonio.messina@xxxxxx
S3IT: Services and Support for Science IT http://www.s3it.uzh.ch/
University of Zurich Y12 F 84
Winterthurerstrasse 190
CH-8057 Zurich Switzerland