All of the other things I'm already watching would surface a link speed failure anyway. In the two cases of network shenanigans I've had that effectively broke ceph, the link speed was reported correctly the whole time, which leads me to distrust link speed as a reliable source of truth. It's also testing a proxy for what you actually care about rather than the thing itself. Put another way, I don't care what speed the link negotiates, I care how fast it actually moves packets.
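To make that concrete, here's the kind of check I mean. This is a rough sketch, not production code; the interface name and the 10-second sample window are assumptions you'd adjust for your own boxes. It compares the negotiated speed the kernel reports with how fast the link actually moved bytes:

    #!/usr/bin/env python
    # Compare what the link *says* it negotiated with what it actually moves.
    import time

    IFACE = "eth0"  # hypothetical interface name, adjust as needed

    def negotiated_mbps(iface):
        # Kernel reports negotiated speed in Mbit/s (same number ethtool shows).
        with open("/sys/class/net/%s/speed" % iface) as f:
            return int(f.read().strip())

    def rx_tx_bytes(iface):
        # Pull cumulative byte counters out of /proc/net/dev.
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    fields = line.split(":", 1)[1].split()
                    return int(fields[0]), int(fields[8])  # rx_bytes, tx_bytes
        raise ValueError("interface %s not found" % iface)

    rx1, tx1 = rx_tx_bytes(IFACE)
    time.sleep(10)
    rx2, tx2 = rx_tx_bytes(IFACE)
    actual_mbps = (rx2 - rx1 + tx2 - tx1) * 8 / 10.0 / 1e6

    print("negotiated: %d Mb/s  observed: %.1f Mb/s"
          % (negotiated_mbps(IFACE), actual_mbps))

A flaky cable or a bad switch port will show up in the observed number (or in the error counters) long before it shows up in the negotiated speed.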
Link usage seems like a much more interesting metric to add, but I would be concerned about generating a lot of false positives. If my network is usually 20% utilized but spikes to 100% for a while because of some legitimate activity, I don't want an alarm for that. Maybe a rule with a very long normalization period, so it only alarms if the link stays pegged for multiple hours, would help. But in that case I would expect other problems to already be evident from whatever pathological state the network is in. Again, I don't care whether my network is fully utilized, I care whether it's utilized to the point that it causes IO wait in VMs, and looking at utilization alone won't tell me that. Still, it could be a good canary for anomaly detection if your normal state leaves a lot of headroom. I'm torn on this one.
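Something along these lines is what I have in mind for the "long normalization period" idea, only alarm when the link has been pegged for hours rather than for a brief legit burst. The threshold, window, and sampling interval here are made-up numbers, and utilization() is a placeholder:

    # Sketch: alarm only on sustained saturation, not short spikes.
    import time
    from collections import deque

    THRESHOLD = 0.80           # "pegged" means >80% utilized
    WINDOW = 2 * 60 * 60       # must stay pegged for 2 hours
    INTERVAL = 60              # sample once a minute

    samples = deque(maxlen=WINDOW // INTERVAL)

    def utilization():
        # Placeholder: in reality compute this from /proc/net/dev byte
        # counters (as in the earlier sketch) or pull it from ganglia/SNMP.
        return 0.0

    while True:
        samples.append(utilization())
        if len(samples) == samples.maxlen and min(samples) > THRESHOLD:
            print("ALERT: link >%d%% utilized for %d+ hours"
                  % (THRESHOLD * 100, WINDOW // 3600))
        time.sleep(INTERVAL)

Requiring every sample in the window to be over the threshold is what keeps a one-off backfill or scrub burst from paging anyone.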
Also, while we're on the subject, if anyone isn't doing any kind of metric collection on their ceph networks, I highly recommend installing ganglia. It's dead simple to get going and creates all sorts of useful system-level graphs that help in locating and identifying trends and problems.
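Beyond the out-of-the-box system graphs, it's also easy to push your own numbers in. Purely as a hypothetical example (the metric name and the hard-coded value are made up, you'd really pull it from something like "ceph osd stat"), shelling out to the gmetric CLI looks roughly like this:

    # Hypothetical: push a custom "OSDs up" count into ganglia via gmetric.
    import subprocess

    osds_up = 24  # made-up value; in practice parse it from ceph itself

    subprocess.check_call([
        "gmetric",
        "--name", "ceph_osds_up",
        "--value", str(osds_up),
        "--type", "uint32",
        "--units", "OSDs",
    ])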
QH
On Mon, Aug 3, 2015 at 9:45 AM, Antonio Messina <antonio.messina@xxxxxx> wrote:
On Mon, Aug 3, 2015 at 5:10 PM, Quentin Hartman
<qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
> The problem with this kind of monitoring is that there are so many possible
> metrics to watch and so many possible ways to watch them. For myself, I'm
> working on implementing a couple of things:
> - Watching error counters on servers
> - Watching error counters on switches
> - Watching performance
I would also check:
- link speed (on both servers and switches)
- link usage (over 80% issue a warning)
.a.
--
antonio.messina@xxxxxx
S3IT: Services and Support for Science IT http://www.s3it.uzh.ch/
University of Zurich Y12 F 84
Winterthurerstrasse 190
CH-8057 Zurich Switzerland