Re: Check networking first?

Everything else I would be watching would already reveal a link speed failure. In the two cases of network shenanigans I've had that effectively broke ceph, the link speed was always correct. That leads me to distrust link speed as a reliable source of truth. It also tests a proxy for what you actually care about, not the thing itself. Put another way, I don't care what speed the link negotiates, I care how fast it actually moves packets.
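
For what it's worth, here is a rough sketch of what I mean by watching what the link actually moves instead of what it negotiated. It assumes a Linux host and reads the byte counters from /proc/net/dev; the function names are my own, and it ignores counter wrap and resets. It only shows what the link is carrying right now, not what it could carry, so treat it as a starting point.

```python
import time

def parse_proc_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev contents."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and len(fields) >= 16:
            # field 0 is received bytes, field 8 is transmitted bytes
            counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters

def bits_per_second(bytes_before, bytes_after, seconds):
    """Average rate over the interval, ignoring counter wrap or reset."""
    return (bytes_after - bytes_before) * 8 / seconds

def sample_rates(interface, interval=10.0, path="/proc/net/dev"):
    """Sample the counters twice and return (rx_bps, tx_bps) for one interface."""
    with open(path) as f:
        rx0, tx0 = parse_proc_net_dev(f.read())[interface]
    time.sleep(interval)
    with open(path) as f:
        rx1, tx1 = parse_proc_net_dev(f.read())[interface]
    return (bits_per_second(rx0, rx1, interval),
            bits_per_second(tx0, tx1, interval))
```

Calling something like sample_rates("eth0") on each node (the interface name is just an example) gives numbers you can graph next to IO latency, which is closer to the thing I actually care about.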

Link usage seems like a much more interesting metric to add, but I would be concerned about generating a lot of false positives. If my network is usually 20% utilized, but then spikes to 100% for a while because of some legit activity, I don't want to get an alarm for that. Maybe a rule with a very long normalization period would work, so it only alarms if the link is pegged for multiple hours. Even then, I would expect other problems to be evident already, because a link pegged that long suggests some pathological state on the network. I don't care if my network is being fully utilized, I care if it's being utilized to the point that it's causing IO wait in VMs. Looking at utilization alone won't tell me that. Still, it could be a good canary for anomaly detection if your normal state leaves a lot of headroom. I'm torn on this one.
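
To make the "only alarm if it's pegged for a long time" idea concrete, here is a minimal sketch. The class name and the defaults are my own assumptions: the 0.8 threshold is the 80% figure suggested below, and a window of 120 samples means two hours if you sample once a minute. It fires only when every sample in a full window is at or above the threshold, so a short legit spike never trips it.

```python
from collections import deque

class SustainedUtilizationAlarm:
    """Fire only when utilization stays at or above a threshold for a full window."""

    def __init__(self, threshold=0.8, window=120):
        self.threshold = threshold
        # Keep only the most recent `window` samples; older ones fall off.
        self.samples = deque(maxlen=window)

    def observe(self, utilization):
        """Record one sample (0.0 to 1.0) and return True if the alarm should fire."""
        self.samples.append(utilization)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(u >= self.threshold for u in self.samples)
```

Utilization here would be the measured rate divided by the link's nominal capacity. A single sample below the threshold resets things in practice, since the alarm can't fire again until that sample has aged out of the window.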

Also, while we're on the subject, if anyone isn't doing any kind of metric collection on their ceph networks, I highly recommend installing Ganglia. It's dead simple to get going and creates all sorts of useful system-level graphs that are helpful in locating and identifying trends and problems.

QH

On Mon, Aug 3, 2015 at 9:45 AM, Antonio Messina <antonio.messina@xxxxxx> wrote:
On Mon, Aug 3, 2015 at 5:10 PM, Quentin Hartman
<qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
> The problem with this kind of monitoring is that there are so many possible
> metrics to watch and so many possible ways to watch them. For myself, I'm
> working on implementing a couple of things:
> - Watching error counters on servers
> - Watching error counters on switches
> - Watching performance

I would also check:

- link speed (on both servers and switches)
- link usage (over 80% issue a warning)

.a.

--
antonio.messina@xxxxxx
S3IT: Services and Support for Science IT        http://www.s3it.uzh.ch/
University of Zurich                             Y12 F 84
Winterthurerstrasse 190
CH-8057 Zurich Switzerland

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
