A number of years ago I wrote the collectl monitoring utility, which is currently in use on a fairly large community of HPC clusters, including many on the top500 list. I just wanted to let you all know that I've updated the collectl-utils package, which contains a utility called colmux that I think can revolutionize one's ability to see what's going on with any resource on clusters of almost any size; I've tested it on clusters of over 2000 nodes with great results.

Basically it's a collectl multiplexor. By that I mean it's a utility that starts a copy of collectl running on multiple systems, multiplexes the output back to a single point, sorts it by a specific column number, and displays the result in a continuously refreshing, top-like window. It can also multiplex historical data.

Since it supports almost anything collectl can monitor (which is substantial), you can quickly identify anything from a busy nfs client or server, a slow disk anywhere in the cluster, a network interface generating too many errors, a slow infiniband link, or a system doing too many interrupts, to almost anything else you can think of. It's even been used to find the systems running at the highest temperatures on a multi-thousand-node cluster during a top500 linpack run! I've successfully used it to find systems on a cluster leaking slab memory, and it only took seconds.

If you want to read more, see the collectl-utils project page on sourceforge. There's even a nifty photo of it running with an alternate output format, displaying the CPU load on 192 systems, all on the same line once a second, spanning three 30" displays side by side!

-mark
--
To unsubscribe from this list: send the line "unsubscribe linux-admin" in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html