A number of years ago I wrote the collectl monitoring utility, which is currently in use on a fairly large community of HPC clusters, including many on the top500 list. I just wanted to let you all know that I've updated the collectl-utils package, which contains a utility called colmux that I think can revolutionize one's ability to see what's going on with any resource on clusters of almost any size; I've tested it on clusters of over 2000 nodes with great results.

Basically it's a collectl multiplexor. By that I mean it's a utility that starts a copy of collectl running on multiple systems, multiplexes the output back to a single point, sorts it by a specific column number, and displays the result in a continuously refreshing, top-like window. It can also multiplex historical data.

Since it supports almost anything collectl can monitor (which is substantial), you can quickly identify anything from a busy nfs client or server, a slow disk anywhere in the cluster, a network interface generating too many errors, a slow infiniband link, or a system doing too many interrupts, to almost anything else you can think of. It's even been used to find the systems running at the highest temperatures on a multi-thousand-node cluster during a top500 linpack run! I've successfully used it to find systems on a cluster leaking slab memory, and it only took seconds.

If you want to read more, see the collectl-utils project page on sourceforge. There's even a nifty photo of it running with an alternate output format, displaying the CPU load on 192 systems, all on the same line once a second, spanning three 30" displays side by side!

-mark
--
To unsubscribe from this list: send the line "unsubscribe linux-admin" in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html