Re: Thanks to every one

Jonathan Billings <billings@xxxxxxxxxx> · Tue, 18 Jul 2017 09:01:07 -0400

On Sun, Jul 16, 2017 at 06:02:15PM +0100, Pete Biggs wrote:
> > 
> > The physicists and mathematicians who count there need high durations.
> 
> Yes. I too run HPC clusters and I have had uptimes of over 1000 days -
> clusters that are turned on when they are delivered and turned off when
> they are obsolete. It is crucial for long running calculations that you
> have a stable OS - you have never seen wrath like a computational
> scientist whose 200 day calculation has just failed because you needed
> to reboot the node it was running on.

I too was an HPC admin, and I knew people who believed the above, and
their clusters were compromised.  You're running a service where the
weakest link is the researchers who use your cluster -- they're able
to run code on your nodes, so local exploits are possible.  They often
have poor security practices (sharing passwords, reusing them across
multiple accounts).

Also, if your researchers can't write code that performs checkpoints,
they're going to be awfully unhappy when a bug in their code makes it
segfault 199 days into a 200 day run.
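
To make "performs checkpoints" concrete, here is a minimal Python sketch.
It is only an illustration: the function names, the JSON state format and
the toy "calculation" (summing integers) are all invented for this
example, and a real code would save its own arrays and solver state. The
idea is to write the state to disk every so often, atomically, and to
pick up from the last saved state on restart.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the data is on disk before the rename
        os.replace(tmp, path)  # atomic when tmp and path are on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(path, n_steps, every=100, stop_at=None):
    """Toy long-running job: sums 0..n_steps-1, checkpointing every `every` steps.

    `stop_at` is a hypothetical knob that simulates a crash at a given step.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if stop_at is not None and state["step"] == stop_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

With something like this in place, a crash or a node reboot costs at most
`every` steps of work rather than the whole run: calling `run()` again
with the same checkpoint path resumes from the last saved step.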

Scheduled downtime and rolling cluster upgrades are a necessity of
HPC cluster administration.  I do wish that the ksplice/kpatch stuff
was available in CentOS.

-- 
Jonathan Billings <billings@xxxxxxxxxx>
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
https://lists.centos.org/mailman/listinfo/centos