Re: shutdown work (plan)


 



All,

just a reminder and update on the planned work for this shutdown (with the obvious beamline typo fixed). The main new detail is that, IIRC, i02 will want us to power off everything before Saturday (i.e. some time on Friday), and it will remain powered off for a while (a week or so).

IMHO it would be good if we can start considering the merge to UserMode tomorrow to make sure we are ready to check it out on Friday.

The Lustre I/O errors will hopefully go away with the Lustre upgrade on the servers, so there is no need to try to reproduce them until after the Lustre maintenance.

Lustre and GPFS client and kernel modules have been built, and cfengine has been updated with the correct versions (I hope).

Rsyncing /dls/i18 to Lustre might have to be delayed slightly as lustre03 is currently ~88% full and I'd like to clear some/most of this first...
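
A minimal sketch of the kind of check this implies before kicking off the rsync, assuming the target file system is mounted locally. The 80% threshold and the helper names are illustrative assumptions, not part of the plan, and the path to check has to be supplied by whoever runs it.

```python
import shutil
import sys


def percent_used(path):
    """Return how full the file system holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def safe_to_rsync(path, threshold=80.0, usage_fn=percent_used):
    """True if the file system is below `threshold` percent full.

    `threshold` is an assumed cut-off; pick whatever headroom the
    rsync of /dls/i18 actually needs.
    """
    return usage_fn(path) < threshold


if __name__ == "__main__":
    # The mount point is passed on the command line; "." is only a fallback.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    used = percent_used(target)
    verdict = "ok to start rsync" if safe_to_rsync(target) else "clear space first"
    print(f"{target}: {used:.1f}% used, {verdict}")
```

With lustre03 at roughly 88%, a check like this would report that space should be cleared first.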

Richard, do you have more details on when the various switch upgrades etc. have been scheduled?

Cheers,
Frederik

On 04/03/15 14:47, Frederik Ferner wrote:
All,

as the shutdown is only really 2.5 weeks (+Easter weekend) and we're
relatively short of people (+ HEPIX conference...) and I've been left in
charge of planning the shutdown, I thought I'd suggest a draft plan for
your consideration. As usual we'll have tomorrow's meeting to discuss
details...

Richard, this doesn't yet include anything you may have planned, I think I
lost track...

Basic dates:
first day of shutdown: Friday March 13th.
First machine startup day: Tuesday April 7th
beamline startup: Thursday April 9th.

Bank Holidays: April 3rd, April 6th.
HEPIX (aka Greg, Tina and me away in Oxford, available in emergency):
March 23rd-27th
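
To get a feel for how much time these dates leave, here is a rough sketch that counts weekdays between the first day of shutdown and the first machine startup day. The year 2015 is an assumption taken from the date of this email, and the function name is made up for illustration.

```python
from datetime import date, timedelta

# Dates from the list above; the year 2015 is assumed from the email date.
SHUTDOWN_START = date(2015, 3, 13)
MACHINE_STARTUP = date(2015, 4, 7)
BANK_HOLIDAYS = {date(2015, 4, 3), date(2015, 4, 6)}
HEPIX_WEEK = {date(2015, 3, 23) + timedelta(days=i) for i in range(5)}


def working_days(start, end, excluded=frozenset()):
    """List weekdays in [start, end) that are not in `excluded`."""
    days = []
    day = start
    while day < end:
        if day.weekday() < 5 and day not in excluded:
            days.append(day)
        day += timedelta(days=1)
    return days


if __name__ == "__main__":
    all_days = working_days(SHUTDOWN_START, MACHINE_STARTUP, BANK_HOLIDAYS)
    full_team = working_days(
        SHUTDOWN_START, MACHINE_STARTUP, BANK_HOLIDAYS | HEPIX_WEEK
    )
    print(f"working days before machine startup: {len(all_days)}")
    print(f"of which outside the HEPIX week: {len(full_team)}")
```

The HEPIX week is counted separately because some people are still around then, just fewer of them.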

Beamline updates have been scheduled for 17th+18th March
(Tuesday+Wednesday).

I'm hoping that we'll manage to allocate (and setup) any requested IP
addresses on the primary network early in the shutdown...

So, initial plan:

Friday 13th:
* allocate IP addresses on Primary network (various FPs)
* merge cfengine trunk to usermode early in the morning (and check out)
* update stable repositories in the afternoon
* start rsync for i18 to lustre

Monday 16th:
* generate final list of beamline machines to update
* allocate who starts on which beamline
* verify that usermode checkout has worked, check package installs
and cfengine on selected central servers, at least one Lustre
client/GPFS client, maybe even one or two cluster nodes
* switch /dls/i18 to lustre (needs to be arranged with beamline)

Tuesday 17th and Wednesday 18th:
* beamline updates

Thursday 19th:
* GPFS and Lustre at risk periods, server upgrades etc,[1]
* start rolling upgrade of clusters?
* central servers...

Friday 20th:
* mop up, general stuff, I'm sure there'll be loads...
* more central servers...

Week 23rd-27th:
* primary archiver RAM+localhome expansion
* ws in CIA23
* investigate/fix b18-ws* powersave issue
* DMZ work (mount production FS if they aren't already...)
* install beamline servers[2]
* sr06i-di-serv-01 re-install
* beamline workstations installs/replacements: I'm sure there are some

Week 30th-2nd:
* anything that's left or has come in during the shutdown...


Additional stuff which could be started early:
* attempt to reproduce cluster node slowdown (Intel want more data...)
* attempt to reproduce NFS I/O errors on Lustre (again: Intel want more
data)
* archiving
* data purging

Preparation that still needs to happen:
* compile Lustre client modules for target kernel
* compile GPFS client modules for target kernel
* update cfengine with kernel version to be installed
* anything I've forgotten...

[1] or do people feel this should be done before we forcefully reboot
beamline machines?
[2] if anyone has a list of all beamline servers we promised to
install/replace, let me know, otherwise I'll try to generate the list
before tomorrow's meeting



--
Frederik Ferner
Senior Computer Systems Administrator   phone: +44 1235 77 8624
Diamond Light Source Ltd.               mob:   +44 7917 08 5110
(Apologies in advance for the lines below. Some bits are a legal
requirement and I have no control over them.)

--
This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail.
Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message.
Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster



