Greetings.

I think we have some lessons learned and things we could improve based
on the issues we ran into yesterday on the mass reboot/updates. ;)

Issues/Observations:

* We can't seem to complete everything in a 2 hour window. We should
  block out more time, and/or have things more organized.

* The build system /mnt/koji issues were due to a guest that was moved
  from one machine to a new one, and then somehow started on both after
  reboots. ;(

* There were a few cases of too many cooks doing things at once to a
  machine.

* Some physical machines were poorly labeled, or not labeled at all, in
  things like PDUs and serial consoles.

* We need to be better about retiring machines. Sometimes it's hard to
  see what should or shouldn't be up.

Ideas/improvements:

* I'd like to look at splitting all our hosts into 3 groups (based on
  who we need to notify about reboots or outages):

  a) End users will see/notice an outage if this machine is down/not
     working.

  b) Fedora package maintainers or contributors will notice if this
     machine/service is down/not working.

  c) Everything else. This includes things that would fit in the above
     if they were single instances, but are spread out, so they can be
     rebooted/updated one at a time (i.e., app servers, etc).

  I've made a tentative list with all our hosts in these groups:
  ~kevin/mass-reboot-list on puppet01. Please look and see if you see
  anything that looks wrong or needs adjusting.

  With this split out, we can do any machines in 'c' as we like as long
  as we are careful, we can do 'b' machines if we announce to
  devel-announce and schedule a window, and 'a' machines if we announce
  to the main fedora announce list and schedule a window. All the
  windows should be shorter than what we saw yesterday.

* We might look at having an updates meister (czar? :) who would be the
  only one allowed to touch machines in a read/write way. By default
  everyone else is hands off unless the updates meister asks them to
  work on something. This would allow us to not interfere with each
  other or duplicate effort.

* Seth is working on tooling to tell us anytime we have a virtual
  machine that's set to start on boot but not started now, or not set
  to start on boot but started now.

* We need to go and label things in all the PDUs etc. I can look at
  doing that and writing up a file somewhere with all the places a
  particular machine is. Then, the ones we can't find, we will fill in
  when smooge and I are out at phx2.

* I have started a SOP for retiring machines. It needs a lot of work:
  https://fedoraproject.org/wiki/Infrastructure_retire_machine_SOP
  Please modify and clean up. The goal should be making it very clear
  when a machine has been retired so we don't confuse it with anything
  active.

There is a new rhel5 kernel out (yes, right after we applied the last
one yesterday), so I would suggest we look at implementing some or all
of these that make sense for those updates. ;)

Thoughts? Rants? More suggestions?

kevin
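
For the virtual machine check mentioned above (set to start on boot but
not started now, or started now but not set to start on boot), here is a
minimal sketch of just the comparison step. This is not Seth's tooling;
the Guest record, the function name, and the guest names are all made up
for illustration, and how the autostart/running state is collected from
each virt host is left out entirely.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Guest(NamedTuple):
    """Hypothetical record for one virtual machine on a host."""
    name: str
    autostart: bool  # set to start when the host boots
    running: bool    # started right now


def find_autostart_mismatches(
    guests: Iterable[Guest],
) -> Tuple[List[str], List[str]]:
    """Return two sorted lists of guest names.

    First list: set to start on boot, but not started now.
    Second list: started now, but not set to start on boot.
    """
    guests = list(guests)
    autostart_not_running = sorted(
        g.name for g in guests if g.autostart and not g.running
    )
    running_not_autostart = sorted(
        g.name for g in guests if g.running and not g.autostart
    )
    return autostart_not_running, running_not_autostart


# Illustrative data only; these guest names are invented.
sample = [
    Guest("guest-a", autostart=True, running=True),
    Guest("guest-b", autostart=True, running=False),
    Guest("guest-c", autostart=False, running=True),
    Guest("guest-d", autostart=False, running=False),
]

down, unexpected = find_autostart_mismatches(sample)
print("set to start on boot, but not started:", down)
print("started, but not set to start on boot:", unexpected)
```

A guest that was moved to a new machine but left set to start on boot
on the old one would show up in the first list on the old host, as long
as it is not actually running there, so the check would need to run
before the reboot to be useful.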