Hi Everyone,

I took some notes while we were rebooting boxes, and I wanted to share them with everyone for future outages.

Ordering of the bounces:

1. xen14: puppet is on there, and if that is back up first we have a place to stand for pushing out any changes (dns changes, for example, via puppet).
   - xen14 takes about 4 minutes to restart/POST.

2. xen15: bastion01 and db02 are on there - same 4 minute restart window.
   - once this is up you'll want to log out of bastion02 and into bastion01, so you have a firm place to do the xen05 reboot from, which will take out bastion02.

3. edit dns on puppet to remove proxy01 from the wildcard/roundrobin, push that to the ns* servers and verify (see the verify sketch at the bottom of this mail).

4. xen05: bastion02 (openvpn), proxy01 - 4-5 minutes for this machine to restart.
   - once xen05 is completely up, log back in and verify the vpn is back online.

5. edit dns: remove all the other proxy hosts and put proxy01 back in. Push and verify.

6. virthost01: I had to halt each of the kvms from a login - virsh shutdown didn't work.
   - 4 minute restart time on the hw.
   - Note: make sure virthost01 is completely up - especially fas03. Since taking down virthost02 next will take out fas02, you want to make sure you don't leave fas01 all by itself.

7. virthost02: fas02 was not set up to autostart - that's now fixed.

8. virthost13: uneventful.

9. xen03: spin01 spewed lots of umount issues - those are from the spin creation paths and can be safely ignored.
   - fas01.stg was running on xen03 according to the logs, but there's no definition for it on the system - so not sure what the story is there.
   - neither of the other staging hosts was set to autostart.

10. xen04: we apparently have a number of hosts w/ only one internal dns record, and they point to ns03 only - so when ns03 went away, lots of things got VERY VERY SLOW trying to resolve names. This is on my list to address.
    - You must wait for xen04 to be completely up and ns03 running before you can take down xen07. Otherwise we'll be w/o dns internally to phx.

11. xen07: iscsi disks didn't come up right away, which kept ns04 from coming up immediately - needed to run /etc/init.d/iscsi start and they showed up.

12. xen09: uneventful.

13. xen10: log01 needed an fsck b/c of the time since last mount - this took a long time.

14. xen11: secondary1 needed an fsck. Also a 5-6 minute hw reboot time.

15. xen12: the db1->db01 naming change kept it from coming up at boot b/c of the 'auto' symlink to db1. db01 had to fsck.

16. cnode01: 6-7 minute reboot time - nothing was set to autostart in xen. This is now fixed - autoqa01 and dhcp02 are set to autostart.

17. db03: fsck took FOREVER to complete, and this takes a lot of things down. For the future, move the db03 reboot higher up the stack, just in case. This machine's restart/POST time is REALLY high - like 7-10 minutes. The console for it is less than forthcoming, too.

18. backup01: uneventful.

At this point internal was back online - except for the build xen systems and servers.

External hosts:

19. - bodhost01: 5-6 minute machine reboot time.
    - people01: uneventful.
    - ibiblio01: 5-7 minute machine reboot time, otherwise uneventful.

20. - internetx01: uneventful.
    - osuosl01: uneventful.

21. - sb2: must wait for ibiblio01 to be up, b/c of not having any external name servers.
    - sb3: uneventful.
    - sb4: hosted1 listed more 'maxmem' in its config than sb4 had available, so that had to be edited down. Not sure how that EVER started.
    - sb5: uneventful.

22. telia01:
    - proxy5 did not restart on its own - unknown as to WHY yet, but it did start manually.
    - retrace01 was not set to autostart.

    tummy1: uneventful.

Now all the proxy* rebooting is over, so we can:

23. edit dns: put the other proxy hosts back in the wildcard/RR - push and verify.

Build boxes:

- bxen03 had koji2 listed in its set of hosts, but it wasn't running. This led to some confusion about how to start the hosts on bxen03 b/c of insufficient memory for all guests. Eventually I realized bxen04 is where koji02 was running and that the leftover guest file was never cleaned up on bxen03.

Things to think about post-outage:

- check all the raid arrays for lost disks - we saw this a couple of times and it's not pleasant.
- check for downed vpns and/or broken resolution - we need to get a firm handle on why this is a hassle so often.

Overall things to think about for the future:

1. dumping a complete virsh list - including how much memory is actually being used per vm per server - before we start reboots (rough sketch of what I mean at the bottom of this mail).
2. checking what disks need fscks because of mounted time, and doing those earlier or separately.
3. verifying that all running vms:
   a. are intended to be running
   b. have a config file
   c. are set to autostart
4. verifying that all NOT running vms:
   a. are intended to be off
   b. are NOT set to autostart

thoughts welcome.
-sv
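
P.S. Two sketches of what I mean above - both untested, so treat them as starting points rather than finished scripts.

For the 'push and verify' dns steps (3, 5 and 23), something like this is one way to eyeball the round robin from the shell. The name server, record name and proxy address below are only examples - substitute whichever ns* host you pushed to and whatever record the proxies actually sit behind:

  #!/bin/bash
  # Sketch: check whether a proxy's address is (or isn't) in the
  # wildcard/round-robin after a dns push. NS and NAME are examples,
  # not our canonical values.
  NS=ns01.fedoraproject.org
  NAME=fedoraproject.org
  PROXY_IP=$1          # the address you expect to be gone (or back)

  if [ -z "$PROXY_IP" ]; then
      echo "usage: $0 <proxy ip>" >&2
      exit 1
  fi

  echo "A records for $NAME as served by $NS:"
  dig +short A "$NAME" @"$NS" | sort

  if dig +short A "$NAME" @"$NS" | grep -qxF "$PROXY_IP"; then
      echo "$PROXY_IP IS in the round robin"
  else
      echo "$PROXY_IP is NOT in the round robin"
  fi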
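
And for the pre-reboot dump (#1 and #2 in the list above, plus the autostart half of #3/#4), something along these lines run on each virthost/xen box before the outage. It assumes libvirt's virsh and e2fsprogs' tune2fs are on the box, and the output path is just an example:

  #!/bin/bash
  # Sketch: dump every defined domain, its memory use and autostart flag,
  # plus the fsck bookkeeping for mounted ext filesystems, so we know what
  # to expect before the box is bounced.
  OUT=/root/pre-reboot-$(hostname -s).txt

  {
    echo "== all defined domains =="
    virsh list --all

    echo
    echo "== per-domain state, memory and autostart =="
    # skip the two header lines of 'virsh list --all' and take the name column
    for dom in $(virsh list --all | awk 'NR > 2 && NF {print $2}'); do
        echo "--- $dom ---"
        virsh dominfo "$dom" | egrep 'State|Max memory|Used memory|Autostart'
    done

    echo
    echo "== mounted ext filesystems: mount count and last fsck =="
    for dev in $(awk '$3 ~ /^ext[234]$/ {print $1}' /proc/mounts | sort -u); do
        echo "--- $dev ---"
        tune2fs -l "$dev" 2>/dev/null | \
            egrep 'Mount count|Last checked|Check interval'
    done
  } > "$OUT"

  echo "wrote $OUT"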