I mentioned this in the last meeting, and wanted to bring it up here
as well.

We have made a number of improvements in our ansible vm deployment and
host setup in the last few months. This includes (but is not limited
to):

* Guests now get a watchdog device that will reboot them if they go
  unresponsive. This would have come in handy a few times in the past
  for overloaded machines that we had to notice and power cycle
  manually.

* Guests are deployed with maxmem and maxvcpus set to 5x their
  currently configured allotment. This means we should be able to hot
  add memory and cpus without rebooting machines. This would have come
  in handy a few times in the past to avoid downtime.

* Hosts are now using qemu-kvm-rhev, which has some more features than
  the base qemu-kvm. This should allow us to live migrate guests that
  don't have shared storage. This will sometimes be handy.

* Guest ip config is now in ansible, so everything should be the same
  on all of them config wise, and they should come up with the correct
  config on boot. This will save me time having to update resolv.conf
  files after reboots or the like. It also (with some caveats) will
  allow us to change the ips of guests in ansible and have it make the
  change for us.

* I've tweaked our kickstarts and process a bit to make it easier to
  bring up virthost machines (passing them the right stuff to use
  bridges from the start, etc).

* Since it's been a while since we deployed some machines, we may have
  added dependencies or broken things so they don't deploy. I would
  like to fix these things. For example, when reinstalling proxy12
  yesterday I ran into the fact that it needs the hostkey copied to it
  or httpd won't start, and it needs docs synced before it can be
  added into service. We should fix those things in the playbooks so
  we don't mess them up.

So, with all those changes, I would like to look at reinstalling
everything. :)

Some machines should be trivial to do (where we have an 01 and 02, we
can do 01, then 02, with no downtime).
Some of them can just be done when internal users aren't using them
(things like the rawhide composer when it's not composing, or the
backup server when it's not backing up, etc). A few will require
either downtime or making a new instance, copying everything to it and
switching over. A few will definitely require downtime (the database
servers).

I've already started a bit with the ibiblio hosts (since we have to
move things around there due to new drives anyhow). Probably we should
get all the easy ones done quickly, and then look at and schedule the
ones that will require downtime or have other special needs.

Ideally, I would like to have this all done before the F24 Alpha
freeze.

Thoughts? Comments? Rotten fruits?

kevin