Spring cleaning

Kevin Fenzi <kevin@xxxxxxxxx> · Tue, 9 Feb 2016 12:48:23 -0700

I mentioned this in the last meeting, and wanted to bring it up here as
well. 

We have made a number of improvements in our ansible vm deployment and
host setup in the last few months. This includes (but is not limited
to): 

* guests now get a watchdog device that will reboot them if they go
  unresponsive. This would have come in handy a few times in the past
  for overloaded machines that we had to notice and power cycle
  manually. 

* guests are deployed with maxmem and maxvcpus set to 5x their
  currently configured allotment. This means we should be able to hot
  add memory and cpus without rebooting machines. This would have come
  in handy a few times in the past to avoid downtime.

* Hosts are now using qemu-kvm-rhev, which has more features than
  the base qemu-kvm. This should allow us to live migrate guests that
  don't have shared storage. This will sometimes be handy.

* guest ip config is now in ansible, so everything should be the same
  on all of them config wise and they should come up with the correct
  config on boot. This will save me time having to update resolv.conf
  files after reboots or the like. It also (with some caveats) will
  allow us to change ips of guests in ansible and have it make the
  change for us.

* I've tweaked our kickstarts and process a bit to bring up virthost
  machines more easily (passing them the right stuff to use bridges
  from the start, etc).

* Since it's been a while since we deployed some machines, we may have
  added dependencies or broken things so they don't deploy. I would
  like to fix these things. For example, when reinstalling proxy12
  yesterday I ran into the fact that it needs the hostkey copied to it
  or httpd won't start, and it needs docs synced before it can be
  added into service. We should fix those things in the playbooks so
  we don't mess them up.
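
As a rough illustration of the 5x headroom mentioned in the list
above, here is a minimal sketch of the arithmetic. The function name
and dict keys are made up for this example and are not what our
ansible roles actually use.

```python
# Hypothetical helper: the names and the dict layout are illustrative only.
HEADROOM_FACTOR = 5  # the "5x" from the list above


def hotplug_ceilings(mem_mb, vcpus, factor=HEADROOM_FACTOR):
    """Return the current allotment plus the hot-add ceilings for a guest."""
    if mem_mb <= 0 or vcpus <= 0:
        raise ValueError("mem_mb and vcpus must be positive")
    return {
        "mem_mb": mem_mb,                 # memory the guest boots with
        "vcpus": vcpus,                   # cpus the guest boots with
        "max_mem_mb": mem_mb * factor,    # ceiling for hot-added memory
        "max_vcpus": vcpus * factor,      # ceiling for hot-added cpus
    }


if __name__ == "__main__":
    print(hotplug_ceilings(4096, 2))
```

So a guest that boots with 4096 MB and 2 cpus could, under this
assumption, be grown to 20480 MB and 10 cpus without a reboot.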

So, with all those changes, I would like to look at reinstalling
everything. :) 

Some machines should be trivial to do (where we have an 01 and 02, we
can do 01, then 02 with no downtime). Some can just be done when
internal users aren't using them (things like the rawhide composer
when it's not composing, or the backup server when it's not backing
up, etc). A few will require downtime or making a new instance and
copying everything to it and switching over. A few will definitely
require downtime (the database servers).
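
To make the 01/02 idea concrete, here is a minimal sketch of splitting
hosts into reinstall waves so that siblings never go down together.
The helper and the hostnames in the usage note are hypothetical, and
it assumes that a shared name with a different numeric suffix means
the hosts are redundant, which would need checking per service.

```python
import re
from collections import defaultdict


def reinstall_waves(hosts):
    """Group hosts by base name and spread siblings across waves.

    Returns (waves, singletons): waves is a list of host lists where no
    two hosts in one wave share a base name; singletons are hosts with
    no sibling, which need to be scheduled by hand.
    """
    groups = defaultdict(list)
    for host in hosts:
        # Split "app01" into base "app" and numeric suffix "01".
        match = re.fullmatch(r"(.*?)(\d+)", host)
        base = match.group(1) if match else host
        groups[base].append(host)

    waves = []
    singletons = []
    for members in groups.values():
        if len(members) == 1:
            singletons.append(members[0])  # nothing to take over for it
            continue
        # The Nth sibling of every group goes into wave N.
        for index, host in enumerate(sorted(members)):
            while len(waves) <= index:
                waves.append([])
            waves[index].append(host)
    return waves, singletons
```

With made-up hosts app01, app02, web01, web02 and db01, this puts
app01 and web01 in the first wave, app02 and web02 in the second, and
leaves db01 as a singleton to schedule separately.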

I've already started a bit with ibiblio hosts (since we have to move
things around there due to new drives anyhow). Probably we should get
all the easy ones done quickly and then look and see and schedule the
ones that will require downtime or other special needs. Ideally, I
would like to have this all done before the F24 Alpha freeze. 

Thoughts? comments? rotten fruits?

kevin