Re: Proposal: Ansible Spring Cleaning: Update/upgrade scripts

Stephen John Smoogen <smooge@xxxxxxxxx> · Fri, 5 Apr 2019 13:34:51 -0400

On Fri, 5 Apr 2019 at 13:15, Kevin Fenzi <kevin@xxxxxxxxx> wrote:
>
> On 4/5/19 8:59 AM, Stephen John Smoogen wrote:
> > Clement brought up that spring cleaning of our ansible playbooks would
> > be a good idea. This was painfully obvious during our previous
> > update/reboot cycles where we have had services not updated or
> > restarted correctly so that systems did not come up well when we
> > rebooted.
> >
> > I have opened https://pagure.io/fedora-infrastructure/issue/7695 which
> > is the tracking ticket for this problem.  I am proposing that we do
> > major updates something like the following in the future. We can tweak
> > as we find better ways to do them in clusters later.
>
> Thanks for bringing this up smooge!
>
> I completely agree we should clean up/make sure the manual/upgrade/*
> playbooks are good and up to date.
>
> > If you maintain a service, take that playbook and add comments for the
> > following:
> > a. Who is the current maintainer
> > b. Date when that was last updated
> > c. Who tested the upgrade and when
> > d. General comments to explain what things are doing.
> >
> > If the playbook should be retired, removed, killed, etc please do so.
>
> I think we need things like maintainer for all our apps. ;)
>
> > My goal will be to make our update schedules something like this:
> >
> > Day 1:
> > a. Run update playbooks on staging instances.
> > b. Fix any problems shown by those.
> > c. Run general update vhost_update on staging instances
> > d. Reboot staging instances.
> > e. Fix problems found from this.
> >
> > Day 2:
> > a. Assess whether day 1 was a complete failure and, if so, stop the upgrade cycle
> > b. Run update playbooks on low priority systems
> > c. Fix any problems shown by those.
> > d. Run general update vhost_update on staging instances
> > e. Reboot staging instances.
> > f. Fix problems found from this.
> >
> > Day 3:
> > a. Assess whether day 2 was a complete failure and, if so, stop the upgrade cycle
> > b. Run update playbooks on high priority systems
> > c. Fix any problems shown by those.
> > d. Run general update vhost_update on staging instances
> > e. Reboot staging instances.
> > f. Fix problems found from this.
>
> Should all of those have staging? Or should it be staging then build
> then the rest? or staging, low pri, then high pri?

So I was leaving the definitions vague as some build boxes are high
priority (no redundancy or major outage) and some are low priority
because they have a large amount of redundancy. I was figuring:

low priority:
- proxies not shared on high priority external virthosts
- builders and other high redundancy build systems
- openshift systems IF updated and drained properly
- other external services which have low SLE

high priority:
services with no redundancy:
- databases
- pagure
- src
- etc etc
services with high outage effects
- koji?
- etc etc
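
To make the split above a bit more concrete, here is a rough Python sketch
of how the tiers and the day-by-day order could be expressed. It is only an
illustration, not something that exists in our ansible repo: the Host
fields, the host names in the usage note, and the update_wave callback are
all made up for the example, and it assumes the order is staging, then low
priority, then high priority.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Declaration order is the order the waves run in (assumed, see above).
    STAGING = 1
    LOW = 2
    HIGH = 3


@dataclass(frozen=True)
class Host:
    name: str                   # hypothetical host name
    staging: bool = False       # staging instance
    redundant: bool = False     # other nodes can cover while this one reboots
    major_outage: bool = False  # an outage here is widely felt


def tier_for(host: Host) -> Tier:
    """Staging goes first; no redundancy or a big outage effect means high."""
    if host.staging:
        return Tier.STAGING
    if host.major_outage or not host.redundant:
        return Tier.HIGH
    return Tier.LOW


def plan_waves(hosts):
    """Group hosts into one wave per tier, in staging -> low -> high order."""
    waves = {tier: [] for tier in Tier}
    for host in hosts:
        waves[tier_for(host)].append(host.name)
    return [(tier, sorted(names)) for tier, names in waves.items()]


def run_cycle(waves, update_wave):
    """Run the waves in order and stop the cycle at the first failed wave.

    update_wave is a hypothetical callback that updates and reboots one
    wave and returns True if that wave went well enough to continue.
    """
    completed = []
    for tier, names in waves:
        if not update_wave(tier, names):
            break
        completed.append(tier)
    return completed
```

For example, plan_waves([Host("app01.stg", staging=True),
Host("builder01", redundant=True), Host("db01")]) would put app01.stg in
the staging wave, builder01 in the low priority wave and db01 in the high
priority wave. Where a given box really belongs is still a judgment call
for whoever maintains the service.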

> >
> > This should cut down the extra long hours and extended outages we have
> > needed to do in the last couple of reboot cycles.
>
> Well, so some background (not for smooge as he knows all this, but
> others reading):
>
> In the past the way we did mass/update reboots has changed a few times.
> The most recent incarnation has been doing staging on a friday or
> monday, then doing the 'build' machines on one day (basically anything
> on bvirthost) and then doing 'the rest' on the next day. Sometimes due
> to time we have compressed the two things into one (long) day. During
> these we list out all the virthosts/hardware machines, and the sysadmins
> take them, update them, reboot them and confirm they come up. Then at
> the end we look at nagios and clear up any alerts before calling it
> done. The reason for this separation was because they each had different
> 'users': We could notify the build outage to just devel-announce, the
> 'everything else' to announce. Of course now we have to announce the
> staging at least to centos folks due to keeping pkgs01.stg in sync via
> repospanner.
>
> One thing I have done a few times that I think helped a LOT as far as
> time is to actually just apply updates to everything before the reboot
> cycle. This saved us all the time waiting for updates to apply (which
> can sometimes on some machines take a really long time). Of course this
> means that machines run for a time with the updates, but with no
> restarted processes.
>
> I am not sure we will easily be able to separate out what 'manual'
> playbooks to run for what servers, and additionally in most cases
> updates on our apps are done outside our updates cycles (ie, pagure
> would update to 5.4 manually when it's out/desired, we wouldn't expect a
> pending update when we do our normal OS update cycles)
>
> So some radical ideas:
>
> * What if we just auto-apply all updates daily? (We already apply
> security updates daily on fedora instances.) This would break things
> from time to time, but I suspect only particular things, not everything
> all at once. We would also still need reboot cycles.
>
> * What if we just daily auto-apply security updates? (This would reduce
> the breakage from all updates some). Reboots still needed.
>
> * I have thought about someday doing reboots with no outage.
> Unfortunately, that requires database clustering. At the time the
> clustering was all horrible, but it might be better these days. If we
> did have that however, we could just do reboots when we liked.
>
> I'm not sure there's a great answer here... I think when we decide to do
> an update/reboot cycle we should apply updates up front to save time/pain,
> and I suppose if we can easily determine what manual playbooks to run we
> could run those too.
>
> kevin
>
>
>
> _______________________________________________
> infrastructure mailing list -- infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
> To unsubscribe send an email to infrastructure-leave@xxxxxxxxxxxxxxxxxxxxxxx
> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: https://lists.fedoraproject.org/archives/list/infrastructure@xxxxxxxxxxxxxxxxxxxxxxx

-- 
Stephen J Smoogen.