Re: Proposal: Ansible Spring Cleaning: Update/upgrade scripts

Kevin Fenzi <kevin@xxxxxxxxx> · Fri, 5 Apr 2019 10:15:11 -0700

On 4/5/19 8:59 AM, Stephen John Smoogen wrote:
> Clement brought up that spring cleaning of our ansible playbooks would
> be a good idea. This is painfully obvious during our previous
> update/reboot cycles where we have had services not updated or
> restarted correctly so that systems did not come up well when we
> rebooted.
> 
> I have opened https://pagure.io/fedora-infrastructure/issue/7695 which
> is the tracking ticket for this problem.  I am proposing that we do
> major updates something like the following in the future. We can tweak
> as we find better ways to do them in clusters later.

Thanks for bringing this up smooge!

I completely agree we should clean up/make sure the manual/upgrade/*
playbooks are good and up to date.

> If you maintain a service, take that playbook and add comments for the
> following:
> a. Who is the current maintainer
> b. Date when that was last updated
> c. Who tested the upgrade and when
> d. General comments to explain what things are doing.
> 
> If the playbook should be retired, removed, killed, etc please do so.

I think we need things like maintainer for all our apps. ;)

> My goal will be to make our update schedules something like this:
>
> Day 1:
> a. Run update playbooks on staging instances.
> b. Fix any problems shown by those.
> c. Run general update vhost_update on staging instances
> d. Reboot staging instances.
> e. Fix problems found from this.
> 
> Day 2:
> a. Assess whether day 1 was a complete failure and, if so, stop the upgrade cycle
> b. Run update playbooks on low priority systems
> c. Fix any problems shown by those.
> d. Run general update vhost_update on staging instances
> e. Reboot staging instances.
> f. Fix problems found from this.
> 
> Day 3:
> a. Assess whether day 2 was a complete failure and, if so, stop the upgrade cycle
> b. Run update playbooks on high priority systems
> c. Fix any problems shown by those.
> d. Run general update vhost_update on staging instances
> e. Reboot staging instances.
> f. Fix problems found from this.

Should all of those say staging? Or should it be staging, then build,
then the rest? Or staging, low pri, then high pri?
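
To make the ordering question concrete, here is a rough sketch of the
day-by-day cycle with the "complete failure" gate from the proposal. The
stage names, the run_cycle function and the update_and_reboot callback
are all made up for illustration; this is not an existing script, and the
stage order is just one of the options above.

```python
from typing import Callable, Dict, List

# Hypothetical stage order; the thread has not settled on one.
STAGES: List[str] = ["staging", "low_priority", "high_priority"]


def run_cycle(
    hosts_by_stage: Dict[str, List[str]],
    update_and_reboot: Callable[[str], bool],
) -> List[str]:
    """Run stages in order, one per day; return the stages that completed.

    update_and_reboot(host) is a placeholder for running the update
    playbooks, applying updates, and rebooting. It returns True when the
    host came back healthy.
    """
    completed: List[str] = []
    for stage in STAGES:
        results = [update_and_reboot(h) for h in hosts_by_stage.get(stage, [])]
        # Gate: if every host in the stage failed, treat the day as a
        # complete failure and stop the cycle before the next stage.
        if results and not any(results):
            break
        completed.append(stage)
    return completed
```

Swapping the order (for example staging, build, rest) would only mean
changing STAGES; the gate logic stays the same.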
> 
> This should cut down the extra long hours and extended outages we have
> needed to do in the last couple of reboot cycles.

Well, so some background (not for smooge as he knows all this, but
others reading):

The way we do mass update/reboots has changed a few times in the past.
The most recent incarnation has been doing staging on a Friday or
Monday, then doing the 'build' machines on one day (basically anything
on bvirthost) and then doing 'the rest' on the next day. Sometimes due
to time we have compressed the two things into one (long) day. During
these we list out all the virthosts/hardware machines, and the sysadmins
take them, update them, reboot them and confirm they come up. Then at
the end we look at nagios and clear up any alerts before calling it
done. The reason for this separation was that they each had different
'users': We could notify the build outage to just devel-announce, the
'everything else' to announce. Of course now we have to announce the
staging at least to centos folks due to keeping pkgs01.stg in sync via
repospanner.
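
As a sketch of the build/rest split described above: the function below
sorts guests into 'build' (anything on a bvirthost) and 'rest', and maps
each group to the list it gets announced on. The function name and the
guest-to-virthost mapping format are assumptions for illustration; our
real inventory is not structured this way.

```python
from typing import Dict, List

# Announcement list per group, as described above.
ANNOUNCE_LIST = {"build": "devel-announce", "rest": "announce"}


def split_by_group(guest_to_virthost: Dict[str, str]) -> Dict[str, List[str]]:
    """Split guests into 'build' (hosted on a bvirthost) and 'rest'."""
    groups: Dict[str, List[str]] = {"build": [], "rest": []}
    # Sort by guest name so the output order is stable.
    for guest, virthost in sorted(guest_to_virthost.items()):
        group = "build" if virthost.startswith("bvirthost") else "rest"
        groups[group].append(guest)
    return groups
```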

One thing I have done a few times that I think helped a LOT as far as
time is to actually just apply updates to everything before the reboot
cycle. This saved us all the time waiting for updates to apply (which
can sometimes on some machines take a really long time). Of course this
means that machines run for a time with the updates, but with no
restarted processes.
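
A back-of-the-envelope way to see the saving, assuming (unrealistically)
that hosts are handled one after another and using made-up per-host
timings. The function and the numbers are purely illustrative.

```python
from typing import List, Tuple

# (update_minutes, reboot_minutes) per host; all numbers are made up.
Host = Tuple[int, int]


def outage_minutes(hosts: List[Host], preapplied: bool) -> int:
    """Total in-window minutes if hosts are handled one after another."""
    total = 0
    for update_min, reboot_min in hosts:
        if not preapplied:
            # Updates are applied inside the outage window.
            total += update_min
        total += reboot_min
    return total


hosts = [(30, 5), (10, 5), (45, 10)]
print(outage_minutes(hosts, preapplied=False))  # 105
print(outage_minutes(hosts, preapplied=True))   # 20
```

With several sysadmins working in parallel the absolute numbers shrink,
but the update time still drops out of the window entirely.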

I am not sure we will easily be able to separate out what 'manual'
playbooks to run for what servers, and additionally in most cases
updates on our apps are done outside our update cycles (i.e., pagure
would update to 5.4 manually when it's out/desired, we wouldn't expect a
pending update when we do our normal OS update cycles).

So some radical ideas:

* What if we just auto-apply all updates daily? (We already apply
security updates daily on Fedora instances.) This would break things
from time to time, but I suspect only particular things, not everything
all at once. We would also still need reboot cycles.

* What if we just auto-apply security updates daily? (This would cause
less breakage than applying all updates.) Reboots still needed.

* I have thought about someday doing reboots with no outage.
Unfortunately, that requires database clustering. At the time the
clustering was all horrible, but it might be better these days. If we
did have that however, we could just do reboots when we liked.
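
For the first two ideas, the difference comes down to which command the
daily job runs. A minimal sketch, assuming dnf-based hosts; the function
name and the policy names are made up, and this is not how the existing
daily security updates are implemented.

```python
from typing import List


def daily_update_command(policy: str) -> List[str]:
    """Build the argv for an unattended daily update.

    policy is "all" or "security"; both names are illustrative.
    """
    argv = ["dnf", "-y", "upgrade"]
    if policy == "security":
        # --security limits the upgrade to packages with security fixes.
        argv.append("--security")
    elif policy != "all":
        raise ValueError(f"unknown policy: {policy!r}")
    return argv
```

Either way the running processes keep using the old code until they are
restarted, so this does not remove the need for reboot cycles.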

I'm not sure there's a great answer here... I think when we decide to do
an update/reboot cycle we should apply updates up front to save time/pain,
and I suppose if we can easily determine what manual playbooks to run we
could run those too.

kevin
