On Fri, 5 Apr 2019 at 13:15, Kevin Fenzi <kevin@xxxxxxxxx> wrote:
>
> On 4/5/19 8:59 AM, Stephen John Smoogen wrote:
> > Clement brought up that spring cleaning of our ansible playbooks would
> > be a good idea. This is painfully obvious during our previous
> > update/reboot cycles where we have had services not updated or
> > restarted correctly so that systems did not come up well when we
> > rebooted.
> >
> > I have opened https://pagure.io/fedora-infrastructure/issue/7695 which
> > is the tracking ticket for this problem. I am proposing that we do
> > major updates something like the following in the future. We can tweak
> > as we find better ways to do them in clusters later.
>
> Thanks for bringing this up smooge!
>
> I completely agree we should clean up/make sure the manual/upgrade/*
> playbooks are good and up to date.
>
> > If you maintain a service, take that playbook and add comments for the
> > following:
> > a. Who is the current maintainer
> > b. Date when that was last updated
> > c. Who tested the upgrade and when
> > d. General comments to explain what things are doing.
> >
> > If the playbook should be retired, removed, killed, etc please do so.
>
> I think we need things like maintainer for all our apps. ;)
>
> > My goal will be to make our update schedules something like this:
> >
> > Day 1:
> > a. Run update playbooks on staging instances.
> > b. Fix any problems shown by those.
> > c. Run general update vhost_update on staging instances
> > d. Reboot staging instances.
> > e. Fix problems found from this.
> >
> > Day 2:
> > a. Assess if day 1 was a complete failure and stop upgrade cycle
> > b. Run update playbooks on low priority systems
> > c. Fix any problems shown by those.
> > d. Run general update vhost_update on staging instances
> > e. Reboot staging instances.
> > f. Fix problems found from this.
> >
> > Day 3:
> > a. Assess if day 2 was a complete failure and stop upgrade cycle
> > b. Run update playbooks on high priority systems
> > c. Fix any problems shown by those.
> > d. Run general update vhost_update on staging instances
> > e. Reboot staging instances.
> > f. Fix problems found from this.
>
> Should all of those have staging? Or should it be staging then build
> then the rest? or staging, low pri, then high pri?

So I was leaving the definitions vague as some build boxes are high
priority (no redundancy or major outage) and some are low priority
because they have a large amount of redundancy. I was figuring

low priority:
 - proxies not shared on high priority external virthosts
 - builders and other high redundancy build systems
 - openshift systems IF updated and drained properly
 - other external services which have low SLE

high priority:
 services with no redundancy:
 - databases
 - pagure
 - src
 - etc etc
 services with high outage effects
 - koji?
 - etc etc

> >
> > This should cut down the extra long hours and extended outages we have
> > needed to do in the last couple of reboot cycles.
>
> Well, so some background (not for smooge as he knows all this, but
> others reading):
>
> In the past the way we did mass/update reboots has changed a few times.
> The most recent incarnation has been doing staging on a friday or
> monday, then doing the 'build' machines on one day (basically anything
> on bvirthost) and then doing 'the rest' on the next day. Sometimes due
> to time we have compressed the two things into one (long) day. During
> these we list out all the virthosts/hardware machines, and the sysadmins
> take them, update them, reboot them and confirm they come up. Then at
> the end we look at nagios and clear up any alerts before calling it
> done. The reason for this separation was because they each had different
> 'users': We could notify the build outage to just devel-announce, the
> 'everything else' to announce. Of course now we have to announce the
> staging at least to centos folks due to keeping pkgs01.stg in sync via
> repospanner.
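
To make the ordering question concrete, here is a rough sketch of the "staging, low pri, then high pri" variant with the "stop the cycle after a complete failure" check from the proposed schedule. It is only an illustration, not an existing playbook or tool: the host names, the update_and_reboot callback, and the reading of "complete failure" as "every host in the stage failed" are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Stage order assumed here: staging first, then low priority, then high priority.
STAGE_ORDER = ["staging", "low", "high"]


@dataclass
class StageResult:
    stage: str
    attempted: List[str]
    failed: List[str]

    @property
    def complete_failure(self) -> bool:
        # Assumption: a "complete failure" means every attempted host failed.
        return bool(self.attempted) and len(self.failed) == len(self.attempted)


def run_cycle(hosts_by_stage: Dict[str, List[str]],
              update_and_reboot: Callable[[str], bool]) -> List[StageResult]:
    """Run the stages in order and stop after a completely failed stage."""
    results: List[StageResult] = []
    for stage in STAGE_ORDER:
        hosts = hosts_by_stage.get(stage, [])
        # update_and_reboot is a hypothetical callback: True if the host came back fine.
        failed = [h for h in hosts if not update_and_reboot(h)]
        result = StageResult(stage, list(hosts), failed)
        results.append(result)
        if result.complete_failure:
            break  # do not carry a broken update on to the next stage
    return results


# Hypothetical inventory; pretend every low priority host fails to come back.
inventory = {
    "staging": ["app01.stg", "db01.stg"],
    "low": ["builder01", "proxy05"],
    "high": ["db01", "koji01"],
}
broken = {"builder01", "proxy05"}
for r in run_cycle(inventory, lambda host: host not in broken):
    print(r.stage, "failed:", r.failed)
# staging failed: []
# low failed: ['builder01', 'proxy05']
```

In this run the high priority stage is never attempted, which is the point of the day 2 and day 3 "assess" step.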
>
> One thing I have done a few times that I think helped a LOT as far as
> time is to actually just apply updates to everything before the reboot
> cycle. This saved us all the time waiting for updates to apply (which
> can sometimes on some machines take a really long time). Of course this
> means that machines run for a time with the updates, but with no
> restarted processes.
>
> I am not sure we will easily be able to separate out what 'manual'
> playbooks to run for what servers, and additionally in most cases
> updates on our apps are done outside our updates cycles (ie, pagure
> would update to 5.4 manually when it's out/desired, we wouldn't expect a
> pending update when we do our normal OS update cycles)
>
> So some radical ideas:
>
> * What if we just daily auto-apply all updates. (We already do daily
> apply security updates on fedora instances). This would break things
> from time to time, but I suspect only particular things, not everything
> all at once. We would also still need reboot cycles.
>
> * What if we just daily auto-apply security updates? (This would reduce
> the breakage from all updates some). Reboots still needed.
>
> * I thought of the idea someday of doing reboots and having no outage.
> Unfortunately, that requires database clustering. At the time the
> clustering was all horrible, but it might be better these days. If we
> did have that however, we could just do reboots when we liked.
>
> I'm not sure there's a great answer here... I think when we decide to do
> an update/reboot cycle we should apply updates up front to save time/pain
> and I suppose if we can easily determine what manual playbooks to run we
> could run those too.
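
As a rough illustration of why applying updates up front helps, the sketch below lists what is left to do inside the outage window for each virthost, depending on which machines already have their updates applied. Everything in it is hypothetical: the host names, the step names, and the assumption that guests go down and come back together with their virthost.

```python
from typing import Dict, List, Set, Tuple

Step = Tuple[str, str]  # (action, host)


def outage_window_steps(guests_by_virthost: Dict[str, List[str]],
                        already_updated: Set[str]) -> List[Step]:
    """List the work left for the outage window, one virthost at a time."""
    steps: List[Step] = []
    for virthost, guests in guests_by_virthost.items():
        for host in [*guests, virthost]:
            if host not in already_updated:
                steps.append(("update", host))  # the slow part, ideally done beforehand
        # Assumption: rebooting the virthost restarts its guests as well.
        steps.append(("reboot", virthost))
        for host in [virthost, *guests]:
            steps.append(("verify", host))  # confirm the machine came back up
    return steps


layout = {"virthost01": ["app01", "app02"]}  # hypothetical names
print(len(outage_window_steps(layout, set())))                              # 7
print(len(outage_window_steps(layout, {"virthost01", "app01", "app02"})))   # 4
```

With everything updated ahead of time, only the reboot and the checks remain in the window. The trade-off mentioned above still applies: until the reboot, machines run with updated packages but processes that have not been restarted.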
>
> kevin
>
> _______________________________________________
> infrastructure mailing list -- infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
> To unsubscribe send an email to infrastructure-leave@xxxxxxxxxxxxxxxxxxxxxxx
> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: https://lists.fedoraproject.org/archives/list/infrastructure@xxxxxxxxxxxxxxxxxxxxxxx

-- 
Stephen J Smoogen.