On 4/5/19 8:59 AM, Stephen John Smoogen wrote:
> Clement brought up that spring cleaning of our ansible playbooks would
> be a good idea. This is painfully obvious during our previous
> update/reboot cycles where we have had services not updated or
> restarted correctly so that systems did not come up well when we
> rebooted.
>
> I have opened https://pagure.io/fedora-infrastructure/issue/7695 which
> is the tracking ticket for this problem. I am proposing that we do
> major updates something like the following in the future. We can tweak
> as we find better ways to do them in clusters later.

Thanks for bringing this up smooge!

I completely agree we should clean up/make sure the manual/upgrade/*
playbooks are good and up to date.

> If you maintain a service, take that playbook and add comments for the
> following:
> a. Who is the current maintainer
> b. Date when that was last updated
> c. Who tested the upgrade and when
> d. General comments to explain what things are doing.
>
> If the playbook should be retired, removed, killed, etc please do so.

I think we need things like maintainer for all our apps. ;)

> My goal will be to make our update schedules something like this:
>
> Day 1:
> a. Run update playbooks on staging instances.
> b. Fix any problems shown by those.
> c. Run general update vhost_update on staging instances
> d. Reboot staging instances.
> e. Fix problems found from this.
>
> Day 2:
> a. Access if day 1 was a complete failure and stop upgrade cycle
> b. Run update playbooks on low priority systems
> c. Fix any problems shown by those.
> d. Run general update vhost_update on staging instances
> e. Reboot staging instances.
> f. Fix problems found from this.
>
> Day 3:
> a. Access if day 2 was a complete failure and stop upgrade cycle
> b. Run update playbooks on high priority systems
> c. Fix any problems shown by those.
> d. Run general update vhost_update on staging instances
> e. Reboot staging instances.
> f. Fix problems found from this.
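
For anyone who wants the quoted schedule in a more concrete form, here
is a minimal sketch of the day-by-day gating in Python. It is only an
illustration, not our actual tooling. Assumptions: the three days map to
staging, low priority and high priority tiers (the quoted text says
"staging" in days 2 and 3 as well, so that mapping is a guess), "complete
failure" means every host in the tier failed, and the host names and the
update_and_reboot callable are hypothetical stand-ins for running the
playbooks and rebooting.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class StageResult:
    stage: str
    failed_hosts: List[str]
    total_hosts: int

    @property
    def complete_failure(self) -> bool:
        # Assumed criterion: every host in the stage failed.
        return self.total_hosts > 0 and len(self.failed_hosts) == self.total_hosts


def run_cycle(
    stages: Sequence[Tuple[str, Sequence[str]]],
    update_and_reboot: Callable[[str], bool],
) -> List[StageResult]:
    """Run the stages in order, stopping after a completely failed stage.

    update_and_reboot(host) is a hypothetical callable that returns True
    when the host was updated, rebooted and came back up cleanly.
    """
    results: List[StageResult] = []
    for name, hosts in stages:
        # Step (a) of days 2 and 3: assess the previous day first.
        if results and results[-1].complete_failure:
            break
        failed = [host for host in hosts if not update_and_reboot(host)]
        results.append(StageResult(name, failed, len(hosts)))
    return results
```

A partially failed day does not stop the cycle in this sketch; the failed
hosts are just recorded so they can be fixed before the next day starts.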

Should all of those have staging? Or should it be staging then build
then the rest? Or staging, low pri, then high pri?

>
> This should cut down the extra long hours and extended outages we have
> needed to do in the last couple of reboot cycles.

Well, so some background (not for smooge as he knows all this, but
others reading):

In the past the way we did mass update/reboots has changed a few times.
The most recent incarnation has been doing staging on a Friday or
Monday, then doing the 'build' machines on one day (basically anything
on bvirthost) and then doing 'the rest' on the next day. Sometimes due
to time we have compressed the two things into one (long) day.

During these we list out all the virthosts/hardware machines, and the
sysadmins take them, update them, reboot them and confirm they come up.
Then at the end we look at nagios and clear up any alerts before calling
it done.

The reason for this separation was that they each had different
'users': we could announce the build outage to just devel-announce, and
the 'everything else' outage to announce. Of course now we have to
announce the staging outage at least to centos folks due to keeping
pkgs01.stg in sync via repospanner.

One thing I have done a few times that I think helped a LOT as far as
time is to actually just apply updates to everything before the reboot
cycle. This saved us all the time waiting for updates to apply (which
can sometimes take a really long time on some machines). Of course this
means that machines run for a time with the updates applied, but with no
restarted processes.

I am not sure we will easily be able to separate out what 'manual'
playbooks to run for what servers, and additionally in most cases
updates on our apps are done outside our update cycles (i.e., pagure
would update to 5.4 manually when it's out/desired; we wouldn't expect a
pending update when we do our normal OS update cycles).

So some radical ideas:

* What if we just daily auto-apply all updates?
  (We already apply security updates daily on Fedora instances.) This
  would break things from time to time, but I suspect only particular
  things, not everything all at once. We would also still need reboot
  cycles.

* What if we just daily auto-apply security updates? (This would reduce
  the breakage somewhat compared with applying all updates.) Reboots
  would still be needed.

* I once thought about the idea of someday doing reboots with no
  outage. Unfortunately, that requires database clustering. At the time
  the clustering was all horrible, but it might be better these days. If
  we did have that, however, we could just do reboots when we liked.

I'm not sure there's a great answer here... I think when we decide to do
an update/reboot cycle we should apply updates up front to save
time/pain, and I suppose if we can easily determine what manual
playbooks to run we could run those too.

kevin