Re: Proposal: Use a calendar to limit rate of change

Kevin Fenzi <kevin@xxxxxxxxx> · Sat, 12 Aug 2017 10:25:17 -0700

On 08/11/2017 09:05 AM, Randy Barlow wrote:
> Hello!
> 
> I was chatting with puiterwijk this morning (well, I guess his afternoon
> ☺) about the difficult week the systems team has had. There were quite a
> few changes all happening at once this week:
> 
> * The update/reboot cycles
> * RHEL 7.4 was released
> * Pagure over dist-git was turned on
> * Bodhi was reconfigured to talk to Pagure instead of Pkgdb
> * A fedmsg update
> * More things?

* A NetApp failure at the start of one of the outages that broke all NFS
mounts; we had to wait until it was sorted out before we could move
forward with the outage.
* Because of the above, and because we were in a hurry, I updated
everything at once, but accidentally did the cloud machines too, causing
some issues that had to be sorted out before we were ready for them.

> Patrick told me that it's been a huge amount of work for him and I
> assume others this week, because many of the above didn't go so smoothly.

Yeah. Sadly so...

> I think it's expected and frankly normal for updates to not go smoothly
> - we are all stretched pretty thin and we don't have a formal QE team to
> make sure that our infra apps are ready to ship. Thus, I think we should
> expect changes like the above to be "bumpy" and instead should develop a
> plan to help smooth out the bumps.

Yeah. We always try to plan for things being rocky; this week a lot of
them just hit at once.
> 
> I propose that we use our infra calendar (or even a new calendar if
> preferred) to more formally schedule infrastructure changes, with the
> goal being to avoid weeks like this one (where so many large changes
> landed at once). i.e., if I want to make a Bodhi deployment on Monday
> and I see that there's a Pagure deployment already scheduled for Monday,
> I should consider a different day.

Seems reasonable. Some things are not easily in our control though: we
try to do mass update/reboot cycles the week before freezes if possible
(so we are in good shape if the freeze stretches out a bit), and we have
no control over new RHEL versions or updates. :) But yes, we could try
to control our own app updates a bit better.

> Of course, we can always have exceptions. Sometimes we might have "flag
> days" where we do want/need two apps to upgrade together. Or maybe
> sometimes we do want to take advantage of an update/reboot cycle to
> sneak in an app upgrade, if the systems team is comfortable of course.
> That's fine when needed, but when not needed I think spreading out the
> changes will help our systems team avoid insane weeks like this one.
> 
> Thoughts?

Sure, it's worth trying. ;)

I also think the idea of doing a mass update/reboot of staging every
Monday might help, as we could be more confident that things are in an
OK state before they hit prod.

kevin
