This came up in a different venue, and pingou and I have continued to talk
about it. This seemed like the right place to bring the discussion, though.

Some observations:

* Pkgdb2 and a call for testing in staging were announced well in advance
  of the deployment to production (good), but not everyone understood that
  we were going to be breaking API (bad).
* There were people both inside and outside of Fedora infrastructure who
  were surprised by the API break. There were also some community members
  and infrastructure members who heeded the call for testing, gave
  feedback, and ported before the deployment.
* There was a FAS2 update that pkgdb2 depended upon. That was also pending
  in stg for a long time and also had some minor API changes (IIRC, all
  unintentional. Last week I hotfixed one of them that was simply a bug).
  These also caused issues for some scripts.
* Unexpected problems: we had things that we didn't know used the pkgdb
  API, things that weren't tested in stg because stg couldn't replicate
  that part of production, and things that were ported but where mistakes
  caused the ported scripts to not be deployed or to point at stg instead
  of production.

I saw that we had the right people on IRC throughout the day working on
analyzing and patching all of the broken things. However, this was
somewhat by accident, and some of those people were surprised that they
spent their day doing this.

Some ideas for doing major deployments in the future:

1: We have to make people aware when a new deployment means API breaks.

   * Be clear in every call for testing that the new deployment means API
     breaks. Send announcements to the infrastructure list and, depending
     on the service, to the devel list.
   * Have a separate announcement besides the standard outage notification
     that says that an API-breaking update is planned for $date.
   * When we set a date for the new deployment, discuss it at least once
     in a weekly infrastructure meeting.
   * See also the solution in #3 below.

2: It would be really nice for people to do more testing in stg.

   * Increase rube coverage. rube does end-to-end testing, so it's better
     at catching cross-app issues caused by API changes than unittests,
     which try to be small and self-contained.
     - A flock session where everyone/dev in infra gets to write one rube
       test so we get to know the framework
   * Run rube daily.
     - Could we run rube in an Xvfb on an infrastructure host?
   * Continue to work towards a complete replica of production in the stg
     environment.

3: "Mean time to repair is more important than mean time between failure."
   It seems like any time there's a major update there are unexpected
   things that break. Let's anticipate the unexpected happening.

   * Explicitly plan for everyone to spend their day firefighting when we
     make a major new deployment. If you've already found all the places
     your code is affected and pre-ported it, and the deployment goes
     smoothly, then hey, you've got 6 extra working hours to shift back to
     doing other things. If it's not smooth, then we've planned to have
     the attention of the right people for the unexpected difficulties
     that arise.
   * As part of this, we need to identify people outside of infrastructure
     who should also be ready for breakage. Reach out to rel-eng, docs,
     qa, cvsadmins, etc. if there's a chance that they will be affected.

4: Related to the FAS release: Buggy code happens. How can we make it
   happen less?

   * More unittests would be good; however, we know from experience with
     bodhi that unittests don't catch a lot of things that are changes in
     behaviour rather than true "bugs". Unexpected API changes that cause
     people porting pain can be as simple as returning None instead of an
     empty list, which causes a no-op iteration in running code to fail
     while the unittests survive because they're only checking that "no
     results were returned".
   * Pingou has championed making API calls and WebUI calls into separate
     URL endpoints.
     I think that coding style makes it easier to control bugs related to
     updating the webui while trying to preserve the API, so we probably
     want to move to that model as we move on to the next major version of
     our apps.
   * Not returning json-ified versions of internal data structures (like
     database tables) but instead parsing the results and returning a
     specific structure would also help divorce internal changes from the
     external API.

What should we apply this to?

* Probably can skip:
  - Things that we don't think have API breaks
  - Things that are minor releases (hopefully these would correlate with
    not having API breaks :-)
  - Leaf services that are not essential to releasing Fedora.
    + ask, nuancier, elections, easyfix, badges, paste
    + There are a lot of borderline cases too -- is fedocal essential
      enough to warrant being under this policy? Since the wiki is used
      via its API, should that fall under this as well?

Comments, thoughts, other ideas? Do we need to "ratify" something like
this at a meeting? What's the next app deploy where we'll want to enact
this? Maybe bodhi2 ;-)?

-Toshio
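
P.S. To make the None-versus-empty-list point from item 4 concrete, here
is a minimal sketch. Everything in it is hypothetical: the function names,
the owner "alice", and the package names are made up for illustration and
are not taken from pkgdb, FAS, or bodhi. It only shows how a check for
"no results were returned" can pass against both behaviours while a
caller that iterates over the result breaks.

```python
# Hypothetical data and names, for illustration only.
_PACKAGES = {"alice": ["foo", "bar"]}


def list_packages_old(owner):
    """Old behaviour: always returns a list, empty when nothing matches."""
    return _PACKAGES.get(owner, [])


def list_packages_new(owner):
    """New behaviour: accidentally returns None when nothing matches."""
    return _PACKAGES.get(owner)


def weak_check(result):
    """Mimics a unittest that only asserts 'no results were returned'.

    Both [] and None are falsy, so this cannot tell them apart.
    """
    return not result


def client_script(list_packages, owner):
    """A caller that relies on iterating an empty result being a no-op."""
    names = []
    for name in list_packages(owner):  # raises TypeError if given None
        names.append(name.upper())
    return names


if __name__ == "__main__":
    # The weak check passes for both versions...
    print(weak_check(list_packages_old("nobody")))
    print(weak_check(list_packages_new("nobody")))
    # ...but only the old version survives in the calling script.
    print(client_script(list_packages_old, "nobody"))
    try:
        client_script(list_packages_new, "nobody")
    except TypeError as exc:
        print("client broke:", exc)
```

A stricter test that compares against the exact expected value (an empty
list) would catch this particular change, which is one argument for
testing the shape of what the API returns and not just its truthiness.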
_______________________________________________
infrastructure mailing list
infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/infrastructure