Fedora 20 release day FedUp bug: post-mortem

Hi, folks. Now that things have calmed down a bit in Fedora 20 and
Rawhide, I have time to write this mail!

Many of you may already know that there was a significant issue with
upgrades to Fedora 20 around release day - 2013-12-17.

Summary of the issue
--------------------

Upgrading to Fedora 20 using version 0.7 of the FedUp tool does not
work. Upgrading with version 0.8 works (in the main - of course there
are bugs, there are always bugs).

At the time Fedora 20 was released, version 0.7 of FedUp was present in
the Fedora 18 and Fedora 19 'updates' repositories. Version 0.8 of FedUp
was present in 'updates-testing' for both Fedora 18 and Fedora 19 at the
time.

Immediate response to the issue
-------------------------------

We realized quite quickly during the course of release day support that
this was the case, though at first we thought perhaps only some upgrades
were failing. Once it became clear that all 0.7-based upgrades would
fail, several folks worked hard at communicating this to as many users
in as many places as possible, including via IRC, mailing lists, the
Common Bugs page
(https://fedoraproject.org/wiki/Common_F20_bugs#fedup-07-fail ), the
forums, and social network sites like G+. We advised using fedup 0.8
from updates-testing to upgrade.

We rapidly ensured 0.8 was submitted for stable push for both F18 and
F19. It was submitted for F19 at 2013-12-17 21:12:18 (I believe Bodhi
timestamps are UTC, so that was mid-afternoon on release day in NA) and
for F18 at 2013-12-18 11:51:47 (early morning on the day after release).

However, release engineering complications (there were some problems
with stable pushes at the time) meant it wasn't finally pushed until
2013-12-19 07:23:09 UTC for F19 (late on the day after release NA time)
and 2013-12-19 14:05:50 UTC for F18 (early morning two days after
release) and wouldn't have made it to most mirrors until 2013-12-19, two
days after release, and probably 2013-12-20 in 'early' timezones in
Europe and Asia.
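
To put numbers on that delay, the gap between each stable request and
the eventual push can be worked out from the timestamps quoted above.
This is only an illustrative sketch: it assumes all four timestamps are
UTC, and the names PUSHES and push_delay are made up for the example.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (submitted for stable, actually pushed stable), copied from the text above.
# Assumption: every timestamp is UTC.
PUSHES = {
    "F19": ("2013-12-17 21:12:18", "2013-12-19 07:23:09"),
    "F18": ("2013-12-18 11:51:47", "2013-12-19 14:05:50"),
}


def push_delay(submitted, pushed):
    """Return the time between the stable request and the stable push."""
    return datetime.strptime(pushed, FMT) - datetime.strptime(submitted, FMT)


for release, (submitted, pushed) in PUSHES.items():
    print(release, push_delay(submitted, pushed))
```

On these figures the F19 update waited roughly 34 hours and the F18
update roughly 26 hours between being requested for stable and actually
being pushed.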

Proximate cause of the issue
----------------------------

We have not yet identified the direct (proximate) cause of the bug;
doing so did not seem especially important in comparison to ensuring
news of the issue was spread as widely as possible, ensuring 0.8 was
sent stable as soon as possible, and resolving some related issues (see
later). However, QA's current inference is that there is some
incompatibility between how fedup 0.7 modifies the initramfs used by the
upgrade process and/or how it configures the upgrade boot environment,
and the expectations of the upgrade environment as it exists within the
final shipped upgrade initramfs. The upgrade initramfs is generated as
part of the release compose process, and is dependent on factors
including the versions of dracut and fedup-dracut used to build it.
Broadly, we suspect that an upgrade run with fedup 0.7 which uses an
upgrade initramfs generated with fedup-dracut 0.8 will not work, for
reasons not yet identified.

Indirect causes of the issue
----------------------------

We could perhaps make a very broad characterization of the 'indirect
causes' of the issue as follows: an upgrade using fedup depends on
several moving parts, and neither our development nor testing processes
are sufficiently robust to ensure that we cover all possible
combinations of those parts.

	fedup / fedup-dracut interdependencies
	++++++++++++++++++++++++++++++++++++++

So far as I can discern, there is not at present any policy (whether
written or enforced by some kind of mechanism) with regard to the
inter-dependencies between the 'fedup' package side of the fedup process
and the 'fedup-dracut' side of the process which involves release
engineering generating an upgrade initramfs via fedup-dracut. As this
issue suggests that not all 'fedups' work with all 'fedup-dracuts',
perhaps this is something that might be required, but we leave that to
the superior knowledge and expertise of the FedUp maintainer.
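
To illustrate what covering the combinations would mean in practice,
here is a minimal sketch that enumerates every pairing of fedup client
and fedup-dracut version and flags the one we currently suspect is
broken. It is not part of fedup or of any QA tooling; the function name
and the "pre-0.8" label for the older fedup-dracut are placeholders
invented for the example.

```python
from itertools import product

# Client versions named in this mail.
FEDUP_CLIENTS = ["0.7", "0.8"]
# fedup-dracut versions used to build the upgrade initramfs.
# "pre-0.8" is a placeholder: the older version number is not given here.
FEDUP_DRACUTS = ["pre-0.8", "0.8"]
# (client, dracut) pairs suspected not to work, per the inference above.
SUSPECTED_BAD = {("0.7", "0.8")}


def upgrade_test_matrix(clients, dracuts, suspected_bad=frozenset()):
    """List every client/initramfs pairing with its current status."""
    return [
        (client, dracut,
         "suspected-bad" if (client, dracut) in suspected_bad else "to-test")
        for client, dracut in product(clients, dracuts)
    ]


for row in upgrade_test_matrix(FEDUP_CLIENTS, FEDUP_DRACUTS, SUSPECTED_BAD):
    print(row)
```

Even with only two versions on each side there are four pairings, and
the pre-release test procedure exercised just one of them.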

	Test procedure inadequacies
	+++++++++++++++++++++++++++

Similarly, QA's upgrade testing process clearly did not consider the
same issue carefully enough. This is something we have now moved to
address.

Prior to Fedora 20's release, the test cases for fedup recommended
testing the latest version of fedup from updates-testing against the
upgrade initramfs from the development/20 tree. This procedure was a
holdover from the very early days of FedUp, when it was changing daily
and testing anything older was uninteresting, and when procedures for
the generation and publishing of the upgrade initramfs had not yet been
clearly established (and TC/RC trees did not contain one). However, it
is no longer appropriate for the more mature state of FedUp development
at this point in time, and it should have been changed earlier. We in QA
apologize to the project for this oversight.

	Other factors
	+++++++++++++

Additionally, various parties have noted in discussion of this issue
that we would have been more likely to notice it, even with our
imperfect testing procedures, if a couple of other factors had been
different:

* The lifetime of the final release candidate
* The timing of changes to fedup

In recent Fedora releases it has become something of a habit (for which
I personally bear rather a large share of the blame) for us to reach RC
stage late, iterate RCs rapidly, and often ship an RC that was built
only days or even hours before the Go/No-Go decision. This has allowed
us to fix bugs we might not otherwise have fixed and to avoid release
delays.

However, it has the obvious danger that testing of the final release
bits may not be as comprehensive as it could be. We always ensure the
formal validation testing is sufficiently complete, but an issue like
this highlights that a few more days of testing are likely to catch
things the formal validation testing process may miss for various
reasons, including the kind of deficiency noted above. Even though our
test procedure for fedup was outdated, if the final RC had lived for two
or three days before being signed off, someone would likely have
happened across this issue in time for something to be done about it.

The fact that fedup 0.8 and fedup-dracut 0.8 landed quite late in the
cycle is also relevant. fedup 0.8 was submitted for updates-testing on
2013-12-11; fedup-dracut was submitted on 2013-12-06, but the first
compose which used it was RC1, built on 2013-12-12 (there was a delay of
several days between TC5 and RC1, as blocker bugs kept appearing and
needing to be fixed before an RC1 could be spun). RC1.1 was signed off
for release on 2013-12-12. Even the mathematically-challenged will note
that this left us extremely limited time to spot the problem. (I had
tested upgrades with the updated fedup-dracut using a 'scratch built'
upgrade initramfs rather earlier, but I must have used fedup 0.8 rather
than 0.7 for my tests).
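
For anyone who would rather not do the sums by hand, the following
sketch counts the days between each of those events and the sign-off.
The dates are the ones given above; the names EVENTS and
days_before_sign_off are invented for the example.

```python
from datetime import date

SIGN_OFF = date(2013, 12, 12)  # RC1.1 signed off for release

# Dates as stated in the paragraph above.
EVENTS = {
    "fedup-dracut submitted": date(2013, 12, 6),
    "fedup 0.8 submitted for updates-testing": date(2013, 12, 11),
    "RC1 built": date(2013, 12, 12),
}


def days_before_sign_off(when):
    """Number of whole days between an event and the release sign-off."""
    return (SIGN_OFF - when).days


for name, when in EVENTS.items():
    print(f"{name}: {days_before_sign_off(when)} day(s) before sign-off")
```

The new fedup-dracut was only exercised in an official compose on the
same day that the release was signed off.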

Obviously, if these fairly significant version bumps had arrived
earlier, we might have had more time to identify issues in them. If you're
wondering how they were allowed to land so late, the answer is that they
fixed blocker bugs we had identified in earlier upgrade testing, and so
were allowed through the freeze. As we all know, it is difficult to
adhere strictly to 'best practices' with Fedora's extremely short
release cycles and ambitious pace of development, but of course it would
be best in future if we can manage to avoid landing significant changes
to fedup so late, a goal to which both QA and development groups can
contribute by identifying and fixing issues at an earlier stage.

As a 'meta' note, I think a factor that contributes to all of the above
factors may be a lack of understanding outside a very few people as to
precisely how the entire fedup process works: speaking personally, I
certainly wasn't acquainted with all the subtleties until investigating
this and other issues (not that I'd confidently claim to be an expert
even now!). I think that beyond Will Woods (obviously) and possibly Tim
Flink (who did a lot of early fedup testing) and Dennis Gilmore (who
tends to be the one generating the upgrade initramfs), possibly no-one
really understood the whole process entirely.

Related issues
--------------

It is probably worth noting a somewhat-related issue at this point.
fedup 0.8's major change compared to fedup 0.7 was that it introduced
checking of GPG signatures on update packages. To facilitate this, the
signing key for the release to which one is upgrading must be available
to fedup running on the release from which one is upgrading. Again, we
did not have this fully in place at the time of Fedora 20's release.

The fedora-release-19-5 update added Fedora 20's key to Fedora 19:
https://admin.fedoraproject.org/updates/FEDORA-2013-21411/fedora-release-19-5 .
It was submitted on 2013-11-14 and pushed stable on 2013-12-03, so this
was in place ahead of release.

However, for Fedora 18, the relevant update -
https://admin.fedoraproject.org/updates/FEDORA-2013-23598/fedora-release-18-6 -
was submitted on 2013-12-18 and pushed stable on 2013-12-22. Even then,
we had to add a signed .treeinfo file to the Fedora 18 repositories or
things *still* didn't work, which I think we did late on 2013-12-22 or
on 2013-12-23.

The fact that the keys weren't available for F18 was known around F20
release time, but it was not considered urgent by the parties involved:
we were not yet aware that fedup 0.7 simply would not work, and
consequently that making fedup 0.8 available and functional would be an
urgent matter. Release engineering also considered adding the keys for
Fedora 19 and 20 to Fedora 18 a delicate operation, and one which they
were not inclined to rush.

Post-release reports also make it clear that fedup will abort if GPG
keys for *any* repository fedup finds available for the target release
cannot be found. That is, if you have RPM Fusion or another popular
third party repository configured, it's quite likely your upgrade will
fail, because third party repos didn't have their signing keys lined up
(not surprising if we couldn't even entirely manage it ourselves). We
were not sufficiently aware of this behaviour before release, and did
not communicate it very well. The underlying causes of this are much the
same as the underlying causes of the main issue - the fedup which
enabled GPG checking landing very late, inadequate/incorrect test
procedures, and limited knowledge of the details of fedup operation
outside a small group of people.
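
The all-or-nothing behaviour described above can be sketched as follows.
This is not fedup's actual code, just a hypothetical model of the
observed behaviour; the function names, repository names and key path
are all illustrative.

```python
def repos_missing_keys(repo_keys):
    """Return the sorted names of repos with no signing key available.

    repo_keys maps a repository name to the path of its GPG key for the
    target release, or to None if no key could be found.
    """
    return sorted(name for name, key in repo_keys.items() if not key)


def can_upgrade(repo_keys):
    """Model the observed behaviour: one missing key blocks the upgrade."""
    return not repos_missing_keys(repo_keys)


# Hypothetical configuration: the Fedora key is present, but a third
# party repository has not yet provided a key for the target release.
configured = {
    "fedora": "/etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-20-x86_64",
    "rpmfusion-free": None,
}

print(can_upgrade(configured), repos_missing_keys(configured))
```

The point of the sketch is that a single repository without a key is
enough to stop the whole upgrade, even if every Fedora repository is
correctly set up.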

Addressing the problems
-----------------------

I've noted above that so far as specific code responses to any of these
issues go, we should probably defer to the wisdom of the maintainer.
However, I've filed a couple of intentionally vague and open-ended
tickets on fedup to provide a forum for action:

https://github.com/wgwoods/fedup/issues/42
https://github.com/wgwoods/fedup/issues/43

In terms of QA test procedures, we (QA) have already taken action that
should help guard against a repeat of this kind of issue in future. The
FedUp test cases - for instance,
https://fedoraproject.org/wiki/QA:Testcase_upgrade_fedup_cli_previous_desktop -
have been adjusted to recommend testing the latest fedup from stable or
updates (not from updates-testing), and to test against the current
TC/RC tree (not the daily-updated development/ tree), now that TC/RC
trees contain the upgrade initramfs image. The FedUp and Upgrading wiki
pages (https://fedoraproject.org/wiki/FedUp and
https://fedoraproject.org/wiki/Upgrading ) have also been updated to be
more consistent and correct for the current state of fedup, as has the
Installation Guide's section on upgrading. Our test procedures and
upgrade documentation should now be much more coherent and consistent
than they were just prior to Fedora 20's release.

In wider terms, this issue is another indicator on top of several
previous ones that we should redouble our efforts to get 'releaseable'
RCs built days ahead of go/no-go, rather than hours. That's a whole
story in itself, but this is something the parties involved are all
aware of and working on. Of course, the whole release process may look
somewhat different in a Fedora.next world, but as long as we have our
current release schedule and freeze policies, this issue is likely to
exist at least in essence.

It's also another good indicator that we should do whatever we can to
try and land major changes much earlier in the release cycle. This is
hardly a new observation, of course, nor an issue of which many relevant
people were previously unaware, and there are always good reasons why we
wind up landing the kitchen sink a week before release, but it's always
good to have another reminder.

On the positive side, the simple fact that this issue occurred has
probably led to a wider understanding of at least some of the details of
how fedup operates, and the fact that more people in the project have
that knowledge should aid us in future fedup development and testing: we
should be careful to keep that knowledge in mind as we build and test
future releases.

Conclusion
----------

Er, thanks for reading this far? :)
-- 
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net

-- 
devel mailing list
devel@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/devel
Fedora Code of Conduct: http://fedoraproject.org/code-of-conduct




