On Fri, 2013-06-28 at 15:42 -0600, Chris Murphy wrote:

> But the final release series of RC's happen very quickly, and any
> allowed change is by definition significant (i.e. necessary) or it
> simply wouldn't happen, but that also makes the change higher risk
> than other changes. So I think more time padding is needed between an
> RC and go/nogo.

I think you may be labouring under a bit of a misapprehension about what should be tested, here.

The distinction between a TC and an RC is not large. An RC can only happen after freeze and must have all blockers fixed: if a build after freeze doesn't have all blockers addressed, we call it a TC.

We have gotten better at finding blockers earlier in recent releases. What this means is that we're doing fewer but _better_ RCs. Back around F14 or F15, our first 'RC' build was pretty much a joke; there was never any chance it was actually going to get released. We'd find five blockers in it straight away. I think in the last few release cycles, though, we've actually released RC1 or RC2 several times.

I don't really see there being some big distinction between TCs and RCs. If you want to make sure some workflow that's important to you is going to work it's really a good idea to follow the process from TC1, there is no mileage in jumping in at RC1, that's too late. (Never mind the people who, every release, seem to jump in and start testing on the day we do go/no-go and then kick up a fuss about whatever they find...not you, but it does seem to happen).

That doesn't quite apply to this specific case, as it happens, but it's an important point to make.

Getting down to specifics: the change that we believe broke this - trying to re-use an existing EFI system partition if one is present instead of always creating a new one - went into anaconda 19.30.10. TC6 had 19.30.9, RC1 had 19.30.11; 19.30.10 probably only went into smoke test builds and we found some problem which necessitated 19.30.11.
RC1 came out very early Tuesday morning (06-25) (2am Eastern time). If we assume this had been a blocker bug (which I still think it probably wasn't), that gave us about...62 hours to catch it before the sign-off happened. That is a pretty short timeframe, indeed.

If we want to identify one specific Thing That Went Wrong here, I would say it's that we probably shouldn't have taken a moderately significant behaviour change as late as that.

So let's look at that in a bit more detail:

https://bugzilla.redhat.com/show_bug.cgi?id=974543 is the bug that prompted this change. It was filed on 06-14 (though we'd been aware of the behaviour for rather longer). It was proposed as a freeze exception issue by bcl (anaconda developer) on 06-17: that effectively means anaconda team was of the opinion that they wanted this change to go in. It was reviewed for freeze exception status on 06-19. The log of the review meeting is at http://meetbot.fedoraproject.org/fedora-blocker-review/2013-06-19/f19final-blocker-review-7.2013-06-19-16.01.log.txt .
Here are the relevant bits extracted, since it's very short:

18:53:38 <adamw> https://bugzilla.redhat.com/show_bug.cgi?id=974543
18:54:20 <Viking-Ice> dances on the limit of blocker
18:56:37 <kparal> but we should definitely vote on 974543
18:56:48 <kparal> it's proposed and patches are ready
18:57:09 <adamw> +1 on 974543
18:57:20 <jreznik> +1 FE for 974543, seems like bcl wants this one
18:57:59 <tflink> #topic (974543) Anaconda is always creating new efi system partition
18:58:02 <tflink> #link https://bugzilla.redhat.com/show_bug.cgi?id=974543
18:58:04 <tflink> #info Proposed Freeze Exceptions, anaconda, NEW
18:58:10 <adamw> tflink: the patches are not sent to anaconda-devel-list so technically not 'post'ed
18:58:11 <adamw> +1
18:58:20 <kparal> +1 FE
18:58:20 <adamw> this is completely wrong behaviour and ought to be fixed
18:58:27 <nirik> +1 FE
18:58:31 <Viking-Ice> +1
18:58:35 <dgilmore> +1 FE
18:58:36 <jreznik> +1 FE
18:59:12 <adamw> shame to put it in this late, but otoh our 'multiboot uefi' story has never worked very well so unlikely to maek things worse
18:59:32 <tflink> proposed #agreed 974543 - AcceptedFreezeException - This behavior of creating new EFI partitions is not correct and should be fixed. A tested fix would be considered past freeze
18:59:33 <adamw> at some point we're going to run into the problem of what to do if there isn't enough space in the esp but we'll burn that bridge when we get to it
18:59:33 <adamw> ack
18:59:49 <jreznik> ack
18:59:57 <nirik> ack
18:59:59 <Viking-Ice> ack
18:59:59 <tflink> #agreed 974543 - AcceptedFreezeException - This behavior of creating new EFI partitions is not correct and should be fixed. A tested fix would be considered past freeze
19:00:00 <handsome_pirate> ack
19:00:20 <kparal> adamw: the files are very small, currently

So it sailed through review with +1s from four QA folks (myself, Kamil, Tim (implied) and Johann), one from releng (dgilmore) and one from the program manager (jreznik).
As has often been the case lately, no-one outside of those groups bothered to show up for the meeting. It had an implied +1 from the anaconda developers due to the fact that they had proposed it in the first place: we put a fairly high weight on that fact during review. I very perfunctorily mentioned that it was somewhat dangerous to poke it this late, but incorrectly (as it transpired) thought it was unlikely to make things any worse; I'm pretty sure at the time I just did not think of the possibility of anything like this bug arising.

The fix was committed to anaconda git one day later:

https://git.fedorahosted.org/cgit/anaconda.git/commit/?h=f19-branch&id=03be63fabad9aa52c7a19c68f289b248aa793bcc

committer David Lehman <dlehman@xxxxxxxxxx> 2013-06-20 20:01:39 (GMT)

We usually have a second 'gate' on FE issues at the point of composing an actual release image: the person requesting the compose (which is usually me) and the person doing it (which is usually dgilmore) tend to have a chat if either feels that it might now be too late to take one of the FE fixes safely. But this unfortunately doesn't really apply to anaconda changes, because they're basically a package deal. In a really extraordinary case we can go to the anaconda devs and ask them to back out a change we really don't want, but that's pretty rare and we didn't consider it in this case. So effectively we were committed to taking this change the moment it was committed to f19-branch in git.

It's kind of interesting that we didn't get a compose that included the change for four days after that. Koji tells us that anaconda 19.30.10 was built Mon, 24 Jun 2013 21:50:49 UTC and 19.30.11 Mon, 24 Jun 2013 23:27:23 UTC, so a delay between .10 and .11 wasn't the problem. Instead the delay was between the change being committed to git and a new anaconda build happening at all - four days, which is quite a lot for this late in the release cycle. But most of the explanation for that is fairly mundane: the weekend.
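
For anyone who wants to check the arithmetic on those gaps, here is a small Python sketch. The three timestamps are copied from this mail (the cgit commit time and the two Koji build times); the variable names are made up for illustration and are not anything from anaconda or Koji.

```python
from datetime import datetime, timezone

# Timestamps as quoted in this mail, all UTC/GMT.
# Variable names are illustrative only.
commit = datetime(2013, 6, 20, 20, 1, 39, tzinfo=timezone.utc)     # fix committed to f19-branch
build_10 = datetime(2013, 6, 24, 21, 50, 49, tzinfo=timezone.utc)  # anaconda 19.30.10 built in Koji
build_11 = datetime(2013, 6, 24, 23, 27, 23, tzinfo=timezone.utc)  # anaconda 19.30.11 built in Koji

commit_to_build = build_10 - commit    # gap between the commit and the first build
between_builds = build_11 - build_10   # gap between the .10 and .11 builds

# weekday(): Monday is 0, so Thursday is 3
print("commit weekday:", commit.weekday(), "build weekday:", build_10.weekday())
print("commit to 19.30.10:", commit_to_build)
print("19.30.10 to 19.30.11:", between_builds)
```

Run as-is, this should report a bit over four days from commit to the first build (Thursday to Monday) and a bit over an hour and a half between the two builds, which matches the reading above that the wait was for a build at all, not for .11.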
The commit was done in the middle of the day on a Thursday. I don't recall why a build wasn't done on the Friday: I think it may have been that we felt TC6 was a build we wanted to get a full validation run done on, and we wanted the next build to be an RC, and we had other blocker bugs to fix, so there wasn't felt to be any urgency to get a new compose out for testing. But for whatever reason, it wasn't. anaconda team does not work weekends, so the build happened on the next work day, Monday, and the RC1 compose happened shortly after.

So it's a bit hard to say that this or that party clearly made a mistake, but I think it'd be reasonable to say that ultimately 'we' - QA, releng, anaconda devs - may not have made the best call in deciding to take that change at a point when we were pretty far along in pre-release stabilization. But anaconda FEs are always a somewhat tricky call, and there was a clear and substantial upside to this one (it's really not right at all to go around creating new ESPs on systems that already have them, and we were aware of cases where this just messed up boot, possibly even of *other* OSes). So I don't think it was egregiously the wrong decision. In practice it does not seem to have turned out for the best, though.

--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | identi.ca: adamwfedora
http://www.happyassassin.net

--
devel mailing list
devel@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/devel