Re: SELinux RPM scriptlet issue announcement

Adam Williamson <awilliam@xxxxxxxxxx> · Mon, 20 Jan 2014 09:48:28 -0800

On Mon, 2014-01-20 at 12:17 -0500, Matthew Miller wrote:
> On Sun, Jan 19, 2014 at 08:10:29PM +0100, Michael Schwendt wrote:
> > A simple "yum -y update ; reboot ; Oh, everything seems to work" has not
> > been enough this time. And it was an update with a screen full of ticket
> > numbers for the included bug-fixes/changes. It could have broken something
> > else, too.
> 
> Once we have a better automation framework in place, we can have tests like:
> install selinux update, reboot, install (special) test package version 1,
> update to test package version 2. (In addition to a series of other things
> that should work with selinux enabled.)
> 
> 
> > Btw, some other packages are in the same boat. Imagine a graphics driver
> > update "seems to work" for three testers that are required for a +3 vote
> > in the updates system, but fails badly for a hundred other users once it
> > appears in the stable updates repo.
> 
> That's a little harder, of course.

So I've read through this thread now. A few notes:

1) The precise nature of the failure here makes it a tricky issue to
deal with. We already know that this kind of 'delayed action' bug is
hard to handle, because we have a whole, pretty well-known *category*
of similar bugs: scriptlet errors themselves. As Harald has pointed
out, scriptlet errors are very messy bugs that our current testing
process is very poor at catching.

If anyone's not familiar with the scriptlet error category, see
https://fedoraproject.org/wiki/Common_F20_bugs#preun-fail .

So while the idea of an SELinux-specific 'update it, then update it
again' test case seems to make superficial sense, it's not actually an
SELinux-specific test. We should in fact be doing this for *all*
updates, or at the least, all updates that include any scriptlets.

However, it's not even that simple, because this is something that makes
much more sense to test in an automated way than manually - even more so
than many things. This specific bug was a bit easier to test than the
scriptlet case, because you just had to update *any other* package after
updating selinux-policy to see the bug, but it's clearly in the same
category as the more difficult case, and we should come up with an
approach that handles them all. What looks like the right approach has
already been suggested in the FESCo ticket on this: an automated test
that takes the update, bumps the spec one revision and tries to update.
So if the update is foo-1.1-2, the test would build a foo-1.1-3 package
with no other changes, and try updating from 1.1-2 to 1.1-3. Doing this
manually is of course a PITA and it's really a _very_ clear candidate
for automation. Such a test would, I believe, have caught the bug.
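
To make the 'bump the spec one revision' step concrete, here is a
minimal sketch in Python of just the version arithmetic. It is not part
of any existing tool: the function names bump_release and
update_test_pair are hypothetical, and it assumes the release field
starts with an integer (as in foo-1.1-2 or 116.fc20). Rebuilding the
package and running the actual update are left out.

```python
import re


def bump_release(nvr: str) -> str:
    """Return the name-version-release string with its release bumped by one.

    Hypothetical helper: assumes the release (the part after the last '-')
    begins with an integer, optionally followed by a suffix such as '.fc20'.
    """
    name_version, sep, release = nvr.rpartition("-")
    if not sep or not name_version:
        raise ValueError(f"not a name-version-release string: {nvr!r}")
    match = re.match(r"(\d+)(.*)", release)
    if match is None:
        raise ValueError(f"release does not start with a number: {release!r}")
    number, suffix = match.groups()
    return f"{name_version}-{int(number) + 1}{suffix}"


def update_test_pair(nvr: str) -> tuple[str, str]:
    """Return (installed, rebuilt) builds for the 'update it again' test."""
    return nvr, bump_release(nvr)


if __name__ == "__main__":
    old, new = update_test_pair("foo-1.1-2")
    print(f"install {old}, rebuild unchanged as {new}, then update to {new}")
```

The point of keeping the rebuilt package otherwise identical is that any
failure in the second update can only come from the scriptlets or the
environment the first update left behind.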

As posted to FESCo, though, it's still the case that we're working on
the automation framework at present and the tests come after that. We
are aiming to have the framework operational for the F21 cycle, AIUI,
and it may be plausible to implement this test during that cycle. As
such a test has several very desirable attributes - i.e. it catches bugs
which:

1) cause serious problems that are difficult to recover from
2) we are currently very bad at catching manually
3) would be difficult and onerous to reliably catch manually even with
improved manual testing procedures

I'd suggest this test should be a high priority for implementation once
taskotron is operational, perhaps equal in importance to re-implementing
the current AutoQA tests.

(Harald is probably correct to note that another bug of precisely this
type might result in 'innocent' updates being 'blamed' for being broken,
but we'd at least have a clear indication that something was seriously
boned, and could investigate/clean up manually - the proposed automated
test wouldn't make anything worse than it currently is).

1b) Just in case anyone had forgotten, though, we do have the
infrastructure for creating package-specific test cases that get
integrated with Bodhi to an extent, even though I don't think that's the
way to go in this particular case: see
https://fedoraproject.org/wiki/QA:SOP_package_test_plan_creation .

2) I already suggested to the SELinux devs on test@ that perhaps
selinux-policy updates should have a higher autokarma threshold, and
they agreed this might be a good idea. It would also be possible for
them to disable autopush for selinux-policy updates and handle pushing
them manually, based on whatever policy they choose, though of course
that's more work than using autopush.
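
As a rough illustration of those two options, here is a sketch in
Python of a per-package push decision. This is not Bodhi's code or API:
the names, the default threshold of 3 (taken from the '+3 vote'
mentioned earlier in the thread) and the selinux-policy threshold of 5
are all assumptions made up for the example.

```python
# Hypothetical values for illustration only; not real Bodhi configuration.
DEFAULT_THRESHOLD = 3
PACKAGE_THRESHOLDS = {"selinux-policy": 5}
AUTOPUSH_DISABLED: set[str] = set()  # packages whose maintainers push manually


def should_autopush(package: str, karma: int) -> bool:
    """Return True if an update with this karma would be pushed automatically."""
    if package in AUTOPUSH_DISABLED:
        # Maintainers handle the push themselves, whatever the karma is.
        return False
    threshold = PACKAGE_THRESHOLDS.get(package, DEFAULT_THRESHOLD)
    return karma >= threshold


if __name__ == "__main__":
    print(should_autopush("foo", 3))             # default threshold applies
    print(should_autopush("selinux-policy", 3))  # higher threshold applies
```

Adding "selinux-policy" to AUTOPUSH_DISABLED would model the second
option, where the maintainers push by hand under whatever policy they
choose.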

3) Someone noted that big selinux-policy updates are 'scary'. I think to
be fair to the SELinux devs it's worth noting they push big updates all
the time, with a very high success record. This is the first time I can
recall a bug anywhere near this serious happening with an SELinux update
to a stable release. AIUI, they have a very sensible policy for stable
release updates, which is that except in very exceptional cases, updates
can only make the policy *more liberal*, they cannot make it *tighter*.
The bug currently under discussion was caused by a change that came in
inadvertently, not intentionally, and was actually intended for Rawhide.

4) The FESCo ticket has an excellent and thoughtful discussion of the
proposal for a broad 'minimum time in updates-testing' policy to 'fix'
this problem - https://fedorahosted.org/fesco/ticket/1223#comment:5 -
and personally I agree with those who have commented on the ticket that
it is not the way to go.

5) Finally, my perennial note that the current update feedback system
(Bodhi 1.0) is nowhere near optimal. I think it's fair to say everyone
even casually related to the update process in any way is painfully
aware of this. I think Bodhi 2.0 has been just around the corner for,
um, three? four? years now. It's difficult to invest in trying to bodge
up improvements within the straitjacket of Bodhi 1.0's design (a single
numerical karma value, with only +1 and -1 adjustments being
'significant' so far as the tools and policies are concerned) when it
always seems like a drastically better design (Bodhi 2.0) is going to
arrive Real Soon Now. But I guess at *some* point we'd have to conclude
Bodhi 2.0 really isn't arriving Real Soon and go ahead and work with
what we have. I don't know how to quantify that point, though. All I
can do is reiterate that yes, this is a really significant pain point in
our current processes, the proposed Bodhi 2.0 design would make things
almost immeasurably better, and plead with anyone reading this who has
the power to bump up the importance of / resources assigned to Bodhi
2.0's development to do so.
-- 
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
