On Tue, 2016-07-05 at 22:17 +0300, Mohammed Tayeh wrote:
> Hi
> We need some to explain how to use openQA for me and new member
> in a short steps 😀😀

Hi Mohammed! :)

openQA isn't exactly something you 'use', in most cases. It's an automated test system that we've been using for the last couple of cycles; the main goal was to reduce the manual release validation testing workload. So all the tests we've implemented in openQA so far are automated versions of the release validation test cases - the same tests you see linked from the release validation pages, like:

https://fedoraproject.org/wiki/Test_Results:Fedora_25_Rawhide_20160704.n.0_Installation

Every result you see there from 'coconut' with the bot icon was actually produced by openQA.

The way it's set up at present, every time releng produces a compose - whether it's a nightly compose or a candidate of any kind - all the openQA tests are run for it. Each time an openQA test passes, a little intermediary between openQA and the wiki checks whether that compose was 'nominated' for testing - i.e. whether there are validation test pages on the wiki for it - and if so, converts the openQA result into one or more 'pass' wiki results. So ultimately, openQA saves us running those tests manually.

That was all we initially intended openQA to do, but since it runs on every compose, we took the opportunity to build out a couple of other things around it. You've probably seen the 'compose check report' emails that are sent to this list every time a compose is built: those list all failed openQA tests for the compose. The idea there was just to give people a convenient way to see roughly how well each day's compose is working - if lots and lots of tests failed, it's obviously pretty bad and you might want to avoid using it.

There's also the 'nightly compose finder' I wrote last cycle:

https://www.happyassassin.net/nightlies.html

The point of that is just to provide a convenient way to find the most recent compose of each image and, when possible, also the most recent compose of each image that passed all its tests. It takes test results from both openQA and autocloud, which is a separate automated testing system for cloud images.

So a lot of what openQA is intended to do is just sort of sit there, run tests, and provide us with information in various ways; you don't have to 'use' it, exactly. But you *can* interact with it for a few reasons. The most obvious is to look at the test results directly - where you get a lot more detail than in the email reports, the wiki or the nightly finder - and, when a test failed, figure out why and file a bug. :) Jan Sedlak and I already try to keep on top of this, but of course if anyone else wants to learn how to do it and help us out, that'd be great. Here's a quick starter guide!

The main starting pages in openQA (for me) are the overviews of results for a single compose. For instance, here's the overview for today's Rawhide nightly:

https://openqa.fedoraproject.org/tests/overview?distri=fedora&version=Rawhide&build=Fedora-Rawhide-20160705.n.0&groupid=1

You can find the last three composes from the front page, and you can click 'fedora' on the front page to get several more before that:

https://openqa.fedoraproject.org/group_overview/1

If you want to find the overview for an even older compose, you can take the URL for a newer one and just change the compose ID.
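If you'd rather script this sort of thing than click around, openQA also has a REST API that serves the same data the overview pages do. Here's a minimal sketch in Python - the query parameters and field names are how I understand the API to behave, so treat them as assumptions and check them against a real response:

    import requests

    # List every openQA job for one compose and print its flavor, test
    # name, arch and result - roughly what the overview page shows.
    URL = "https://openqa.fedoraproject.org/api/v1/jobs"
    params = {
        "distri": "fedora",
        "version": "Rawhide",
        "build": "Fedora-Rawhide-20160705.n.0",
    }
    jobs = requests.get(URL, params=params).json()["jobs"]
    for job in sorted(jobs, key=lambda j: j["settings"].get("FLAVOR", "")):
        print("{0:25} {1:35} {2:7} {3}".format(
            job["settings"].get("FLAVOR", "?"), job["test"],
            job["settings"].get("ARCH", "?"), job["result"]))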
Let's go back to the overview for today's Rawhide nightly. You'll see several tables, with titles like 'Flavor: Atomic-boot-iso'. The 'flavors' are basically the different images; we have image-specific tests for several different images. There is also a special 'universal' flavor, which contains tests that can be run (more or less) on any installer image; these are usually run on the Server DVD, but will fall back to another image if that one isn't available.

In each table you'll see a row for each test that's run for that 'flavor', with columns for each arch (we currently only run openQA on i386 and x86_64). Some tests are run on x86_64 with both BIOS and UEFI; the UEFI test has '@uefi' appended to its name. For each arch that each test is run on (not every test runs on every arch) you'll see a colored circle. The color of the circle represents the state or result of the test. Note these colors actually changed a bit with the update today - I'll tell you the new colors, not the old ones:

* Dark blue means the test is scheduled to run but hasn't started yet
* Light blue means it's running right now
* Green means it finished and passed
* Orange means it finished and 'soft failed' (which is roughly like a 'warn' on the wiki - the test basically passed, but ran into a non-fatal bug along the way; e.g. right now the F24->F25 upgrade tests 'soft fail' because they have to pass enforcing=0 to work around https://bugzilla.redhat.com/show_bug.cgi?id=1349721 ;)
* Red means it failed
* Dark red means it couldn't even run at all (usually because we messed up the disk images or something; you should rarely see this in prod)
* Grey means it was skipped for some reason; usually this happens when it depends on another test which failed

Clicking on the circle takes you to the detailed page for that specific test (or 'job' in openQA terms). Let's look at a failed test:

https://openqa.fedoraproject.org/tests/24725

So how do we figure out what went wrong? Well, it helps to know roughly how openQA works. Very simply, what openQA does is run through a sequence of pre-planned actions - key presses and mouse movements - and check every so often that the screen looks the way it should at that point in the process. Every time one of these screen matches passes or fails, it takes a screenshot.

In this view, you see a bunch of thumbnails. A thumbnail with a green surround is a *passed* match. A thumbnail with a red surround is a *failed* match. A thumbnail with a grey surround doesn't represent a match, but was taken for some other reason (openQA takes these 'informational' screenshots every so often as it goes along, under various conditions).

Usually, when you're looking at a failed test, you'll see a red match somewhere. Here we can see it in the _do_install_and_reboot test:

https://openqa.fedoraproject.org/tests/24725#step/_do_install_and_reboot/33

In this case it's pretty obvious what's gone wrong, as the installer's showing an error message. But sometimes it'll be less obvious. The "Candidate needle" drop-down lets you see what openQA was expecting to see at this point and compare it to what it's actually seeing: you can pick any of the 'needles' (the reference screens) that openQA was looking for, and compare them to what's actually on screen.
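For the curious: on disk, a 'needle' is just a reference PNG screenshot plus a small JSON file naming the tags it satisfies and the areas of the screen that have to match. Here's a sketch of the idea in Python - the structure follows the needle format as I understand it, and the tag name is purely illustrative:

    import json

    # A needle's JSON: the tags it can satisfy, and the areas of the
    # reference PNG that must match the live screen.
    needle = json.loads("""
    {
      "tags": ["anaconda_install_complete"],
      "area": [
        {"xpos": 330, "ypos": 400, "width": 300, "height": 60,
         "type": "match"}
      ]
    }
    """)
    # When a test waits for a tag, openQA keeps comparing the current
    # screen against every needle carrying that tag; if none matches
    # before the timeout, the step fails - which is what happened here.
    print(needle["tags"], "-", len(needle["area"]), "area(s) to match")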
In this case openQA was expecting the 'install complete' screen to show up at some point, only it never did, because there was an error installing the bootloader. So it eventually just times out and gives up.

So OK, now we know the install failed at the point of trying to write the UEFI boot loader. Cool! This is already good information. But we can get more.

Look up near the top of the screen and you'll see there are a few tabs on this 'job' view - we're on the Details tab, but there are also Logs & Assets, Settings, Comments and Previous results. Logs & Assets is the really useful one here, so let's go there.

For *any* test that actually managed to run, you'll get a few things. vars.json is the openQA settings variables that were set for this test (I think at the time it failed); it isn't often super useful (mostly for diagnosing broken tests). serial0.txt is the log of the serial output (openQA uses this for various things; it's the main channel for getting analyzable text into and out of the test system). autoinst-log.txt is basically openQA's log of the actual test process; it's very, very verbose and can be hard to read, but it provides all the nitty-gritty details on what openQA was actually *doing* - what screens it was looking for, what it was typing and clicking, where it was moving the mouse, and so on. Most obviously useful is the Video. Yup, for every single test there's a (substantially sped-up) video recording you can watch, which is obviously really useful for figuring out what actually happened.

For some tests - like this one - you'll also find uploaded files from the test system (these are labelled 'Uploaded Logs', but they don't have to be logs; tests can be set to upload *any* file from the test box). Our tests are set up such that when an install test fails, openQA will try to go to a console and upload all the anaconda logs, plus /var/log and /var/tmp (where anaconda crash tracebacks go). So we can actually read the installer logs from the test! In this case I happen to know that program.log is usually the most useful in diagnosing bootloader install failures, so I can go look at it:

https://openqa.fedoraproject.org/tests/24725/file/_do_install_and_reboot-program.log

and down at the bottom we see the actual errors:

06:32:34,412 INFO program: Running... efibootmgr
06:32:34,586 DEBUG program: Return code: -11
06:32:34,587 INFO program: Running... efibootmgr -c -w -L Fedora -d /dev/vda -p 1 -l \EFI\fedora\shim.efi
06:32:34,674 DEBUG program: Return code: -11

Those efibootmgr calls should be returning 0 (return code 0 always means 'success', anything non-0 is bad; a negative number here follows the Python subprocess convention of 'killed by signal', so -11 means efibootmgr died with signal 11, i.e. it segfaulted). That's the problem here.
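Those per-job file URLs are stable, by the way, so you can grab the log with a script just as easily as with a browser. A quick sketch, again in Python - the URL is the real one from above, and the parsing is just based on the log lines we saw, so adjust as needed:

    import re
    import requests

    # Fetch the uploaded program.log for job 24725 and flag every
    # command that exited non-zero.
    log = requests.get(
        "https://openqa.fedoraproject.org/tests/24725/file/"
        "_do_install_and_reboot-program.log").text
    last_cmd = "?"
    for line in log.splitlines():
        if "Running..." in line:
            last_cmd = line.split("Running...")[-1].strip()
        m = re.search(r"Return code: (-?\d+)", line)
        if m and int(m.group(1)) != 0:
            print("exit %s from: %s" % (m.group(1), last_cmd))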
At this point we can use another of the tabs on the detailed job view, the 'Previous results' tab. This is really useful because it often lets you pinpoint exactly when something broke, which is obviously a big help in fixing it. If we look at that tab, we can see the test failed at the same point for each of the last four days, but worked fine on 20160630. So we now know (or at least strongly suspect) that something that changed between Fedora-Rawhide-20160630.n.0 and Fedora-Rawhide-20160701.n.0 is what broke this.

At that point I can go look at the 'Rawhide report' email for Fedora-Rawhide-20160701.n.0:

https://lists.fedoraproject.org/archives/list/test@xxxxxxxxxxxxxxxxxxxxxxx/message/BS5XIER32BKG3BR6KPPZYJKANTT6QJLE/

and hey, look at that, I see this package change:

Package: efivar-0.24-1.fc25
Old package: efivar-0.23-1.fc24
Summary: Tools to manage UEFI variables
RPMs: efivar efivar-devel efivar-libs
Size: 228452 bytes
Size change: 1600 bytes
Changelog:
* Thu Jun 30 2016 Peter Jones <pjones@xxxxxxxxxx> - 0.24-1
- Update to 0.24

That sure sounds like it might be related, huh? So now I can go file a bug that tells the packager:

* UEFI installs started failing on 2016-07-01
* Here are the efibootmgr messages from the log
* This efivar bump sure looks like it might be the cause

...and in fact that's exactly what I did:

https://bugzilla.redhat.com/show_bug.cgi?id=1352680

and once pjones is back from vacation, he'll fix it. :)

There are various other details and things, but that's the basic process of looking at an openQA test and figuring out what went wrong. Please do ask if you have any follow-up questions! The other thing you can interact with openQA for is to actually write or modify the tests it runs, which is a whole other topic :)
-- 
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net