Re: Fedora Rawhide 20160125 compose check report

Adam Williamson <adamwill@xxxxxxxxxxxxxxxxx> · Mon, 25 Jan 2016 15:54:16 -0800

On Mon, 2016-01-25 at 12:21 -0800, Adam Williamson wrote:
> On Mon, 2016-01-25 at 15:57 +0000, Fedora compose checker wrote:
> > Missing expected images:
> > 
> > Kde disk raw armhfp
> > 
> > No images in this compose but not Rawhide 20160124
> > 
> > Images in Rawhide 20160124 but not this:
> > 
> > Cloud_atomic vagrant virtualbox x86_64
> > Cloud_atomic vagrant libvirt x86_64
> 
> Looks like there's some kind of timing issue going on here, causing the
> compose check to get tired of waiting for the openQA tests to complete
> and send the email before they're done. Most likely the compose is
> turning up later than it used to. I'll look into it and make whatever
> adjustments to the timers/timeouts seem necessary, sorry for the
> incomplete reports!

Ah, so unfortunately it's not quite so simple. What actually happened
was a problem while check-compose was waiting for the openQA tests to
complete.

What check-compose does is ask openQA whether all the tests for the
compose being checked are done. If they're not yet, it goes to sleep
for a couple of minutes, wakes up, and asks again. It keeps doing that
until all the tests are done (or it hits a configurable timeout).
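
Very roughly, that loop has the shape of the sketch below. This is not the actual client code: `wait_for_tests`, `all_done` and the default interval and timeout values are made up for illustration, and `all_done` stands in for the real "are all the jobs for this compose finished?" query. The `sleep` and `clock` arguments are only there so the loop can be exercised without really waiting.

```python
import time


def wait_for_tests(all_done, interval=120, timeout=3 * 60 * 60,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until all_done() returns True or the timeout passes.

    all_done is a hypothetical callable standing in for the openQA
    query. interval and timeout are in seconds; the defaults here are
    illustrative, not the real configured values.

    Returns True if the tests finished, False if we timed out.
    """
    deadline = clock() + timeout
    while True:
        if all_done():
            return True
        if clock() >= deadline:
            # Waited long enough; report without the results.
            return False
        # Not done yet: go to sleep for a bit, then ask again.
        sleep(interval)
```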

If it gets some kind of erroneous response when it tries to talk to
openQA, it sleeps 5 seconds and tries again. It will do that up to 5
times; if it hits 5 bad responses in a row, it decides something's
seriously wrong with openQA and gives up waiting. It also gives up
waiting *immediately* if it gets a ConnectionError from python-
requests. If check-compose gives up waiting, it sends out the report
immediately - without the openQA results. That's what happened here:
check-compose woke up and waited for results as usual for 50 minutes or
so, but then it hit some kind of error and gave up waiting. I *think*
it probably hit the ConnectionError case.
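
To make the two failure paths concrete, here is a minimal sketch of that behaviour. It is an illustration under assumptions, not the real client code: `ask_openqa`, `BadResponse` and `GaveUp` are hypothetical names, and the builtin `ConnectionError` is used as a stand-in for `requests.exceptions.ConnectionError` (a different class) so the sketch runs without python-requests or a network.

```python
import time


class BadResponse(Exception):
    """Hypothetical: openQA answered, but with something unusable."""


class GaveUp(Exception):
    """Hypothetical: raised when we stop waiting on openQA."""


def ask_openqa(query, retries=5, delay=5,
               fatal=(ConnectionError,), sleep=time.sleep):
    """Call query(), tolerating a run of bad responses.

    Each bad response is followed by a short sleep and another try;
    after `retries` bad responses in a row we give up. Any exception
    listed in `fatal` makes us give up immediately, with no retry.
    """
    for _ in range(retries):
        try:
            return query()
        except fatal as err:
            # The "bail out immediately" case.
            raise GaveUp("connection error, not retrying") from err
        except BadResponse:
            sleep(delay)
    raise GaveUp("%d bad responses in a row" % retries)
```

With the defaults, a persistently broken server is given about 25 seconds before the caller gets `GaveUp`, while a single connection error ends the wait on the spot.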

Looking at the Apache logs, it seems like the openQA web server stopped
responding to requests briefly around the time of the check-compose
failure. (openQA uses a reverse proxy setup: openQA itself runs a non-
externally-accessible web server process on a non-standard port, and
Apache faces outward and proxies requests for the openQA domain through
to the openQA server.) This...seems to happen occasionally. It used to happen
quite a lot when the openQA server box also ran all the tests - heavy
test load would cause the server to stop responding. It happens much
less now the server VMs aren't also running tests, but it seems like it
did happen last night. It's *possibly* happening when we're saving hard
disk images from completed tests to use as a base for later tests (the
worker box has to upload a rather large disk image file to the openQA
server, which seems like it can cause the server to struggle).

For now I think I'm gonna make the 'wait' code (which is really in the
openQA python client, not in check-compose itself) a bit more forgiving
- have it retry for a bit longer than ~30 seconds, and have it retry on
ConnectionError too, instead of immediately bailing. (I can't find a
very detailed reference on all the situations in which python-requests
raises ConnectionError - the best I can find is "In the event of a
network problem (e.g. DNS failure, refused connection, etc), Requests
will raise a ConnectionError exception.", which is kind
of...lacking...so just treating it as something we might recover from
seems reasonable). Hopefully that'll mitigate this for now.
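
Something along these lines is what I have in mind, as a rough sketch only. `query_with_retries` and its default numbers are made up for illustration, not the values the client will actually use. The builtin `ConnectionError` again stands in for the python-requests one; with the real library you'd pass `requests.exceptions.ConnectionError` (and whatever else counts as a bad response) in `retry_on`.

```python
import time


def query_with_retries(query, retries=20, delay=10,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Call query(), retrying on any exception listed in retry_on.

    Connection errors are treated like any other recoverable failure:
    sleep, then try again. Only after `retries` failed attempts is the
    last error re-raised to the caller.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return query()
        except retry_on as err:
            last_error = err
            # No point sleeping after the final attempt.
            if attempt < retries - 1:
                sleep(delay)
    raise last_error
```

With these illustrative defaults the server gets a bit over three minutes to come back, rather than ~30 seconds, before the caller sees the error.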

In the Glorious Future, of course, openQA sends out a fedmsg when it
completes the tests for a compose, so we don't have to have things sit
around 'waiting' for the tests to complete by yelling "ARE WE THERE
YET?!" at the server every two minutes. Implementation of the Glorious
Future is............scheduled.
-- 
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
