On Mon, 2016-01-25 at 12:21 -0800, Adam Williamson wrote:
> On Mon, 2016-01-25 at 15:57 +0000, Fedora compose checker wrote:
> > Missing expected images:
> >
> > Kde disk raw armhfp
> >
> > No images in this compose but not Rawhide 20160124
> >
> > Images in Rawhide 20160124 but not this:
> >
> > Cloud_atomic vagrant virtualbox x86_64
> > Cloud_atomic vagrant libvirt x86_64
>
> Looks like there's some kind of timing issue going on here, causing the
> compose check to get tired of waiting for the openQA tests to complete
> and send the email before they're done. Most likely the compose is
> turning up later than it used to. I'll look into it and make whatever
> adjustments to the timers/timeouts seem necessary, sorry for the
> incomplete reports!

Ah, so unfortunately it's not quite so simple. What actually happened
was a problem while check-compose was waiting for the openQA tests to
complete.

What check-compose does is it asks openQA if all the tests for the
compose being checked are done. If they're not yet, it goes to sleep
for a couple of minutes, wakes up, and asks again. It keeps doing that
until all the tests are done (or it hits a configurable timeout).

If it gets some kind of erroneous response when it tries to talk to
openQA, it sleeps 5 seconds and tries again. It does that cycle 5
times, but if it hits 5 bad responses in a row, it decides something's
seriously wrong with openQA, and gives up waiting. It also gives up
waiting *immediately* if it gets a ConnectionError from
python-requests.

If check-compose gives up waiting, it sends out the report immediately
- without the openQA results.

That's what happened here: check-compose woke up and waited for
results as usual for 50 minutes or so, but then it hit some kind of
error and gave up waiting. I *think* it probably hit the
ConnectionError case. Looking at the Apache logs it seems like the
openQA web server stopped responding to requests briefly around the
time of the check-compose failure.
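
For illustration, the wait logic described above could be sketched
roughly like this. This is not the actual openQA client code: the names
wait_for_jobs, all_done and BadResponse, and the default timeout, are
made up for the example, and the builtin ConnectionError stands in for
the python-requests exception of the same name. Whether the real code
sleeps after the fifth bad response is also a guess.

```python
import time


class BadResponse(Exception):
    """Hypothetical stand-in for an erroneous reply from openQA."""


def wait_for_jobs(all_done, poll_interval=120, timeout=7200,
                  retry_wait=5, max_bad=5, sleep=time.sleep):
    """Return True once all_done() reports True, False if we give up.

    all_done is a hypothetical callable that asks openQA whether every
    test for the compose is finished. It returns a bool, raises
    BadResponse on an erroneous reply, or raises ConnectionError
    (builtin, used here as a stand-in for the python-requests one).
    The timeout default is an assumption, not the real value.
    """
    waited = 0  # seconds slept so far; time spent in requests is ignored
    bad = 0     # consecutive erroneous responses
    while waited < timeout:
        try:
            done = all_done()
        except ConnectionError:
            # Behaviour described above: give up immediately.
            return False
        except BadResponse:
            bad += 1
            if bad >= max_bad:
                # Five bad responses in a row: assume openQA is broken.
                return False
            sleep(retry_wait)
            waited += retry_wait
            continue
        bad = 0  # a good response resets the streak
        if done:
            return True
        sleep(poll_interval)
        waited += poll_interval
    # Hit the overall timeout without the tests finishing.
    return False
```

Passing the sleep function in as a parameter is only there so the loop
can be exercised without actually waiting; a False return is the case
where the report goes out without the openQA results.
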
(openQA uses a reverse proxy setup; openQA itself runs a
non-externally-accessible web server process on a non-standard port,
Apache faces outward and proxies requests for the openQA domain through
to the openQA server).

This...seems to happen occasionally. It used to happen quite a lot when
the openQA server box also ran all the tests - heavy test load would
cause the server to stop responding. It happens much less now the
server VMs aren't also running tests, but it seems like it did happen
last night. It's *possibly* happening when we're saving hard disk
images from completed tests to use as a base for later tests (the
worker box has to upload a rather large disk image file to the openQA
server, which seems like it can cause the server to struggle).

For now I think I'm gonna make the 'wait' code (which is really in the
openQA python client, not in check-compose itself) a bit more forgiving
- have it retry for a bit longer than ~30 seconds, and have it retry on
ConnectionError too, instead of immediately bailing. (I can't find a
very detailed reference on all the situations in which python-requests
raises ConnectionError - the best I can find is "In the event of a
network problem (e.g. DNS failure, refused connection, etc), Requests
will raise a ConnectionError exception.", which is kind of...lacking...so
just treating it as something we might recover from seems reasonable).
Hopefully that'll mitigate this for now.

In the Glorious Future, of course, openQA sends out a fedmsg when it
completes the tests for a compose, so we don't have to have things sit
around 'waiting' for the tests to complete by yelling "ARE WE THERE
YET?!" at the server every two minutes. Implementation of the Glorious
Future is............scheduled.
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net