Hi all,

Below are my notes on the Autocloud update that happened last week. A few things went badly and a few worked well. I am listing them here, along with how we plan to make sure the bad things never happen again. I am CCing Patrick on this mail, as he was the contact point from Fedora Infrastructure (read: he did all the work).

## deployment using ansible to the newly installed systems

We only had to change yum to dnf and remove the old hotfix part from the roles. The current playbooks and roles seem to be stable, and can help us in the future too.

## fedmsg-hub on the backends double enqueue

We had to restart fedmsg-hub a few times on the backend servers. In between, we also found that the fedmsg-hub service was happily enqueueing the jobs twice (for each compose). It then got fixed automagically; nothing was changed in our configuration or in code. We are still not sure why this happened, but we are digging further into it.

## fedmsg-hub broke due to a faulty dependency

Even though we kept our code up and running for weeks, after the production deployment we found that one of the dependencies (fedfind; adamw is the upstream author) broke on the Fedora Atomic image names, causing our fedmsg-hub instances to go crazy. We informed upstream and got a hotfix deployed within a few hours of finding the issue. For the next release we will make sure to keep it running for longer on our internal hardware with messages from production fedmsg. This dependency failure was something we should have caught, but did not.

## missing fedmsg(s) from autocloud on completing the testing of a compose

This was a known part of the whole development+release cycle. Sayan had submitted the patch [1], but there was some slight miscommunication. For my part, I failed to track the state of this dependency. For the next release, I will make a release/deployment checklist for autocloud and get it validated by everyone involved.
Most probably we will add too many minor details to this checklist, but that will help us keep track of things in any future deployment. Sayan is currently working to get that particular change into production so that we can send out the fedmsg(s) Adam requires.

## missing package dependency causing missing bridge on libvirt backend

We also found that a missing link in the dependency chain caused a missing virtual bridge in the libvirt backend. Patrick helped find that adding libvirt as a dependency for that particular box will fix the issue in future. We should test on clean installations while developing next time to make sure this is not repeated. We should also think about improving the stage environment.

## the new webfrontend is better

Sayan did a good job of making the new webfrontend. We can now point to the exact failures [2]. In future we should try to get more input about the features of the webfrontend. Even though the whole service is made for automation, this frontend helps us find and point to the right issues found in the tests.

## autocloud+tunir did what they are supposed to do

After fixing the hiccups, the autocloud service is doing what it is supposed to do: testing the images. We will put our effort into better test coverage in the coming months to take advantage of this new deployment.

Please comment/suggest whatever you think about the work. This will help us improve in future releases.

[1] https://github.com/fedora-infra/fedmsg_meta_fedora_infrastructure/pull/386
[2] https://apps.fedoraproject.org/autocloud/jobs/66/output#290

Kushal
--
Fedora Cloud Engineer
CPython Core Developer
https://kushaldas.in
https://dgplug.org
_______________________________________________
infrastructure mailing list
infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
https://lists.fedoraproject.org/admin/lists/infrastructure@xxxxxxxxxxxxxxxxxxxxxxx