Stable release preparation temporarily stalled

Hi,

The stable releases (hammer, infernalis) have not made progress in the past few weeks because we cannot run tests.

Before xmas the following happened:

* the sepia lab was migrated and we discovered the OpenStack teuthology backend can't run without it (that was only a problem for a few days)
* there are OpenStack-specific failures in each teuthology suite and it is non-trivial to separate them from genuine backport errors
* the make check bot went down (it was partially running on my private hardware)

If we just wait, I'm not sure when we will be able to resume our work because:

* the sepia lab is back but has less horsepower than it did
* not all of us have access to the sepia lab
* the make check bot is being worked on by the infrastructure team but it is low priority and it may take weeks before it's back online
* the ceph-qa-suite errors that are OpenStack-specific are low priority and may never be fixed

I think we should rely on the sepia lab for testing for the foreseeable future and wait for the make check bot to come back. Tests will take a long time to run, but we've worked with a one-week delay before, so it's not a blocker.

Although fixing the OpenStack-specific errors would allow us to use the teuthology OpenStack backend (I will fix the last remaining error in the rados suite), it is unrealistic to make that a requirement for running tests: we have neither the workforce nor the skills to do it. Hopefully, some time in the future, Ceph developers will use ceph-qa-suite on OpenStack as part of the development workflow. But right now, running the ceph-qa-suite suites on OpenStack is outside the development workflow and in a state of continuous regression, which is inconvenient for us because we need a stable baseline against which to compare the runs from the integration branch.
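To make that last point concrete, here is a minimal sketch of the comparison we need the baseline for; the job names and result sets below are hypothetical, not taken from an actual run:

# Minimal sketch of the comparison a stable baseline makes possible; the
# job names below are made up for illustration.

def new_failures(baseline_failed, branch_failed):
    """Return failures introduced by the integration branch.

    Only meaningful if baseline_failed comes from a stable reference run:
    if the baseline itself keeps regressing, the difference mixes
    environment problems with genuine backport errors.
    """
    return set(branch_failed) - set(baseline_failed)

# Hypothetical example: one known environment failure plus one new failure.
baseline_failed = {'rados/thrash/openstack-network-flake'}
branch_failed = {'rados/thrash/openstack-network-flake',
                 'rados/singleton/osd-crash'}
print(new_failures(baseline_failed, branch_failed))
# -> {'rados/singleton/osd-crash'}, the only failure worth a backporter's time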

Fixing the make check bot is a two-part problem. Each failed run must be looked at to chase false negatives (continuous integration with false negatives is a plague), which I did on a daily basis over the past year and am happy to keep doing. Before the xmas break, the bot running at jenkins.ceph.com produced over 90% false negatives, primarily because it was trying to run on unsupported operating systems, and it was stopped until this is fixed. It also appears that the machine running the bot is not re-imaged after each test, meaning a buggy run may taint all future tests and create a continuous flow of false negatives. Addressing these two issues requires knowing, or learning about, the Ceph jenkins setup and slave provisioning. That is probably a few days of work, which is why the infrastructure team can't resolve it immediately.
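For the first part, the daily triage boils down to separating failures caused by the environment from failures caused by the code. A minimal sketch of that filter, assuming a hypothetical run-record format and supported-distro list (neither is taken from the actual jenkins.ceph.com setup), could look like this:

# Hypothetical triage helper: the run-record format and the supported-distro
# list are assumptions for illustration only.

SUPPORTED_DISTROS = {'ubuntu-14.04', 'centos-7.1'}  # assumed

def triage(runs):
    """Split failed runs into likely false negatives and runs worth reviewing.

    runs is an iterable of dicts such as
    {'id': 1, 'status': 'failed', 'distro': 'opensuse-13.2'}.
    """
    false_negatives, to_review = [], []
    for run in runs:
        if run['status'] != 'failed':
            continue
        if run['distro'] not in SUPPORTED_DISTROS:
            # A failure on an unsupported OS says nothing about the branch
            # under test: count it as a false negative.
            false_negatives.append(run)
        else:
            to_review.append(run)
    return false_negatives, to_review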

If you have alternative creative ideas on how to improve the current situation, please speak up :-)

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
