On 06/15/2015 04:19 PM, Kaushal M wrote:
> Hi all,
>
> The recent rush of reviews being sent due to the release of 3.7 was a
> cause of frustration for many of us because of the regression tests
> (gerrit troubles themselves are another thing).
>
> W.R.T. regression, the 3 main sources of frustration were,
> 1. Spurious test failures
> 2. Long wait times
> 3. Regression slave troubles
>
> We've already tackled the spurious failure issue and are quite stable
> now. The trouble with the slave VMs is related to the gerrit issues,
> and is mainly due to the network issues we are having between the
> data-centers hosting the slaves and gerrit/jenkins. People have been
> looking into this, but we haven't had much success. This leaves the
> issue of the long wait times.
>
> The long wait times are because of the long queues of pending jobs,
> some of which take days to get scheduled. Two things cause the long
> queues,
> 1. Automatic regression job triggering for all submissions to gerrit
> 2. Long run time for regression (~2h)
>
> The long queues, coupled with the spurious failures and network
> problems, meant that jobs would fail for no reason after a long wait,
> and would have to be added to the back of the queue to be re-run. This
> meant that developers would have to wait days for their changes to get
> merged, and was one of the causes for the delay in the release of 3.7.
>
> The solution is to reduce wait times for regression runs. To reduce
> wait times we should,
> 1. Trigger runs only when required
> 2. Reduce regression run time.
>
> Raghavendra Talur (rtalur/RaSTar) will soon send out a mail with his
> findings on the regression run times, and we can continue discussion
> on it on that thread.
>
> Earlier, the regression runs used to be manually triggered by the
> maintainers once they had determined that a change was ready for
> submission. But as there were only two maintainers before (Vijay and
> Avati), auto triggering was brought in to reduce their load.
> Auto triggering worked fine when we had a lower volume of changes being
> submitted to gerrit. But now, with the large volumes we see during the
> release freeze dates, auto triggering just adds to problems.
>
> I propose that we move back to the old model of starting regression
> runs only once the maintainers are ready to merge. But instead of the
> maintainers manually triggering the runs, we could automate it.
>
> We can model our new workflow on those of OpenStack[1] and
> Wikimedia[2]. The existing Gerrit plugin for Jenkins doesn't provide
> the features necessary to enable selective triggering based on Gerrit
> flags. Both OpenStack and Wikimedia use a project gating tool called
> Zuul[3], which provides a much better integration with Jenkins and
> Gerrit and more features on top.
>
> I propose the following workflow,
>
> - Developer pushes change to Gerrit.
> - Zuul is notified by Gerrit of the new change.
> - Zuul runs pre-review checks on Jenkins. These will be the current
>   smoke tests.
> - Zuul reports back the status of the checks to Gerrit.
>     - If the checks fail, the developer will need to resend the change
>       after the required fixes. The process starts once more.
>     - If the checks pass, the change is now ready for review.
> - The change is now reviewed by other developers and maintainers.
>   Non-maintainers will be able to give only a +1 review.
>     - On a negative review, the developer will need to rework the
>       change and resend it. The process starts once more.
> - The maintainer gives a +2 review once he/she is satisfied. The
>   maintainer's work is done here.
> - Zuul is notified of the +2 review.
> - Zuul runs the regression runs and reports back the status.
>     - If the regression runs fail, the process starts over again.
>     - If the runs pass, the change is ready for acceptance.
> - Zuul will pick the change into the repository.
>     - If the pick fails, Zuul will report back the failure, and the
>       process starts once again.
>
> Following this flow should,
> 1. Reduce regression wait time
> 2. Improve change acceptance time
> 3. Reduce unnecessary wastage of infra resources
> 4. Improve infra stability.
>
> It also brings in the drawback that we need to maintain one other
> piece of infra (Zuul). This would be an additional maintenance overhead
> on top of Gerrit, Jenkins and the current slaves. But I feel the
> reduction in the upkeep efforts of the slaves would be enough to
> offset this.
>
> tl;dr
> Current auto-triggering of regression runs is stupid and a waste of
> time and resources. Bring in a project gating system, Zuul, which can
> do much more intelligent job triggering, and use it to
> automatically trigger regression only for changes with Reviewed+2 and
> automatically merge ones that pass.
>
> What does the community think of this?

+1

>
> ~kaushal
>
> [1]: http://docs.openstack.org/infra/manual/developers.html#automated-testing
> [2]: https://www.mediawiki.org/wiki/Continuous_integration/Workflow
> [3]: http://docs.openstack.org/infra/zuul/
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-devel
>

--
~Atin