Postmortem for Jenkins Outage on 20/07/18

Hello folks,

I had to take down Jenkins for some time today. The server ran out of space and was silently ignoring Gerrit requests for new jobs. If you think one of your jobs needed a smoke or regression run and it wasn't triggered, this is the root cause. Please retrigger your jobs.

## Summary of Impact
Jenkins jobs were intermittently not triggered over the last couple of days. At the moment, we do not have numbers on how many developers were affected. The rotation rules we have in place mitigated this slightly every day, so issues showed up only around evening IST, when we retrigger our regular nightly jobs.

## Timeline of Events
July 19 evening: I noticed that Jenkins would occasionally not trigger a job for a push. This was on the build-jobs repo. I chalked it up to a signal getting lost in the noise and decided to debug later. I could trigger the job manually, so I put it down as something to look at in the morning. On the morning of July 20, jobs were getting triggered as they should and I could not notice anything untoward.

July 20 6:41 pm: Kotresh pinged me asking if there was a problem. I could see the problem I had noticed the day before in his job. This time a manual trigger did not work. Around the same time, Raghavendra Gowdappa hit the same problem. I logged into the server and found that the Jenkins partition was out of space.

July 20 7:40 pm: Jenkins is back online completely. Retriggers of the two failing jobs have been successful.

## Root Cause
* Out of disk space on the Jenkins partition on build.gluster.org
* The bugzilla-post job did not delete old runs, and we had about 7000 of them in there consuming about 20G of space.
* The clang-scan job consumes about 1G per run, and we were storing about 30 days' worth of archives.

## Resolution
* All centos6-regression jobs are now deleted. We moved over to centos7-regression a while ago.
* We now only store 7 days of archives for bugzilla-post and clang-scan jobs
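
As a rough illustration of what the shorter retention buys for clang-scan, the arithmetic can be sketched as below. The 1G per run figure is from the root cause above; the assumption of one run per day is mine and is not stated anywhere in this mail, so treat the result as an order-of-magnitude estimate only.

```python
# Rough estimate only. RUNS_PER_DAY is an assumption, not a number from this postmortem.
GB_PER_CLANG_SCAN_RUN = 1   # "about 1G per job", from the root cause
RUNS_PER_DAY = 1            # assumption: clang-scan runs once a day


def archive_footprint_gb(days_kept, gb_per_run=GB_PER_CLANG_SCAN_RUN,
                         runs_per_day=RUNS_PER_DAY):
    """Approximate disk space used by archives kept for `days_kept` days."""
    return days_kept * runs_per_day * gb_per_run


before = archive_footprint_gb(30)  # old retention
after = archive_footprint_gb(7)    # new retention
print(f"clang-scan archives: ~{before}G at 30 days, ~{after}G at 7 days")
print(f"approximate space freed: ~{before - after}G")
```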

## Future Recommendations
* Our monitoring did not alert us that the disk was filling up on the Jenkins node. Ideally, we should have gotten a warning by the time the disk was 90% full, so we could plan for additional capacity or look for mistakes in usage patterns.
* All jobs need to have a property that discards old runs, keeping a maximum of 90 days in case it's absolutely needed. This is currently not enforced by CI, but we plan to enforce it in the future.
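
A minimal sketch of what these two checks could look like, using only the Python standard library. This is not our actual monitoring or CI code. The function names, the shape of the `jobs` mapping, and the job name `hypothetical-job` are made up for illustration; only the 90% threshold and the 90-day maximum come from the recommendations above.

```python
import shutil

WARN_PERCENT = 90       # warning threshold from the first recommendation
MAX_DAYS_TO_KEEP = 90   # retention ceiling from the second recommendation


def percent_used(total, used):
    """Percentage of a partition that is in use."""
    return 100.0 * used / total


def disk_needs_warning(path, threshold=WARN_PERCENT):
    """Return True if the partition holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.total, usage.used) >= threshold


def jobs_violating_retention(jobs, max_days=MAX_DAYS_TO_KEEP):
    """Return the names of jobs with no discard setting or one above `max_days`.

    `jobs` is a hypothetical mapping of job name -> days of runs kept,
    with None meaning the job never discards old runs.
    """
    return sorted(
        name for name, days in jobs.items()
        if days is None or days > max_days
    )


# Example with illustrative data; "hypothetical-job" is not a real job.
example_jobs = {"bugzilla-post": 7, "clang-scan": 7, "hypothetical-job": None}
print("disk warning needed:", disk_needs_warning("."))
print("jobs violating retention:", jobs_violating_retention(example_jobs))
```

A check like the first would need to run periodically against the Jenkins partition, and the second could run against the job definitions before they are merged. Both of those wiring choices are assumptions on my part, not something we have decided.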

--
nigelb
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel
