Hi,

so we had 3 incidents in the last 24h, and while all of them are different, they are also linked.

We faced several issues, starting with gerrit showing error 500 last night, around 23h Paris time. That was https://bugzilla.redhat.com/show_bug.cgi?id=1620243 , and it resulted in a memory upgrade this morning.

Then we started to look at other issues that were uncovered while investigating the first one, and I looked at the size of the mail queue. Usually, this is not a problem, but after adding swap, it did become an issue.

So I started to look for a way to blacklist mail sent to jenkins@xxxxxxxxxxxxxxxxx, first by routing this mail domain to supercolony, then by changing postfix to drop the mail.

And then we got 2 issues at once. The timeline below is in UTC.

Timeline
--------

13:42 misc adds an MX for build.gluster.org in the zone. To do that, the DNS zone had to be changed, since build.gluster.org could no longer be a CNAME.
14:56 kaleb pings misc/nigel saying "there is a message about disk full on that job"
15:00 misc clicks on the link to build.gluster.org and is greeted by an SSL error about the certificate. It seems the DNS now resolves build.gluster.org to 2 IPs instead of 1
15:04 misc reverts the DNS change, because there is no time to investigate.
15:05 misc figures out the server has a full disk because the logs are stored on /
15:07 misc also starts to swear in 2 languages
15:18 a new partition with more space is created on http.int.rht.gluster.org, data is copied, httpd is restarted, situation is back to normal

Impact:
- some build logs were lost (likely not many)
- for about 1h, some people could have been randomly directed to the wrong server when going to build.gluster.org

Root cause:
- for DNS, a wrong commit. The syntax looked correct (and was verified), so I need to check why it did more than required.
- for the full disk, an increase in patches and an oversight in that server's installation.
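
For context on the 13:42 entry: DNS does not allow a CNAME to coexist with other record types at the same name, which is why adding the MX meant build.gluster.org had to stop being a CNAME. A minimal sketch of a check for that rule follows, in Python. The function name find_cname_conflicts, the (name, type) input format and the example.org records are all made up for illustration. This is not the tooling used for the gluster.org zone, and it would not by itself explain why the commit did more than required.

```python
from collections import defaultdict


def find_cname_conflicts(records):
    """Return names that hold a CNAME together with any other record type.

    `records` is an iterable of (name, rtype) pairs, for example
    ("build.example.org", "MX"). Simplification: the DNSSEC record types
    that are allowed next to a CNAME are not special-cased here.
    """
    types_by_name = defaultdict(set)
    for name, rtype in records:
        # DNS names are case-insensitive; a trailing dot only marks the
        # name as fully qualified, so normalise both away.
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())

    conflicts = {}
    for name, rtypes in types_by_name.items():
        if "CNAME" in rtypes and len(rtypes) > 1:
            # Report the record types that clash with the CNAME.
            conflicts[name] = sorted(rtypes - {"CNAME"})
    return conflicts


# Hypothetical records, not the real gluster.org zone.
zone = [
    ("build.example.org", "CNAME"),
    ("build.example.org", "MX"),
    ("www.example.org", "A"),
]
print(find_cname_conflicts(zone))
```

A check like this could run before a zone change is committed, but it only covers this one rule; it says nothing about a name ending up with more addresses than intended.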

Resolution:
- the DNS change got reverted
- a new partition was added and the data was copied

What went well:
- we were able to resolve the issue quickly thanks to automation

Where we were lucky:
- the issue got detected fast by the same person who made the change (DNS), and people (Kaleb) notified us as soon as something seemed weird (disk)
- none of us were in Vancouver facing a measles outbreak

What went badly:
- still no monitoring

Potential improvements to make:
- add monitoring
- revise resource usage
- prepare a template for post mortems

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel