Post mortem of 2018-08-23 (2 for the price of one)

Michael Scherer <mscherer@xxxxxxxxxx> · Thu, 23 Aug 2018 23:05:41 +0200

Hi,

So we had 3 incidents in the last 24h, and while all of them are
different, they are also linked.

We faced several issues, starting with gerrit returning error 500
last night, around 23:00 Paris time.

That was https://bugzilla.redhat.com/show_bug.cgi?id=1620243 , and it
resulted in a memory upgrade this morning.

Then we started to look at other issues that were uncovered while
investigating the first, and I looked at the size of the mail
queue. Usually, this is not a problem, but after adding swap, it
became an issue.

So I started to look for a way to blacklist mail sent to
jenkins@xxxxxxxxxxxxxxxxx, first by routing this mail domain to
supercolony, then by changing postfix to drop the mail.

And then we got 2 issues at once (the timeline below is in UTC).

Timeline
--------

13:42  misc adds an MX record for build.gluster.org in the zone. To do
that, the DNS zone had to be changed: a name with a CNAME cannot carry
other records, so build.gluster.org could no longer be a CNAME.

14:56  kaleb pings misc/nigel, saying "there is a message about disk
full on that job"

15:00  misc clicks on the link to build.gluster.org and is greeted by
an SSL certificate error. It seems the DNS now resolves
build.gluster.org to 2 IPs instead of 1

15:04  misc reverts the DNS change, because there is no time to
investigate.

15:05  misc figures out that the server has a full disk because the
logs are stored on /

15:07  misc also starts to swear in 2 languages

15:18  a new partition with more space is created on
http.int.rht.gluster.org, data is copied, httpd is restarted, and the
situation is back to normal
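
For reference, the constraint behind the 13:42 change is that a name
holding a CNAME cannot hold other data records (DNSSEC records aside),
which is why adding the MX meant dropping the CNAME. Below is a minimal
sketch of a check for that rule, in Python. The (name, type, value)
tuples and the cname_conflicts helper are hypothetical, not part of our
tooling, and the example.org records are placeholders rather than our
real zone.

```python
from collections import defaultdict

# Record types that may sit next to a CNAME (DNSSEC metadata).
ALLOWED_WITH_CNAME = {"CNAME", "RRSIG", "NSEC"}


def cname_conflicts(records):
    """Return the names that have a CNAME plus another record type.

    `records` is an iterable of (name, rtype, value) tuples; this layout
    is an assumption for the example, not a real zone file parser.
    """
    types_by_name = defaultdict(set)
    for name, rtype, _value in records:
        # Normalise case and the trailing dot so names compare equal.
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())

    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and types - ALLOWED_WITH_CNAME
    )


# Placeholder data, made up for illustration.
zone = [
    ("build.example.org", "CNAME", "http.example.org."),
    ("build.example.org", "MX", "10 mail.example.org."),
    ("www.example.org", "CNAME", "http.example.org."),
]
print(cname_conflicts(zone))
```

Run against a parsed zone before committing, a check like this would
only flag the CNAME/MX clash itself. It says nothing about why the name
ended up with 2 addresses.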

Impact:
- some build logs were lost (likely not much)
- for 1h, some people could have been randomly directed to the wrong
server when going to build.gluster.org

Root cause:
- for DNS, a wrong commit. The syntax did look correct (and was
verified), so I need to check why it did more than required.

- for the disk full, an increase in patches and an oversight in that
server's installation.
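
One way to check why the DNS commit "did more than required" is to
compare the record set before and after the change against what was
intended. The sketch below assumes both zone versions are already
available as (name, type, value) tuples. The helper names and the
example.org / 192.0.2.x data are hypothetical and do not describe what
actually happened in our zone.

```python
def record_changes(before, after):
    """Return (added, removed) records between two zone snapshots."""
    before_set, after_set = set(before), set(after)
    return sorted(after_set - before_set), sorted(before_set - after_set)


def unexpected_changes(before, after, intended):
    """List the changes that were not part of the intended commit."""
    added, removed = record_changes(before, after)
    intended = set(intended)
    return [rec for rec in added + removed if rec not in intended]


# Placeholder snapshots, made up for illustration.
before = [("build.example.org", "CNAME", "http.example.org.")]
after = [
    ("build.example.org", "MX", "10 mail.example.org."),
    ("build.example.org", "A", "192.0.2.10"),
    ("build.example.org", "A", "192.0.2.20"),
]
# What the commit was supposed to touch, additions and removals alike.
intended = [
    ("build.example.org", "CNAME", "http.example.org."),
    ("build.example.org", "MX", "10 mail.example.org."),
    ("build.example.org", "A", "192.0.2.10"),
]
print(unexpected_changes(before, after, intended))
```

Anything left in the result is a side effect worth looking at before
the zone is pushed.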

Resolution:
- dns got reverted
- new partition was added and data were copied

What went well:
- we were quickly able to resolve the issue thanks to automation

Where we were lucky:
- the issue got detected fast by the same person who made the change
(DNS), and people (Kaleb) notified us as soon as something seemed weird
(disk)
- none of us were in Vancouver facing a measles outbreak

What went badly:
- still no monitoring

Potential improvement to make:
- add monitoring
- revise resource usage
- prepare a template for post mortems
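
As a starting point for the monitoring item, here is a minimal sketch
of the two checks that would likely have caught today's issues: how
full a filesystem is, and how many addresses a name resolves to. It
uses only the Python standard library. The function names, the 90%
limit, the IPv4-only lookup and the "exactly 1 address" expectation are
assumptions for the example, not settings we have agreed on.

```python
import shutil
import socket


def disk_usage_percent(path="/"):
    """Return how full the filesystem holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def resolved_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses for `hostname`."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


def find_problems(disk_percent, addresses, max_disk_percent=90.0,
                  expected_addresses=1):
    """Turn raw measurements into a list of human-readable alerts."""
    problems = []
    if disk_percent > max_disk_percent:
        problems.append(f"disk is {disk_percent:.0f}% full "
                        f"(limit {max_disk_percent:.0f}%)")
    if len(addresses) != expected_addresses:
        problems.append(f"expected {expected_addresses} address(es), "
                        f"got {len(addresses)}: {', '.join(addresses)}")
    return problems
```

Calling find_problems(disk_usage_percent("/"),
resolved_ipv4("build.gluster.org")) from cron on the web server, and
sending mail when the list is not empty, would be one cheap way to
start. A real monitoring system remains the better long-term answer.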

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
