Date: 20 November 2018

Participating people:
- misc
- obnox

Summary:

Our automated certificate renewal system failed to renew the docs.gluster.org certificate, resulting in an expired certificate for around 6h. Our monitoring system did not detect the problem.

Impact:

Some people had to accept an insecure certificate to read the website.

Root cause:

On the monitoring side, it seems that "something" broke alerting. However, after a restart and testing, it seems to be working fine now. In addition, the default configuration does not seem to verify that the certificate is about to expire; it only verifies that port 443 is open and that an SSL session can be negotiated.

On the certificate renewal side, everything is handled by ansible, and we do an automated run every night. A manual run didn't show any error, so my analysis points toward a failure of the automation.

Looking at ant-queen, our deploy server, it seems that an issue on 2 internal builders (builder1 and builder31) created a deadlock when ansible tried to connect, and for some reason the connection didn't time out. In turn, this resulted in several processes waiting on those 2 servers. Looking at the graph, we can see the problem started around 1 week ago:

https://munin.gluster.org/munin/int.rht.gluster.org/builder1.int.rht.gluster.org/users.html

Since our system only triggers a renewal if the certificate is going to expire within 1 week, this meant the process did not attempt a renewal for more than 1 week, and so the certificate expired.

A quick look at builder1 and builder31 shows that the issue is likely due to regression testing. The 'df' command is blocked on builder1, and that's usually a sign of "something went wrong with the test suite". A look at the existing processes hints at the gd2 test suite, since etcd2 is still running, along with glusterfsd processes.

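To illustrate the gap on the monitoring side, here is a minimal sketch of a check that looks at the certificate expiry date instead of only checking that the TLS handshake works. This is not our actual nagios configuration. The function names and the 5 day warning threshold are assumptions made for the example; the threshold is picked to sit below the 1 week renewal window, so a warning means the nightly renewal has already been failing for a couple of days.

```python
import socket
import ssl
from datetime import datetime, timezone

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def fetch_not_after(host, port=443, timeout=10.0):
    """Return the 'notAfter' string of the certificate served by host:port.

    The socket timeout keeps the check itself from hanging forever on a
    stuck server.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_left(not_after, now):
    """Days between `now` (timezone-aware datetime) and the expiry date."""
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400.0


def check_status(not_after, now, warn_days=5.0):
    """Map the remaining validity to a Nagios-style status code.

    warn_days is a hypothetical threshold, not a value from our setup.
    """
    remaining = days_left(not_after, now)
    if remaining <= 0:
        return CRITICAL
    if remaining < warn_days:
        return WARNING
    return OK
```

A caller would run something like check_status(fetch_not_after("docs.gluster.org"), datetime.now(timezone.utc)) and exit with the returned code. Note that with the default context, the handshake itself raises ssl.SSLCertVerificationError once the certificate has expired, so a real check would likely catch that and report CRITICAL as well.
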
Resolution:
- misc ran the process manually, and the certificate got renewed
- misc restarted nagios, and alerts started to work
- misc went on a process cleaning spree, unlocking an achievement on Steam by stopping 70 of them in 1 command

What went well:
- people contacted us
- only 1 certificate got impacted

When we were lucky:
- this only impacted docs.gluster.org, and a user workaround existed

What went bad:
- supervision didn't page anyone

Timeline (in UTC):
05:00 the certificate expires
09:30 misc decides to go to the office
09:50 misc arrives at the train station and gets on the train, then connects to irc just in case
10:01 obnox pings misc on irc
10:02 misc says crap, takes a look, confirms the issue
10:05 misc connects to ant-queen, runs the deploy script after checking the 2 proxies are ok
10:07 misc sees that the certificate got renewed and inspects ant-queen, sees a bunch of processes blocked on 2 servers
10:08 entering a tunnel, misc declares the issue fixed and will look once in the office

Potential improvements to make:
- our supervision should check certificate validity (should be easy)
- our supervision should also verify that we do not have something weird on ant-queen (less easy)
- whatever caused nagios to fail should be investigated and mitigated
- whatever caused ansible to fail should be investigated and mitigated
- our gd2 test suite should clean up after itself in a more reliable way

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS