Date: 23 October 2018

Participating people:
- misc
- nigelb
- kaleb

Summary:
A DNS change moving the download.gluster.org vhost from a server in Rackspace to a reverse proxy resulted in a redirection loop, breaking the service for some people for about 30 minutes.

Impact:
Some people using download.gluster.org.

Root cause:
download.gluster.org was served as a vhost out of download01.rax.gluster.org. In order to prepare for a move out of Rackspace, the vhost was moved to the set of reverse proxies, as this permits a quick and painless switch from one server to another without the delay and unreliability of a DNS update. The move was scheduled for 10:00 UTC, but happened at 10:30 due to various unrelated issues.

While the move had been tested the day before, the SSL redirection issue was missed: download01.rax.gluster.org was forcing SSL, but the proxy was connecting to it over plain HTTP, so every request was redirected back to https and went through the proxy again, resulting in a loop that affected Firefox and curl.

While looking at that issue, it was also found that the proxies, despite being IPv4-only hosts, were still trying to connect to the backend server over IPv6 from time to time, resulting in delays and/or errors for clients. A quick fix was tried, but it did not solve the problem, so the IPv6 record was removed from DNS.

Resolution:
- disable the SSL redirection on the backend server
- disable IPv6 on the proxy, and remove the IPv6 address

What went well:
- nginx coped with the redirection loop quite well, cf the graph:
  https://munin.gluster.org/munin/rht.gluster.org/proxy02.rht.gluster.org/fw_conntrack.html
- the problem was quickly identified

When we were lucky:
- no one was in the office, and I didn't take part in the party in the other office, so my lunch break could go to fixing this

What went bad:
- monitoring didn't detect anything, so I didn't get paged
- our logs are full of hits on deprecated URLs, which makes it harder to notice a real issue

Timeline (in UTC):
11:52 nigelb and kkeithley ping misc on IRC, and also send him an email
12:08 misc sees he got contacted and looks at the issue
12:14 the issue is identified and a quick fix is tested
12:18 the fix is properly deployed and committed in the repo
12:25 while watching the logs to check that all is well, an IPv6 issue is identified
12:26 a quick fix is tested for IPv6, which reduces errors, but not by much
12:29 the IPv6 record is removed for the backend server

Potential improvements to make:
- nginx logs could be improved, work is on its way for that
- monitoring should be able to detect this kind of incident
- misc should have been clearer on how to contact him in case of issue
- lots of people are using deprecated URLs and locations, which is not great
- get IPv6 working

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
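
PS: as an illustration of the "monitoring should be able to detect this kind of incident" point, here is a minimal sketch (Python, standard library only) of a check that would have caught both failure modes: a redirect loop on the vhost, and IPv6 being published for a backend that only answers over IPv4. The hostname, limits and check names are placeholders, not our actual monitoring setup, so treat it as a starting point rather than something ready to deploy.

    #!/usr/bin/env python3
    # Sketch of a monitoring check for the two failure modes seen in this
    # incident: a redirect loop on a vhost, and IPv6 published for a host
    # that is only reachable over IPv4. Hostname and limits are placeholders.
    import http.client
    import socket

    HOST = "download.gluster.org"
    MAX_REDIRECTS = 5

    def check_redirect_loop(host, path="/", use_tls=True):
        """Follow redirects by hand; report a loop if the same Location
        shows up twice or we exceed MAX_REDIRECTS."""
        seen = set()
        url_host, url_path = host, path
        for _ in range(MAX_REDIRECTS):
            conn_cls = (http.client.HTTPSConnection if use_tls
                        else http.client.HTTPConnection)
            conn = conn_cls(url_host, timeout=10)
            conn.request("HEAD", url_path)
            resp = conn.getresponse()
            status = resp.status
            location = resp.getheader("Location", "")
            conn.close()
            if status not in (301, 302, 303, 307, 308):
                return True   # final answer, no loop
            if location in seen:
                return False  # same redirect target twice: loop
            seen.add(location)
            if "://" in location:
                use_tls = location.startswith("https://")
                url_host, _, tail = location.split("://", 1)[1].partition("/")
                url_path = "/" + tail
            else:
                url_path = location  # relative redirect, same host and scheme
        return False              # too many hops, treat as a loop

    def check_ipv4(host, port=443):
        """Check that the host resolves over IPv4 (AF_INET)."""
        try:
            return bool(socket.getaddrinfo(host, port, socket.AF_INET,
                                           socket.SOCK_STREAM))
        except socket.gaierror:
            return False

    if __name__ == "__main__":
        print("redirect check:", "OK" if check_redirect_loop(HOST) else "LOOP")
        print("ipv4 check:", "OK" if check_ipv4(HOST) else "FAIL")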