On Thursday 23 August 2018 at 11:21 +0530, Nigel Babu wrote:
> One more piece that's missing is when we'll restart the physical
> servers. That seems to be entirely missing. The rest looks good to me
> and I'm happy to add an item to next sprint to automate the node
> rebooting.

That's covered by "as critical as the services that depend on them".

Now, the problem I do have is that some servers (myrmicinae, to name
one) take 30 minutes to reboot, and I can't diagnose or fix that
without taking hours. This is the one running gerrit/jenkins, so it's
not possible to spend time on this kind of test.

> On Tue, Aug 21, 2018 at 9:56 PM Michael Scherer <mscherer@xxxxxxxxxx>
> wrote:
>
> > Hi,
> >
> > so that's kernel reboot time again, this time courtesy of Intel
> > (again). I do not consider the issue to be "OMG the sky is
> > falling", but it is enough to take time to streamline our reboot
> > process.
> >
> > Currently, we do not have a policy or anything, and I think the
> > negotiation time around that is cumbersome:
> > - we need to reach people, which takes time and adds latency (that
> >   would be bad for an urgent issue, and likely adds undue stress
> >   while waiting)
> >
> > - we need to keep track of what was supposed to be done, which is
> >   also cumbersome
> >
> > While that would not be a problem if I had only gluster to deal
> > with, my team of 3 has to deal with more than one project, and
> > orchestrating choices for a dozen groups is time consuming (just
> > think of the last time you had to go to a restaurant after a
> > conference to see how hard it is to reach agreements).
> >
> > So I would propose that we simplify that with the following policy:
> >
> > - Jenkins builders would be rebooted by jenkins on a regular basis.
> >   I do not know how we can do that, but given that we have enough
> >   nodes to sustain builds, it shouldn't impact developers in a big
> >   way. The only exception is the freebsd builder, since we only have
> >   1 functional at the moment. But once the 2nd is working, it should
> >   be treated like the others.
> >
> > - services in HA (firewall, reverse proxy, internal squid/DNS) would
> >   be rebooted during the day without notice. Since HA is working,
> >   that has no impact on users. In fact, that's already what I do.
> >
> > - services not in HA should be pushed towards HA (gerrit might get
> >   there one day, no way for jenkins :/, need to see for postgres,
> >   and so for fstat/softserve, and maybe try to get something for
> >   download.gluster.org)
> >
> > - services that are critical and not in HA should have their reboots
> >   announced in advance. Critical means the services listed here:
> >   https://gluster-infra-docs.readthedocs.io/emergency.html
> >
> > - services not visible to end users (backup servers, ansible
> >   deployment, etc.) can be rebooted at will
> >
> > Then the only question is what to do about stuff not in the
> > previous categories, like softserve and fstat.
> >
> > Also, all dependencies are as critical as the most critical service
> > that depends on them. So hypervisors hosting gerrit/jenkins are
> > critical (until we find a way to avoid outages), the ones for
> > builders are not.
> >
> > Thoughts, ideas?
> >
> > --
> > Michael Scherer
> > Sysadmin, Community Infrastructure and Platform, OSAS
> >
> > _______________________________________________
> > Gluster-infra mailing list
> > Gluster-infra@xxxxxxxxxxx
> > https://lists.gluster.org/mailman/listinfo/gluster-infra

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel