On Thursday 23 August 2018 at 11:37 +0300, Yaniv Kaul wrote:
> On Thu, Aug 23, 2018 at 10:49 AM, Michael Scherer
> <mscherer@xxxxxxxxxm> wrote:
> > On Thursday 23 August 2018 at 11:21 +0530, Nigel Babu wrote:
> > > One more piece that's missing is when we'll restart the physical
> > > servers. That seems to be entirely missing. The rest looks good
> > > to me, and I'm happy to add an item to the next sprint to
> > > automate the node rebooting.
> >
> > That's covered by "as critical as the services that depend on
> > them".
> >
> > Now, the problem I do have is that some servers (myrmicinae, to
> > name one) take 30 minutes to reboot, and I can't diagnose or fix
> > that without it taking hours. This is the one running
> > gerrit/jenkins, so it's not possible to spend time on that kind
> > of test.
>
> You'd imagine people would have moved to kexec reboots for VMs by
> now. Not sure why it's not catching on.
> (BTW, is it taking time to shut down or to bring up?)
> Y.

To bring up, according to my notes. And I am not sure how kexec would
work with microcode updates. We also need to upgrade the BIOS at some
point :/

> > > On Tue, Aug 21, 2018 at 9:56 PM Michael Scherer
> > > <mscherer@redhat.com> wrote:
> > > > Hi,
> > > >
> > > > So it's kernel reboot time again, this time courtesy of Intel
> > > > (again). I do not consider the issue to be "OMG, the sky is
> > > > falling", but it is enough to take the time to streamline our
> > > > reboot process.
> > > >
> > > > Currently we do not have a policy, and I think the negotiation
> > > > time around each reboot is cumbersome:
> > > >
> > > > - we need to reach people, which takes time and adds latency
> > > > (that would be bad for an urgent issue, and likely adds undue
> > > > stress while waiting)
> > > >
> > > > - we need to keep track of what was supposed to be done, which
> > > > is also cumbersome
> > > >
> > > > While that would not be a problem if I had only Gluster to
> > > > deal with, my team of 3 deals with more than one project, and
> > > > orchestrating the choice for a dozen groups is time consuming
> > > > (just think of the last time you had to pick a restaurant
> > > > after a conference to see how hard it is to reach an
> > > > agreement).
> > > >
> > > > So I would propose that we simplify this with the following
> > > > policy:
> > > >
> > > > - Jenkins builders would be rebooted by Jenkins on a regular
> > > > basis. I do not know how we can do that, but given that we
> > > > have enough nodes to sustain builds, it shouldn't impact
> > > > developers in a big way. The only exception is the FreeBSD
> > > > builder, since we only have one functional at the moment. But
> > > > once the 2nd one is working, it should be treated like the
> > > > others.
> > > >
> > > > - Services in HA (firewall, reverse proxy, internal squid/DNS)
> > > > would be rebooted during the day without notice. Thanks to
> > > > working HA, that is not user impacting. In fact, that's
> > > > already what I do.
> > > >
> > > > - Services not in HA should be pushed toward HA (gerrit might
> > > > get there one day, no way for jenkins :/, need to see for
> > > > postgres and so fstat/softserve, and maybe try to get
> > > > something for download.gluster.org).
> > > >
> > > > - Services that are critical and not in HA should be announced
> > > > in advance. Critical means the services listed here:
> > > > https://gluster-infra-docs.readthedocs.io/emergency.html
> > > >
> > > > - Services not visible to end users (backup servers, ansible
> > > > deployment, etc.) can be rebooted at will.
> > > >
> > > > Then the only question is what to do about things not in the
> > > > previous categories, like softserve and fstat.
> > > >
> > > > Also, every dependency is as critical as the most critical
> > > > service that depends on it. So the hypervisors hosting
> > > > gerrit/jenkins are critical (until we find a way to avoid an
> > > > outage), while the ones for the builders are not.
> > > >
> > > > Thoughts, ideas?
> > > >
> > > > --
> > > > Michael Scherer
> > > > Sysadmin, Community Infrastructure and Platform, OSAS
> > > >
> > > > _______________________________________________
> > > > Gluster-infra mailing list
> > > > Gluster-infra@xxxxxxxxxxx
> > > > https://lists.gluster.org/mailman/listinfo/gluster-infra
> >
> > --
> > Michael Scherer
> > Sysadmin, Community Infrastructure and Platform, OSAS
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel@xxxxxxxxxxx
> > https://lists.gluster.org/mailman/listinfo/gluster-devel

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
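For reference, the kexec reboot Yaniv mentions (which jumps straight into a new kernel and skips the slow firmware/POST phase of bring-up) could be sketched roughly as below. This is a minimal illustration, not what the Gluster infra runs: the /boot paths assume RHEL/CentOS naming, and the DRY_RUN guard is purely for safe demonstration.

```shell
#!/bin/sh
# Hypothetical sketch of a kexec-based "fast reboot".
# DRY_RUN defaults to 1, so commands are printed instead of executed;
# set DRY_RUN=0 and run as root to actually perform the kexec.
set -eu

KVER="$(uname -r)"                    # stage the currently running kernel
KERNEL="/boot/vmlinuz-${KVER}"
INITRD="/boot/initramfs-${KVER}.img"  # RHEL-style name; Debian uses initrd.img-*

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$@"; else "$@"; fi
}

# Load the new kernel into memory, reusing the current kernel command
# line, then jump to it without going through BIOS/UEFI.
run kexec -l "$KERNEL" --initrd="$INITRD" --reuse-cmdline
run systemctl kexec
```

Note the caveat raised above still applies: a kexec reboot skips the firmware, so BIOS upgrades (and possibly microcode updates applied at boot) would still need a full power cycle.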