Re: [Gluster-infra] Reboot policy for the infra

Michael Scherer <mscherer@xxxxxxxxxx> · Thu, 23 Aug 2018 09:49:44 +0200

On Thursday, 23 August 2018 at 11:21 +0530, Nigel Babu wrote:
> One more piece that's missing is when we'll restart the physical
> servers. That seems to be entirely missing. The rest looks good to me
> and I'm happy to add an item to next sprint to automate the node
> rebooting.

That's covered by the rule that dependencies are "as critical as the
services that depend on them".
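
To make that rule concrete, here is a minimal sketch of how it could be
expressed. It is only an illustration, not something we run: the two-level
scale, the names Level, SERVICE_LEVEL, HOSTED_ON and host_level, and the
host "builder-host" are all assumptions made up for the example.

```python
from enum import IntEnum


class Level(IntEnum):
    AT_WILL = 0   # can be rebooted without announcement
    CRITICAL = 1  # critical and not in HA: announce the reboot in advance


# Illustrative inventory only (assumption, not the real infra data).
SERVICE_LEVEL = {
    "gerrit": Level.CRITICAL,
    "jenkins": Level.CRITICAL,
    "builder": Level.AT_WILL,
}

# Host -> services that depend on it. "builder-host" is a made-up name.
HOSTED_ON = {
    "myrmicinae": ["gerrit", "jenkins"],
    "builder-host": ["builder"],
}


def host_level(host: str) -> Level:
    """A host is as critical as the most critical service depending on it."""
    levels = (SERVICE_LEVEL[name] for name in HOSTED_ON.get(host, []))
    # A host with no known dependent service defaults to AT_WILL.
    return max(levels, default=Level.AT_WILL)


if __name__ == "__main__":
    for host in HOSTED_ON:
        print(host, host_level(host).name)
```

With that kind of mapping, a physical server inherits its reboot rules from
whatever runs on it, so no separate category is needed for hardware.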

Now, the problem I have is that one server (myrmicinae, to name it)
takes 30 minutes to reboot, and I can't diagnose or fix that without
spending hours on it. This is the one running gerrit/jenkins, so it is
not possible to spend time on this kind of test.

> On Tue, Aug 21, 2018 at 9:56 PM Michael Scherer <mscherer@xxxxxxxxxx>
> wrote:
> 
> > Hi,
> > 
> > so that's kernel reboot time again, this time courtesy of Intel
> > (again). I do not consider the issue to be "OMG the sky is falling",
> > but it is serious enough to take the time to streamline our reboot
> > process.
> > 
> > 
> > 
> > Currently, we do not have a policy or anything, and I think the
> > negotiation around each reboot is cumbersome:
> > - we need to reach people, which takes time and adds latency (that
> > would be bad for an urgent issue, and likely adds undue stress while
> > waiting)
> > 
> > - we need to keep track of what was supposed to be done, which is
> > also cumbersome
> > 
> > While that would not be a problem if I had only gluster to deal
> > with, my team of 3 has to deal with more than one project, and
> > orchestrating choices for a dozen groups is time consuming (just
> > think of the last time you had to pick a restaurant after a
> > conference to see how hard it is to reach agreement).
> > 
> > So I would propose that we simplify that with the following policy:
> > 
> > - Jenkins builders would be rebooted by jenkins on a regular basis.
> > I do not know how we can do that, but given that we have enough
> > nodes to sustain builds, it shouldn't impact developers in a big
> > way. The only exception is the freebsd builder, since we only have
> > 1 functional at the moment. But once the 2nd is working, it should
> > be treated like the others.
> > 
> > - services in HA (firewall, reverse proxy, internal squid/DNS) would
> > be rebooted during the day without notice. Since HA is working, that
> > does not impact users. In fact, that's already what I do.
> > 
> > - services not in HA should be pushed towards HA (gerrit might get
> > there one day, no way for jenkins :/, we need to see for postgres
> > and therefore fstat/softserve, and maybe try to get something for
> > download.gluster.org)
> > 
> > - reboots of services that are critical and not in HA should be
> > announced in advance. Critical means the services listed here:
> > https://gluster-infra-docs.readthedocs.io/emergency.html
> > 
> > - services not visible to end users (backup servers, ansible
> > deployment, etc.) can be rebooted at will
> > 
> > Then the only question is what to do about stuff not in the previous
> > categories, like softserve and fstat.
> > 
> > Also, all dependencies are as critical as the most critical service
> > that depends on them. So hypervisors hosting gerrit/jenkins are
> > critical (until we find a way to avoid outages), while the ones for
> > builders are not.
> > 
> > 
> > 
> > Thoughts, ideas?
> > 
> > 
> > --
> > Michael Scherer
> > Sysadmin, Community Infrastructure and Platform, OSAS
> > 
> > _______________________________________________
> > Gluster-infra mailing list
> > Gluster-infra@xxxxxxxxxxx
> > https://lists.gluster.org/mailman/listinfo/gluster-infra
> 
> 
> 
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel