[Fwd: [Gluster-infra] Reboot policy for the infra]

Michael Scherer <mscherer@xxxxxxxxxx> · Wed, 22 Aug 2018 10:29:01 +0200

Forwarding this, because I can't type gluster-devel properly and use
gluster-dev each time :p

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS

--- Begin Message ---

To: gluster-infra@xxxxxxxxxxx, gluster-dev <gluster-dev@xxxxxxxxxxx>
Subject: [Gluster-infra] Reboot policy for the infra
From: Michael Scherer <mscherer@xxxxxxxxxx>
Date: Tue, 21 Aug 2018 18:25:20 +0200

Hi,

so it's kernel reboot time again, this time courtesy of Intel
(again). I do not consider the issue to be "OMG the sky is falling",
but it is serious enough to take the time to streamline our reboot
process.

Currently, we do not have a policy or anything, and I think the
negotiation around each reboot is cumbersome:
- we need to reach people, which takes time and adds latency (that
would be bad for an urgent issue, and likely adds unneeded stress
while waiting)

- we need to keep track of what was supposed to be done, which is also
cumbersome

That would not be a problem if I had only gluster to deal with, but my
team of 3 has to deal with more than one project, and orchestrating
choices for a dozen groups is time consuming (just think of the last
time you had to pick a restaurant after a conference to see how hard
it is to reach agreement).

So I would propose that we simplify that with the following policy:

- Jenkins builders would be rebooted by Jenkins on a regular basis.
I do not know how we can do that, but given that we have enough nodes
to sustain builds, it shouldn't impact developers in a big way. The
only exception is the FreeBSD builder, since we only have 1 functional
at the moment. But once the 2nd is working, it should be treated like
the others.

- services in HA (firewall, reverse proxy, internal squid/DNS) would be
rebooted during the day without notice. Since HA is working, that has
no user impact. In fact, that's already what I do.

- services not in HA should be pushed toward HA (gerrit might get there
one day, no way for jenkins :/, we need to look at postgres and thus
fstat/softserve, and maybe try to get something for
download.gluster.org)

- reboots of critical services not in HA should be announced in
advance. Critical means the services listed here:
https://gluster-infra-docs.readthedocs.io/emergency.html

- services not visible to end users (backup servers, ansible
deployment, etc.) can be rebooted at will

Then the only open question is what to do about things that fit none
of the previous categories, like softserve and fstat.
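
To make the categories above easier to discuss, here is a rough Python
sketch of the proposal as a decision function. It is only an
illustration: the class, field and enum names are made up for this
example and do not exist anywhere in our infra, and the order in which
the rules are checked is my assumption. The lone FreeBSD builder is
left as "undecided" because the proposal only says it is an exception.

```python
from dataclasses import dataclass
from enum import Enum


class RebootAction(Enum):
    SCHEDULED_BY_JENKINS = "rebooted by Jenkins on a regular basis"
    DAYTIME_NO_NOTICE = "rebooted during the day without notice"
    ANNOUNCE_IN_ADVANCE = "reboot announced in advance"
    AT_WILL = "rebooted at will"
    UNDECIDED = "no rule in the proposal yet"


@dataclass(frozen=True)
class Service:
    # All fields are hypothetical labels for this sketch.
    name: str
    is_builder: bool = False      # Jenkins builder node
    has_spare_nodes: bool = True  # False for the single FreeBSD builder
    in_ha: bool = False           # highly available setup
    critical: bool = False        # listed on the emergency page
    user_visible: bool = True     # False for backup, ansible, etc.


def reboot_action(svc: Service) -> RebootAction:
    """Map a service to the reboot rule proposed in the email."""
    if svc.is_builder:
        # A builder without spares is the stated exception; the email
        # does not say how to handle it, so it stays undecided here.
        if svc.has_spare_nodes:
            return RebootAction.SCHEDULED_BY_JENKINS
        return RebootAction.UNDECIDED
    if svc.in_ha:
        return RebootAction.DAYTIME_NO_NOTICE
    if svc.critical:
        return RebootAction.ANNOUNCE_IN_ADVANCE
    if not svc.user_visible:
        return RebootAction.AT_WILL
    # Neither HA, nor critical, nor hidden: e.g. softserve, fstat.
    return RebootAction.UNDECIDED


if __name__ == "__main__":
    for svc in (
        Service("firewall", in_ha=True),
        Service("backup", user_visible=False),
        Service("softserve"),
    ):
        print(f"{svc.name}: {reboot_action(svc).value}")
```

Anything that comes out as "undecided" is exactly the set of services
we still need a rule for.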

Also, each dependency is as critical as the most critical service
that depends on it. So hypervisors hosting gerrit/jenkins are critical
(until we find a way to avoid outages), while the ones for builders
are not.
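
The dependency rule can be sketched as a small propagation over a
dependency map. This is a minimal illustration, not something we run:
the function name, the numeric criticality levels and the hypervisor
names are all invented for the example.

```python
from collections import defaultdict


def propagate_criticality(base, depends_on):
    """Raise each dependency to the level of its most critical dependent.

    base: service name -> criticality (higher means more critical).
    depends_on: service name -> list of names it depends on.
    Services missing from `base` start at 0.
    """
    effective = defaultdict(int, base)
    changed = True
    # Repeat until stable so that dependencies of dependencies are
    # covered too. Levels only ever go up and are bounded by the
    # maximum in `base`, so the loop terminates even with cycles.
    while changed:
        changed = False
        for svc, deps in depends_on.items():
            for dep in deps:
                if effective[dep] < effective[svc]:
                    effective[dep] = effective[svc]
                    changed = True
    return dict(effective)


if __name__ == "__main__":
    # Hypothetical levels: 2 = critical, 0 = not critical.
    base = {"gerrit": 2, "jenkins": 2, "builder": 0}
    # Hypothetical hypervisor names.
    depends_on = {
        "gerrit": ["hypervisor-a"],
        "jenkins": ["hypervisor-a"],
        "builder": ["hypervisor-b"],
    }
    levels = propagate_criticality(base, depends_on)
    print(levels["hypervisor-a"], levels["hypervisor-b"])
```

With these made-up inputs, the hypervisor hosting gerrit and jenkins
inherits their level, while the one hosting only builders stays at 0.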

Thoughts, ideas?

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS

Attachment:
signature.asc

Description: This is a digitally signed message part
_______________________________________________
Gluster-infra mailing list
Gluster-infra@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-infra

--- End Message ---
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel