On Fri, Sep 24, 2021 at 12:01:27PM +0900, David Kirwan wrote:
> > On the storage, are we ok if a node goes down? ie, does it spread it
> > over all the storage nodes/raid? Or is it just in one place and you are
> > dead if that node dies?
> For storage, we maintain 3 replicas of the data, spread across 3 nodes.
> However, not all nodes are equally resourced: we have 2 large nodes and 1
> much smaller one, so more replicas end up on the 2 larger nodes. We can
> likely afford to lose one of the large physical nodes while still
> maintaining data integrity.

ok. Fair enough.

> > Is there any way to back up volumes?
> There are ways to clone, extend, take snapshots, etc. of these volumes.
> We've never done it, so it'll be a learning process for us all ;). We
> should sync up to get a better handle on the requirements for backups. In
> CentOS CI we've set up backups to S3, and we can certainly reuse some of
> that, e.g. backup of etcd, but backing up the volumes managed by OCS may
> need further investigation. Will need to do some research here.

Yeah, backups of etcd would be nice, but mostly I was thinking of
applications that have persistent data. Right now we have those on netapp
NFS volumes, where it keeps snapshots and mirrors to another site.

I suppose we could just keep using NFS for data that has to persist and
just use local storage for other things, but it's sure nice to have it
dynamically provisioned. ;)

> > should we make a playbooks/manual/ocp.yml playbook for things like
> > - list of clusteradmins
> > - list of clustermonitoring
> > - anything else we want to manage post install
> Sure, yep. As we're finishing up soonish, I'd imagine the next few weeks
> we'll all be back focused on the Infra/Releng tasks, tying up any loose
> ends like this, and starting migration of apps.

ok.

> > Have we tried an upgrade of the clusters yet? Did everything go ok?
> > Do we need any docs on upgrades?
> Yes, we've already completed a number of upgrades; the latest is to
> 4.8.11. We have SOPs for upgrades which we can copy over from the CentOS
> CI infra, and we'll make any updates required in the process.

Great.

> > Since the control plane nodes are VMs, I assume we need to drain them
> > one at a time to reboot the virthosts they are on?
> If we are rebooting a single vmhost/control plane VM at a time, yes, that
> should be fine. If we are doing more than 1 at the same time, we should do
> a full graceful cluster shutdown and then a graceful cluster startup. We
> have SOPs for this in CentOS CI also; we'll get those added here and make
> any content updates needed.
>
> > * Should we now delete the kubeadmin user? In 3.x I know they advise
> > doing that after auth is set up.
> We can delete it, as we have system:admin available from the os-control01
> node. Best practices might suggest we do. We can also give the
> cluster-admin role to all users in the sysadmin-main and
> sysadmin-openshift groups.

Yeah, we should put this in the playbook so it's very clear who has this
and when it was added, etc.

> I'm in two minds about deleting it; I was hoping to wait until we get a
> solution that syncs IPA groups/users to Openshift. There is an official
> supported solution for syncing LDAP (think that will work?).

Yeah, needs investigation.
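For whenever we pick that up, here's roughly what I think the moving parts
look like on the CLI. This is just a sketch cribbed from the upstream docs,
untested against our clusters, and the LDAP url/baseDNs below are
placeholders rather than our real IPA values:

# the supported LDAP group sync is driven by an LDAPSyncConfig file, roughly:
cat > ldap-sync.yaml <<'EOF'
kind: LDAPSyncConfig
apiVersion: v1
url: ldaps://ipa.example.org      # placeholder, not our real IPA host
rfc2307:
  groupsQuery:
    baseDN: "cn=groups,cn=accounts,dc=example,dc=org"
    scope: sub
    derefAliases: never
    filter: "(objectClass=groupOfNames)"
  groupUIDAttribute: dn
  groupNameAttributes: [ cn ]
  groupMembershipAttributes: [ member ]
  usersQuery:
    baseDN: "cn=users,cn=accounts,dc=example,dc=org"
    scope: sub
    derefAliases: never
  userUIDAttribute: dn
  userNameAttributes: [ uid ]
EOF

# dry run first, then --confirm to actually create/update the Groups
oc adm groups sync --sync-config=ldap-sync.yaml
oc adm groups sync --sync-config=ldap-sync.yaml --confirm

# grant cluster-admin to the synced groups
oc adm policy add-cluster-role-to-group cluster-admin sysadmin-main
oc adm policy add-cluster-role-to-group cluster-admin sysadmin-openshift

# and last, remove the bootstrap kubeadmin user (IIRC the documented way
# in 4.x is to delete its secret)
oc delete secret kubeadmin -n kube-system

If that pans out, it could all live in the playbooks/manual/ocp.yml
playbook mentioned above, so it stays obvious who got cluster-admin and
when it was added.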
> > * Right now the api is only internal. Is it worth getting a forward
> > setup to allow folks to use oc locally on their machines? It would
> > expose that api to the world, but of course it would still need auth.
> We'd love to expose it, but... all interaction with the clusters up to
> this point has also only been done via Ansible, so if it turns out we
> can't expose the API like this, we're ok with that. With minor changes to
> the playbook we should be able to at least replicate the current 3.11
> experience.

Sure, but currently app owners can use oc on their local machines to view
logs, debug, etc. I think that's a nice thing to keep working.

> >> That's what we decided to do for the CentOS CI ocp setup, and so CI
> >> tenants can use oc from their laptop/infra. As long as the cert exposed
> >> for the default ingress has it added in the SAN, it works fine:
> >>
> >> X509v3 Subject Alternative Name:
> >>   DNS:*.apps.ocp.ci.centos.org, DNS:api.ocp.ci.centos.org,
> >>   DNS:apps.ocp.ci.centos.org
>
> > Yeah, that's all fine, but to make it work for our setup, I would need
> > to get RHIT to nat in port 6443 to proxy01/10 from the internet. At
> > least I think that's the case. Openshift 3 could just use https, but
> > alas, I fear OCP4 needs that 6443 port.
> Yep, I think you're right on that. I can put in a request for this.

> > Do we want to try and enable http/2 ingress?
> https://docs.openshift.com/container-platform/4.5/networking/ingress-operator.html#nw-http2-haproxy_configuring-ingress
> We can take a look and see if we can figure it out!

ok.

> > We will want to enable kubevirt/whatever it's called...
> We definitely want to make this available, but we will have to set quotas
> on usage. We should enable it on staging, but should we enable it on
> production?

We can start with staging, test, and see what usage might be before we go
to prod.

> On the CentOS CI OCP4 cluster, we have Openshift Virtualization / kubevirt
> installed, but I don't think anyone is actually using it. We have several
> tenants which have elevated permissions and are then accessing KVM
> directly to bring up VMs on the Openshift nodes. This is something we want
> to avoid, as we can't effectively set quotas on that type of usage.

Yeah, I think the FCOS folks are a user there? They might be migrated to
our new cluster, so they would likely need the same perms here. ;(

Thanks!

kevin
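P.S. On the http/2 ingress question: if I'm reading that doc page right,
it's just an annotation on the IngressController (or on the cluster-wide
ingress config). Something like the below; I haven't tried it on our
clusters yet, so treat it as a sketch to verify against the docs:

# enable http/2 on the default ingress controller
oc -n openshift-ingress-operator annotate ingresscontrollers/default \
    ingress.operator.openshift.io/default-enable-http2=true

# or cluster wide:
# oc annotate ingresses.config/cluster \
#     ingress.operator.openshift.io/default-enable-http2=true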