We've updated a number of SOPs since then, e.g. for upgrades. If you could take another look and +1 the PR if you're happy with it, that would be great!
We've installed and tested OpenShift Virtualization on the staging cluster, and it's working well, though it currently requires cluster-admin to use. In the coming days we'll continue to work on the remaining tasks, such as quotas and permissions, so we can easily hand it out to users.
On Sat, 25 Sept 2021 at 00:55, Kevin Fenzi <kevin@xxxxxxxxx> wrote:
On Fri, Sep 24, 2021 at 12:01:27PM +0900, David Kirwan wrote:
> > On the storage, are we ok if a node goes down? ie, does it spread it
> > over all the storage nodes/raid? Or is it just in one place and you are
> > dead if that node dies?
> For storage, we maintain 3 replicas for data, spread across 3 nodes.
> However, not all nodes are equally resourced: we have 2 large nodes and 1
> much smaller one, so more replicas end up on the 2 larger
> nodes. We can likely afford to lose one of the large physical nodes while
> still maintaining data integrity.
ok. Fair enough.
> > Is there any way to backup volumes?
> There are ways to clone, extend, take snapshots etc of these volumes. We've
> never done it, so it'll be a learning process for us all ;). We should sync
> to get a better handle on the requirements for backups. In CentOS CI we've
> set up backups to S3, and we can certainly use some of that, e.g. backup of
> etcd, but may need further investigation to backup the volumes managed by
> OCS. Will need to do some research here.
Yeah, backups of etcd would be nice, but mostly I was thinking of
applications that have persistent data. Right now we have those on
netapp NFS volumes, where it keeps snapshots and mirrors to another
site. I suppose we could just keep using NFS for data that has to
persist and just use local for other things, but it's sure nice to have
it dynamically provisioned. ;)
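For reference, the etcd piece at least can be scripted along these lines. This is only a sketch: the `cluster-backup.sh` helper is the standard one shipped on OCP 4 control-plane hosts, but the S3 bucket name and paths below are illustrative, not our actual setup.

```shell
#!/bin/bash
# Sketch: back up etcd from one control-plane node and push it to S3.
# Assumes oc is logged in with cluster-admin and the aws CLI is configured.
set -euo pipefail

# Pick the first control-plane node.
NODE=$(oc get nodes -l node-role.kubernetes.io/master \
    -o jsonpath='{.items[0].metadata.name}')

# cluster-backup.sh writes an etcd snapshot plus the static pod resources
# into the given directory on the node.
oc debug node/"${NODE}" -- chroot /host \
    /usr/local/bin/cluster-backup.sh /home/core/assets/backup

# Copy the snapshot off the node and into S3 (bucket name is hypothetical).
oc debug node/"${NODE}" -- chroot /host \
    tar -C /home/core/assets -czf - backup > etcd-backup.tar.gz
aws s3 cp etcd-backup.tar.gz "s3://fedora-ocp-backups/etcd/$(date +%F).tar.gz"
```

This only covers etcd; PVs managed by OCS would still need the separate investigation mentioned above.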
> > should we make a playbooks/manual/ocp.yml playbook for things like
> > - list of clusteradmins
> > - list of clustermoniting
> > - anything else we want to manage post install
> Sure yep, as we're finishing up soonish, I'd imagine the next few weeks
> we'll all be back focused on the Infra/Releng tasks and will be focusing on
> tying up any loose ends like this, and starting migration of apps.
ok.
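A sketch of what such a playbooks/manual/ocp.yml could look like (the host, group name, and user list here are placeholders, not the real values):

```yaml
# Sketch: playbooks/manual/ocp.yml -- manage post-install cluster state.
# The "ocp-admins" group and user list are illustrative placeholders.
- name: Manage OpenShift post-install configuration
  hosts: os-control01
  vars:
    ocp_cluster_admins:
      - someadmin1
      - someadmin2
  tasks:
    - name: Ensure the admin group exists
      command: oc adm groups new ocp-admins
      register: newgroup
      changed_when: newgroup.rc == 0
      failed_when: newgroup.rc != 0 and 'already exists' not in newgroup.stderr

    - name: Ensure members are in the admin group
      command: "oc adm groups add-users ocp-admins {{ item }}"
      loop: "{{ ocp_cluster_admins }}"

    - name: Bind cluster-admin to the group
      command: oc adm policy add-cluster-role-to-group cluster-admin ocp-admins
```

That way who has cluster-admin (and the monitoring equivalent) lives in version control rather than only in the cluster.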
> > Have we tried a upgrade of the clusters yet? Did everything go ok?
> > Do we need any docs on upgrades?
> Yes, we've already completed a number of upgrades, latest is to 4.8.11. We
> have SOPs for upgrades which we can copy over from the CentOS CI infra, and
> will make any updates required in the process.
Great.
> > Since the control plane are vm's I assume we need to drain them one at
> > a time to reboot the virthosts they are on?
> If we are rebooting a single vmhost/control plane VM at a time, yes that
> should be good. If we are doing more than 1 at the same time, we should do
> a full graceful cluster shutdown, and then a graceful cluster startup. We
> have SOPs for this in CentOS CI also, we'll get those added here and any
> content updates made.
>
> > * Should we now delete the kubeadmin user? In 3.x I know they advise to
> > do that after auth is setup.
> We can delete it, as we have system:admin available from the os-control01
> node. Best practices might suggest we do. We can also give cluster-admin
> role to all users in the sysadmin-main and sysadmin-openshift groups.
Yeah, we should put this in the playbook so it's very clear who has this
and when it was added, etc.
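For reference, once another cluster-admin is confirmed working, disabling kubeadmin is a one-liner; the account is backed by a secret in kube-system:

```shell
# Verify a non-kubeadmin cluster-admin login works BEFORE running this;
# deleting the secret disables the kubeadmin account permanently.
oc delete secret kubeadmin -n kube-system
```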
> I'm in two minds about deleting it, I was hoping to wait until we get a
> solution that syncs IPA groups/users to OpenShift. There is an officially
> supported solution for syncing LDAP (think that will work?).
Yeah, needs investigation.
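For the record, the supported LDAP sync takes a config file like the sketch below, run periodically with `oc adm groups sync --sync-config=ldap-sync.yaml --confirm`. The server, base DNs, and filter are illustrative guesses at an IPA layout, not our real values:

```yaml
# Sketch: LDAPSyncConfig for pulling IPA groups into OpenShift.
# Host, DNs, and filter are illustrative, not the real IPA values.
kind: LDAPSyncConfig
apiVersion: v1
url: ldaps://ipa01.example.org
insecure: false
rfc2307:
  groupsQuery:
    baseDN: "cn=groups,cn=accounts,dc=example,dc=org"
    scope: sub
    derefAliases: never
    filter: (objectClass=ipausergroup)
  groupUIDAttribute: dn
  groupNameAttributes: [ cn ]
  groupMembershipAttributes: [ member ]
  usersQuery:
    baseDN: "cn=users,cn=accounts,dc=example,dc=org"
    scope: sub
    derefAliases: never
  userUIDAttribute: dn
  userNameAttributes: [ uid ]
```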
> > * Right now the api is only internal. Is it worth getting a forward
> > setup to allow folks to use oc locally on their machines? It would
> > expose that api to the world, but of course it would still need auth.
> We'd love to expose it, but... all interaction with the clusters up to this
> point have also only been done via Ansible, so if it turns out we can't
> expose the API like this we're ok with that. With minor changes to the
> playbook we should be able to at least replicate the current 3.11
> experience.
Sure, but currently app owners can use oc on their local machines to
view logs, debug, etc. I think that's a nice thing to keep working.
> >> That's what we decided to do for the CentOS CI ocp setup, and so CI
> >> tenants can use oc from their laptop/infra. As long as cert exposed for
> >> default ingress has it added in the SAN, it works fine :
> >>
> >> X509v3 Subject Alternative Name:
> >> DNS:*.apps.ocp.ci.centos.org, DNS:api.ocp.ci.centos.org,
> >> DNS:apps.ocp.ci.centos.org
>
> > Yeah, that's all fine, but to make it work for our setup, I would need to
> > get RHIT to NAT in port 6443 to proxy01/10 from the internet. At least I
> > think that's the case. Openshift 3 could just use https, but alas, I fear
> > OCP4 needs that 6443 port.
> Yep think you're right on that.
I can put in a request for this.
> >Do we want to try and enable http/2 ingress?
> https://docs.openshift.com/container-platform/4.5/networking/ingress-operator.html#nw-http2-haproxy_configuring-ingress
> We can take a look and see if we can figure it out!
ok.
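Per the linked ingress-operator doc, enabling HTTP/2 on the default IngressController is a single annotation (shown here as a sketch; it can also be set cluster-wide on the ingress config):

```shell
# Enable HTTP/2 for routes served by the default IngressController.
oc -n openshift-ingress-operator annotate ingresscontrollers/default \
    ingress.operator.openshift.io/default-enable-http2=true
```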
> > We will want to enable kubevirt/whatever it's called...
> We definitely want to make this available, but we will have to set quotas
> on usage. We should enable on staging, but should we enable on production?
We can start with staging and test and see what usage might be before we
go to prod.
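As a starting point on the quota side, a standard ResourceQuota should cover VM workloads, since kubevirt VM pods consume ordinary cpu/memory requests; the namespace and limits below are illustrative only:

```yaml
# Sketch: cap VM usage in a tenant namespace (all values illustrative).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: vm-quota
  namespace: some-tenant          # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 32Gi
    limits.cpu: "16"
    limits.memory: 64Gi
    count/virtualmachines.kubevirt.io: "4"   # object-count quota on VMs
```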
> On CentOS CI OCP4 cluster, we have Openshift Virtualization / kubevirt
> installed, but I don't think anyone is actually using *it*. We have several
> tenants which have elevated permissions, and are then accessing KVM
> directly to bring up VMs on the Openshift nodes, this is something we want
> to avoid, as we can't effectively set quotas on this type of usage.
Yeah, I think FCOS folks are a user there? They might be migrated to our
new cluster, so they would likely need the same perms here. ;(
Thanks!
kevin
_______________________________________________
infrastructure mailing list -- infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to infrastructure-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
--
David Kirwan
Software Engineer
Community Platform Engineering @ Red Hat
T: +(353) 86-8624108 IM: @dkirwan