Datacenter move days 3 and 4

Kevin Fenzi <kevin@xxxxxxxxx> · Thu, 11 Jun 2020 21:09:53 -0700

Greetings all. 

This email covers days 3 and 4, since by the time I was going to send
yesterday's it was late and mailman was still down anyhow. :) 

So, yesterday started out seeming like a pretty simple day, but didn't
turn out that way. We planned to move only two things and work on fixing
issues from the buildsystem and other moves in the first two days. 

* datagrepper / datanommer. This took until this morning as the database
is really gigantic. Again, we wanted to load it into a more modern
postgres. Now that it's moved and on postgres 12.2, we will be looking
into partitioning the data (perhaps by month? quarter?) so queries for
anything recent are much faster.

* mailman / lists: This turned out to be our biggest problem of the
move. :( We are working on getting this install moved over to recent
fedora or rhel, but for now it's rhel7 and python34. Because of that we
decided to just copy the entire instance over and adjust it, rather than
do a fresh install. The copy ran most of the day and was nearing
completion, but then we accidentally resized the original instance. :(
We resized it back, but the filesystem was messed up and the instance
would no longer boot. It was at this point we decided that lack of sleep
could lead to poor decisions and mistakes, so we started copying the
data from the copy to another freshly installed instance and went and
got some sleep.

The next day, in a stroke of luck, the copy we were doing had already copied
all the disk that had data on it, so we were able to fsck it and resize
it and we were back in business. mailman/lists was back up this morning
and happily processing away. 
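
On the datanommer partitioning idea above: a minimal sketch of what
per-month partitions could look like with postgres declarative range
partitioning. This is not what we have deployed. The parent table name
"messages" is an assumption for illustration, the helper names are
hypothetical, and the parent would already have to be created with
PARTITION BY RANGE on its timestamp column. The script only prints the
DDL; it does not touch any database.

```python
from datetime import date
from typing import List


def next_month(d: date) -> date:
    """Return the first day of the month following d."""
    if d.month == 12:
        return date(d.year + 1, 1, 1)
    return date(d.year, d.month + 1, 1)


def monthly_partition_ddl(parent: str, first: date, count: int) -> List[str]:
    """Build CREATE TABLE ... PARTITION OF statements, one per month.

    parent is a hypothetical table name; it is assumed to be declared
    with PARTITION BY RANGE on a timestamp column.
    """
    statements = []
    start = date(first.year, first.month, 1)  # snap to start of month
    for _ in range(count):
        end = next_month(start)
        name = f"{parent}_{start:%Y_%m}"
        # Range bounds are inclusive at FROM and exclusive at TO.
        statements.append(
            f"CREATE TABLE {name} PARTITION OF {parent} "
            f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
        )
        start = end
    return statements


if __name__ == "__main__":
    for stmt in monthly_partition_ddl("messages", date(2020, 6, 1), 3):
        print(stmt)
```

With partitions like these, a query restricted to a recent time range
only has to scan the partitions covering that range. Quarterly
partitions would work the same way with wider bounds.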

Today, in addition to finishing the above two migrations from yesterday,
we moved: 

* openqa. Right now it doesn't have any arm or power workers, but we
have some almost ready to go there that we should have in place next
week.

* Various openshift apps (docsbuilding, websites building, cron jobs,
etc). We even have release-monitoring and the new hotness up and
running. I am trying to bring koschei up as well, but it needs some more
work. 

* Some small misc apps: blockerbugs, kerneltest, etc. 

* We also fixed tons and tons of issues all over the map. Mostly around
things reaching other things or something not running for some
configuration reason.

At this point everything we planned to be in the minimal fedora should
be up and working. We do have more capacity than we need, so if things
go smoothly without too many more things to fix, I'd like to see about
bringing up badges if we can do it easily, as it's a popular app. 

Tomorrow and this weekend we are going to work on taking things down in
the old datacenter and get them ready for shipping next week. They will
be in transit next week, then we hopefully can get them racked and built
and start adding capacity back the week after. 

So, if you notice something not working now, please do look to see if
there's already a ticket on it, and if not please file one. 
( https://pagure.io/fedora-infrastructure/issues ).

Overall things went pretty well from my view, and I would really like to
thank the awesome fedora community for being patient with us. I was
pretty surprised how few people asked why things were down, and when they
did, other community members were quick to tell them.

kevin
_______________________________________________
infrastructure mailing list -- infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to infrastructure-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/infrastructure@xxxxxxxxxxxxxxxxxxxxxxx