build infrastructure issues today

Alfredo Deza <adeza@xxxxxxxxxx> · Fri, 11 Jan 2019 15:17:27 -0500

Hi,

Earlier today (at about 7am EST) the infrastructure provider restarted
a few of our core instances, which took about half an hour to get back
into an operational state.

Even though these instances are load balanced and have replication,
the service (shaman.ceph.com) couldn't survive having all of its
instances restarted.

After they recovered and started serving content again, the databases
that keep all the build information were in read-only mode. The
primary database was not operational: the hard restart had caused it
not to come back up properly.

This situation went unnoticed because the primary database is the only
one in charge of writes, so builds didn't see failures until very late
in the process, when trying to update the build dashboard (a write
operation).

The databases and the service are fully operational now, but this means
that most of the builds today (up until 2:30pm EST) didn't complete.
In order to re-trigger them one has to produce new commits (if pushing
to ceph-ci.git) or simply run `git commit --amend --reset-author` and
then re-push.

To prevent this from going unnoticed again, a health check will be
added to ensure this problem is caught early.
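
As a rough illustration of the idea (not the actual check that will be
deployed), a health check for this failure mode has to attempt a write,
since reads kept working while the databases were read-only. The sketch
below uses Python's built-in sqlite3 module purely as a stand-in; the
real database, the `health_probe` table name and the `can_write`
function are all assumptions made for the example.

```python
import sqlite3


def can_write(conn):
    """Return True if a throwaway write succeeds on this connection.

    Assumption: `health_probe` is a hypothetical scratch table used only
    by this check. sqlite3 stands in for whatever database is really used.
    """
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS health_probe (checked_at TEXT)")
        conn.execute("INSERT INTO health_probe VALUES (datetime('now'))")
        return True
    except sqlite3.OperationalError:
        # Raised when the database is read-only (and for other operational
        # problems such as a locked database), so report "unhealthy".
        return False
    finally:
        # Discard the probe row so the check leaves no data behind.
        conn.rollback()
```

A monitor could call something like this on a schedule and alert as
soon as it returns False, instead of waiting for a build to fail on its
final dashboard update.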

Sorry for the issues.

-Alfredo