build infrastructure issues today


 



Hi,

Earlier today (at about 7am EST) the infrastructure provider restarted
a few of our core instances, which took about half an hour to become
operational again.

Even though these instances are load balanced and have replication,
the service (shaman.ceph.com) couldn't survive having all of its
instances restarted.

After they recovered and started serving content again, the databases
that keep all the build information were in read-only mode. The
primary database was not operational: the hard restart had caused
it not to come back up properly.

This situation went unnoticed because the primary database is the one
in charge of writes, so builds didn't see failures until very late
in the process, when trying to update the build dashboard (a write
operation).

The databases and the service are fully operational now, but this means
that most of the builds today (up until 2:30pm EST) didn't complete.
To re-trigger them, one has to produce new commits (if pushing
to ceph-ci.git), or simply run `git commit --amend --reset-author` and
then re-push.

To prevent this from going unnoticed again, a health check will be
added so that the problem is caught early.
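
As a rough idea of what such a check could look like, here is a
minimal sketch that probes whether a database connection accepts
writes. It is not the actual check that will be deployed: it uses
Python's built-in sqlite3 module purely as a stand-in for the real
database, and the `health_probe` table and the function name are
hypothetical.

```python
import sqlite3

# Hypothetical table reserved for the probe; it must already exist.
PROBE_TABLE = "health_probe"


def database_is_writable(conn):
    """Return True if a throwaway INSERT succeeds on this connection.

    The write is always rolled back, so the probe leaves no rows behind.
    """
    try:
        conn.execute(f"INSERT INTO {PROBE_TABLE} (checked) VALUES (1)")
        return True
    except sqlite3.Error:
        # A read-only or otherwise broken database ends up here.
        return False
    finally:
        conn.rollback()
```

Running a probe like this periodically and alerting on a False result
would surface a read-only database right away, instead of late in a
build when the dashboard update fails.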

Sorry for the issues.

-Alfredo


