Ceph Community Infrastructure Outage

Hi everyone,

From November into January, we experienced a series of outages with the
Ceph Community Infrastructure and its services:


   - Mailing lists
      - https://lists.ceph.io
   - Sepia (testing infrastructure)
      - https://wiki.sepia.ceph.com
      - https://pulpito.ceph.com
      - https://chacra.ceph.com
      - https://shaman.ceph.com
      - VPN to access testing services
   - Etherpad
      - https://pad.ceph.com
   - Images
      - https://quay.ceph.io
   - Git mirror
      - https://git.ceph.com
   - https://ceph.io
   - Telemetry <https://telemetry-public.ceph.com/>


These services are now mostly restored, but we did experience some data
loss, notably in our mailing lists. We have restored them from backups, but
subscription changes made after July 2021 need to be repeated. If you
subscribed or unsubscribed since then, please check your settings for the
appropriate list at https://lists.ceph.io. If your posts to our mailing
lists now require moderator approval, that is also an indication that you
need to re-subscribe to the appropriate lists.

Keep an eye out for emails with subject lines such as “Your message to
ceph-users@xxxxxxx awaits moderator approval”.
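
If you would rather re-subscribe by email than through the web interface,
Mailman-based lists (like ours, judging from the -leave address in the
footer below) generally also accept subscription requests sent to the
list's -join address. The sketch below is only an illustration: the join
address is masked here just as it is elsewhere in this message, and it
assumes an SMTP relay reachable on localhost.

    # Hypothetical sketch: re-subscribe by emailing the list's -join address.
    # Substitute the real list domain, your own address, and your SMTP host.
    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["From"] = "you@example.com"          # placeholder subscriber address
    msg["To"] = "ceph-users-join@xxxxxxx"    # placeholder join address (masked)
    msg["Subject"] = "subscribe"             # subject/body are not required

    with smtplib.SMTP("localhost") as smtp:  # assumption: local or configured relay
        smtp.send_message(msg)

Of course, sending a plain email to the -join address from your mail client
works just as well; the script is only a convenience for automation.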

When the community infrastructure was first created in late 2014, the VM
cluster management software the team selected had the advantage of being
widely used and familiar to the lab administrators, but it did not support
Ceph as a storage backend at the time. As services grew, we relied more and
more on its legacy storage solution, which was never migrated to Ceph. Over
the last few months, that legacy storage suffered several instances of
silent data corruption, rendering VMs unbootable, taking down various
services, and in many cases requiring restoration from backups.

We are moving these services to a more reliable, mostly container-based
infrastructure backed by Ceph, and we are planning longer-term improvements
to monitoring, backups, deployment, and other pieces of the project
infrastructure.

This event highlights the need to better support the project
infrastructure. A handful of contributors have stepped up to restore these
services, but we need an invested team focused on maintaining and improving
it over the long term.

If you or your company is looking for a great way to contribute to the Ceph
community, this could be your opportunity. Please contact council@xxxxxxx
if you can provide time to contribute to the Ceph Community Infrastructure
and would like to join the team. You can also join the upstream #sepia
Slack channel to participate in these discussions using this link:
https://join.slack.com/t/ceph-storage/shared_invite/zt-1n1eh6po5-PF9sokUSooOf1ZkVdqrPUQ

Unfortunately, these events have slowed down our upstream development and
releases. We are currently working on publishing the next Pacific point
release. The development freeze and release deadline for the Reef release
will likely be pushed out, with more discussion to follow in the Ceph
Leadership Team meetings.

- The Ceph Leadership Team
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



