As you may know, the Sepia Long Running Cluster has been hitting capacity limits over the past week or so. This has resulted in service disruptions to teuthology runs, chacra.ceph.com, docker-mirror.front.sepia.ceph.com, and quay.ceph.io. We've been able to get by with more aggressive log deletion/compression, but that's neither ideal nor sustainable.

Patrick has created a new erasure-coded pool/filesystem that will allow us to keep the same amount of logs while using less space [1]. In order to have teuthology workers start writing logs to that pool, we need to take an outage.

At 0400 UTC 19AUG2020, I will instruct all teuthology workers to die after their running jobs finish. At 1300 UTC, I will kill any jobs that are still running. This gives the lab 9 hours to shut down gracefully. At that point, we will switch the mountpoint on teuthology.front over to the new EC pool and start storing new logs there [2]. At the same time, Patrick will start migrating logs from the existing/old pool to the new pool [3]. This means that logs from 7/20 through 8/19 will be unavailable (you'll see 404s) via the Pulpito web UI and qa-proxy URLs until they're migrated to the new EC pool.

Let me know if you have any questions/concerns.

Thanks,

--
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway
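
[1] For anyone curious about the mechanics, setting up an EC data pool for CephFS generally looks something like the sketch below. The profile parameters and the pool/filesystem names are illustrative only, not necessarily what Patrick used:

  # define an erasure-code profile (k/m values chosen purely for illustration)
  ceph osd erasure-code-profile set teuth_ec_profile k=4 m=2
  # create the EC pool; CephFS requires overwrites enabled on EC data pools
  ceph osd pool create cephfs_data_ec 64 erasure teuth_ec_profile
  ceph osd pool set cephfs_data_ec allow_ec_overwrites true
  # attach it as an additional data pool on the existing filesystem
  ceph fs add_data_pool cephfs cephfs_data_ec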
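
[2] The mountpoint switch itself is roughly just a remount of the archive directory against the new EC-backed tree. The hostname, paths, and credential names below are placeholders, not the real lab values:

  # stop writing to the old pool, then remount against the new EC-backed tree
  umount /teuthology/archive
  mount -t ceph mon1.front.sepia.ceph.com:/teuthology-ec /teuthology/archive \
      -o name=teuthology,secretfile=/etc/ceph/teuthology.secret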
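
[3] Why the 404s: in CephFS, a file's layout (and therefore which pool its objects live in) is fixed when the file is created, so pointing the directory at the new pool only affects new files. The old logs physically have to be copied to land in the EC pool. Conceptually the migration is just something like the following (paths are placeholders):

  # copying rewrites each file, so its data is re-created in the new EC pool
  rsync -a /old-pool-mount/archive/ /new-ec-mount/archive/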